Tutorials

This page gives an overview of the Parsee Python package, so that you can start exploring our document processing solution.

The full code for the tutorials can be found on GitHub.

Installation

The recommended way to install Parsee is with Poetry: https://python-poetry.org/


poetry add parsee-core

Alternatively, you can also use pip:


pip install parsee-core
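
To confirm that the package is visible to the interpreter you intend to use, a quick check along the following lines can help. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration, and the distribution name `parsee-core` is taken from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("parsee-core")
print(found if found is not None else "parsee-core is not installed in this environment")
```

Running this inside the Poetry environment (for example via `poetry run python`) checks that environment rather than the system interpreter.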

1) Basic Example

This example shows the basic functionality of Parsee. The core concepts introduced here are:

Extraction Templates, Output Types, Meta Items, Document Loaders, Models


import os

from parsee.templates.helpers import StructuringItem, MetaItem, create_template
from parsee.extraction.models.helpers import gpt_config, replicate_config
from parsee.converters.main import load_document
from parsee.extraction.run import run_job_with_single_model
from parsee.utils.enums import OutputType

# Step 1: create an extraction template
question_to_be_answered = "What is the invoice total?"
output_type = OutputType.NUMERIC

meta_currency_question = "What is the currency?"
meta_currency_output_type = OutputType.LIST # we want the model to use a pre-defined item from a list, this is basically a classification
meta_currency_list_values = ["USD", "EUR", "Other"] # any list of strings can be used here

meta_item = MetaItem(meta_currency_question, meta_currency_output_type, list_values=meta_currency_list_values)

invoice_total = StructuringItem(question_to_be_answered, output_type, meta_info=[meta_item])

# let's also define an item for the issuer of the invoice
invoice_issuer = StructuringItem("Who is the issuer of the invoice?", OutputType.ENTITY)

job_template = create_template([invoice_total, invoice_issuer])

# Step 2: define a model
# requires an API key from replicate: https://replicate.com/
replicate_api_key = os.getenv("REPLICATE_KEY")
replicate_model = replicate_config(replicate_api_key, "mistralai/mixtral-8x7b-instruct-v0.1")

# Step 3: load a document
# modify the file path here (use absolute file paths if possible); this example uses one of the example files included in this repo
file_path = "../tests/fixtures/Midjourney_Invoice-DBD682ED-0005.pdf"
document = load_document(file_path)

# Step 4: run the extraction
_, _, answers_open_source_model = run_job_with_single_model(document, job_template, replicate_model)

# let's see if some other model can also predict the right answer
open_ai_api_key = os.getenv("OPENAI_KEY") # alternatively, enter your key manually here instead of loading it from the environment (e.g. an .env file)
gpt_model = gpt_config(open_ai_api_key)

_, _, answers_gpt = run_job_with_single_model(document, job_template, gpt_model)
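
Both models above read their API key with `os.getenv`, which returns `None` when the variable is not set. A small guard can surface a missing key before any request is made. This is a sketch only: `require_env` is a hypothetical helper and not part of Parsee; the variable names are the ones used in the example above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, failing loudly if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Usage (commented out so the snippet runs without any keys configured):
# replicate_api_key = require_env("REPLICATE_KEY")
# open_ai_api_key = require_env("OPENAI_KEY")
```

Failing at this point keeps the error message close to its cause, instead of leaving it to whichever call first uses the key.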

2) Table Detection and Structuring


import os

from parsee.templates.helpers import TableItem, MetaItem, create_template
from parsee.extraction.models.helpers import gpt_config, replicate_config
from parsee.converters.main import load_document
from parsee.extraction.run import run_job_with_single_model
from parsee.utils.enums import OutputType

# TABLE EXTRACTION

# in this example we want to fully structure the data from a table
# this process is split into several parts, as in our experience this leads to the best performance
# the steps are: 1) detect the relevant table(s), 2) structure the meta info for each column, and 3) map the rows to standardized 'buckets' if needed

meta_currency_question = "What is the currency?"
meta_currency_output_type = OutputType.LIST # we want the model to use a pre-defined item from a list, this is basically a classification
meta_currency_list_values = ["USD", "EUR", "Other"] # any list of strings can be used here
meta_currency = MetaItem(meta_currency_question, meta_currency_output_type, list_values=meta_currency_list_values)

meta_date_question = "What is the date the period is ending in?"
meta_date_output_type = OutputType.DATE
meta_date = MetaItem(meta_date_question, meta_date_output_type)

table_item = TableItem("Profit & Loss Statement", "Revenues, Cost of goods sold, operating income, net profit, Financial statements", [meta_currency, meta_date])

job_template = create_template(None, [table_item])

# define a model
# requires an API key from replicate: https://replicate.com/
replicate_api_key = os.getenv("REPLICATE_KEY")
replicate_model = replicate_config(replicate_api_key, "mistralai/mixtral-8x7b-instruct-v0.1")

# load a document
# modify the file path here (use absolute file paths if possible); this example uses one of the example files included in this repo
file_path = "../tests/fixtures/bayer_filing2.pdf"
document = load_document(file_path)

# run the extraction
_, column_output, _ = run_job_with_single_model(document, job_template, replicate_model)

print(column_output)
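
The three steps described in the comments above (detect the relevant table, structure the meta info per column, map rows to buckets) can be pictured with a toy example on plain Python data. This is a conceptual illustration only and not how Parsee implements these steps: the tutorial code hands the work to a model, whereas the sketch uses naive keyword matching, and every name and sample value in it is invented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Table:
    title: str
    header: List[str]             # one label per value column
    rows: Dict[str, List[float]]  # row label -> one value per column


def detect_table(tables: List[Table], keywords: List[str]) -> Optional[Table]:
    """Step 1: pick the table whose title and row labels mention the most keywords."""
    def score(table: Table) -> int:
        text = " ".join([table.title, *table.rows]).lower()
        return sum(1 for kw in keywords if kw.lower() in text)

    best = max(tables, key=score, default=None)
    return best if best is not None and score(best) > 0 else None


def column_meta(table: Table, currency: str) -> List[Dict[str, str]]:
    """Step 2: attach meta info (currency, period end) to every column."""
    return [{"currency": currency, "period_end": label} for label in table.header]


def map_rows(table: Table, buckets: Dict[str, str]) -> Dict[str, List[float]]:
    """Step 3: rename rows to standardized buckets, dropping rows without a mapping."""
    return {buckets[label]: values for label, values in table.rows.items() if label in buckets}


tables = [
    Table("Employees by region", ["2023", "2022"], {"Europe": [10.0, 9.0]}),
    Table("Income statement", ["2023-12-31", "2022-12-31"],
          {"Net sales": [100.0, 90.0], "Cost of goods sold": [60.0, 55.0]}),
]
found = detect_table(tables, ["revenues", "cost of goods sold", "net profit"])
meta = column_meta(found, "EUR")
rows = map_rows(found, {"Net sales": "revenues", "Cost of goods sold": "cogs"})
print(found.title, meta, rows)
```

Keeping the steps separate means each one can be checked on its own, which matches the reasoning given in the comments for splitting the process.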

3) Datasets


"""
A dataset connects features (which, in the case of LLMs, are essentially the prompts) with output values,
so that a model can learn the relation between the two.
In the following we will illustrate this process for LLMs.
You can also use Parsee Cloud (at https://app.parsee.ai) to easily create datasets and label/correct data in a graphical user interface.
Once we have a dataset, we can also run comparisons between different models (see tutorial 4).
"""
from parsee.templates.helpers import StructuringItem, MetaItem, create_template
from parsee.extraction.extractor_dataclasses import AssignedAnswer, AssignedMeta
from parsee.converters.main import load_document
from parsee.datasets.main import create_dataset_rows
from parsee.datasets.writers.disk_writer import CsvDiskWriter
from parsee.utils.enums import OutputType

# Let's use the invoice example again, with the two questions: invoice total and issuer of invoice
meta_currency = MetaItem("What is the currency?", OutputType.LIST, list_values=["USD", "EUR", "Other"])
invoice_total = StructuringItem("What is the invoice total?", OutputType.NUMERIC, meta_info=[meta_currency])
invoice_issuer = StructuringItem("Who is the issuer of the invoice?", OutputType.ENTITY)
job_template = create_template([invoice_total, invoice_issuer])

# Let's use two different documents to create a dataset
first_doc = load_document("./tests/fixtures/documents/pdf/Midjourney_Invoice-DBD682ED-0005.pdf")
second_doc = load_document("./tests/fixtures/documents/pdf/INV-CF12005.pdf")

# We can assign the correct values using the AssignedAnswer and AssignedMeta classes. For these, we have to provide IDs of our questions.
# Let's start with the first document
# Let's assign a value for our invoice total question
question_id = invoice_total.id
# Looking at the document, the correct answer is '11.90'. All answers are strings here.
correct_answer = "11.90"
# The currency is USD -> here we have to use the ID of the meta item, not the 'main' question. You can also modify the IDs using the 'assigned_id' property when you create the object.
currency_assigned = AssignedMeta(meta_currency.id, "USD")
# For training a model based on a dataset, it is also better to provide the used 'sources' for each item, so that the model can also improve in returning these. For this example we will omit the sources for simplicity.
invoice_total_answer_first_doc = AssignedAnswer(question_id, correct_answer, [currency_assigned], [])
# Let's create an answer for the invoice issuer question also
invoice_issuer_answer_first_doc = AssignedAnswer(invoice_issuer.id, "Midjourney Inc", [], [])
# Let's repeat the same for the second doc
invoice_total_answer_second_doc = AssignedAnswer(invoice_total.id, "5570.40", [currency_assigned], [])
# Let's create an answer for the invoice issuer question also
invoice_issuer_answer_second_doc = AssignedAnswer(invoice_issuer.id, "CloudFactory International Limited UK", [], [])

# In our dataset, each of these assigned answers is basically one row. We can now easily create a dataset with prompts and the assigned answers.
# By default, the number of tokens is limited to 4k, but you can modify this value (this is independent of the model used).
# If you provide a source for the assigned answers, Parsee will also check that the source is still contained in the transformed document after the token limit is applied. If it is not, no row is returned (this makes sure the model only learns from samples where the answer is actually in the text and not cut off).
token_limit = 4000
dataset_rows = create_dataset_rows(job_template, first_doc, [invoice_total_answer_first_doc, invoice_issuer_answer_first_doc], max_tokens_prompt=token_limit)
# let's add the rows from the second document also
dataset_rows += create_dataset_rows(job_template, second_doc, [invoice_total_answer_second_doc, invoice_issuer_answer_second_doc], max_tokens_prompt=token_limit)
# to save the rows as CSV, we can create a dataset writer (replace the path below with a directory on your machine)
writer = CsvDiskWriter("/Users/thomasflassbeck/Desktop/temp/x")
# write the rows at the target destination as CSV
writer.write_rows(dataset_rows, "questions_invoice")

# in the next tutorial we will show how to evaluate different models on the dataset we just created
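
The comments above describe a check that drops a row when the source of an answer no longer appears in the document after the token limit is applied. The sketch below illustrates that idea on plain strings. It rests on loud assumptions: tokens are taken to be whitespace-separated words, which is very likely not the tokenization Parsee uses, `truncate_to_tokens` and `keep_row` are hypothetical names, and the sample text is invented around the values from the first invoice.

```python
from typing import Optional


def truncate_to_tokens(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens whitespace-separated words (a crude stand-in for real tokenization)."""
    return " ".join(text.split()[:max_tokens])


def keep_row(document_text: str, source: Optional[str], max_tokens: int) -> bool:
    """Return True if the row should be kept: no source was given, or the source survives truncation."""
    if source is None:
        return True
    return source in truncate_to_tokens(document_text, max_tokens)


text = "Invoice number 0005 issued by Midjourney Inc total due 11.90 USD"
print(keep_row(text, "11.90", max_tokens=20))  # the source lies within the limit
print(keep_row(text, "11.90", max_tokens=5))   # the source is cut off by the limit
```

With a check like this, lowering the token limit can silently shrink the dataset, so it may be worth comparing the number of returned rows with the number of assigned answers.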


4) Model Evaluations


"""
To evaluate a model, or to compare the performance of several models, we need a Parsee dataset.
You can create a simple dataset manually as shown in the previous tutorial.
You can also run extractions for one or several documents on Parsee Cloud (https://app.parsee.ai),
review and correct the output in a graphical user interface, and then create datasets from there.
"""
import os
from parsee.datasets.evaluation.main import evaluate_llm_performance
from parsee.datasets.readers.disk_reader import SimpleCsvDiskReader
from parsee.templates.helpers import StructuringItem, MetaItem, create_template
from parsee.extraction.models.helpers import gpt_config, replicate_config
from parsee.utils.enums import OutputType

# Let's first use the dataset we created in the previous example and run it for several models (replace the path below with the location of your own dataset)
dataset_path = "/Users/thomasflassbeck/Desktop/temp/x/dataset_cf611191-2aa6-4c7f-8e53-777d29b92634/questions_invoice.csv"

# Let's use the same extraction template (would be better of course to save it to Parsee Cloud and load from there)
meta_currency = MetaItem("What is the currency?", OutputType.LIST, list_values=["USD", "EUR", "Other"])
invoice_total = StructuringItem("What is the invoice total?", OutputType.NUMERIC, meta_info=[meta_currency])
invoice_issuer = StructuringItem("Who is the issuer of the invoice?", OutputType.ENTITY)
job_template = create_template([invoice_total, invoice_issuer])

# let's create a dataset reader
reader = SimpleCsvDiskReader(dataset_path)

# let's define the models we want to evaluate
open_ai_api_key = os.getenv("OPENAI_KEY") # alternatively, enter your key manually here instead of loading it from the environment (e.g. an .env file)
gpt_model = gpt_config(open_ai_api_key)
replicate_api_key = os.getenv("REPLICATE_KEY")
replicate_model = replicate_config(replicate_api_key, "mistralai/mixtral-8x7b-instruct-v0.1")
replicate_model2 = replicate_config(replicate_api_key, "mistralai/mistral-7b-v0.1")
replicate_model3 = replicate_config(replicate_api_key, "mistralai/mistral-7b-instruct-v0.2")

# let's run predictions with several models
performance = evaluate_llm_performance(job_template, reader, [replicate_model, replicate_model2, replicate_model3])

print(performance)
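
The dataset path in this tutorial is hard-coded and contains a generated `dataset_<id>` folder name, so it has to be adjusted by hand for every newly written dataset. One way around that is to look the file up. The sketch below assumes, based only on the path shown above, that the CSV ends up at `<target directory>/dataset_<id>/<name>.csv`; `latest_dataset_csv` is a hypothetical helper and not a Parsee function.

```python
from pathlib import Path
from typing import Optional


def latest_dataset_csv(base_dir: str, name: str) -> Optional[Path]:
    """Return the most recently modified <base_dir>/dataset_*/<name>.csv, or None if there is none."""
    candidates = list(Path(base_dir).glob(f"dataset_*/{name}.csv"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

Calling `latest_dataset_csv(output_dir, "questions_invoice")` with the writer's target directory as `output_dir`, and wrapping the result in `str(...)`, would give a string path like the one passed to the reader above.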