Open Source Framework for Data Extraction and Structuring

Use LLMs and other custom models to transform PDFs, HTML files, and images into fully structured data. Full support for multimodal queries.

Open Source framework for data structuring
Create extraction templates and run jobs locally
Fully hosted version and visual user interface, no coding knowledge required

Core Features

Parsee aims to be a simple, opinionated framework for easily extracting and structuring data from the most common sources of unstructured data.

You can start extracting data instantly by using LLMs, which can serve as "universal" models. Once you have extracted some data, you can create datasets and train custom, task-specific AI models that might outperform LLMs in accuracy and cost-efficiency.

Extraction Templates
Define type-safe extraction templates that guarantee your data is parsed into exactly the format you require.
Model Agnostic
Use any LLM (ChatGPT, open source models, etc.) or even non-LLM models side by side, depending on the task.
Create Datasets
Create datasets to train or evaluate models, either LLMs or custom models tailored to your task.
Compare Models
Compare the performance of a range of LLMs and custom models on your datasets.
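To make the idea of comparing models on a dataset concrete, here is a minimal sketch using only the Python standard library. It does not use the Parsee API: the `accuracy` and `compare_models` helpers, the two stand-in "models" and the tiny dataset are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical labelled dataset: (document text, expected answer) pairs.
Dataset = List[Tuple[str, str]]
Model = Callable[[str], str]


def accuracy(model: Model, dataset: Dataset) -> float:
    """Share of examples where the model's answer matches the label exactly."""
    if not dataset:
        return 0.0
    correct = sum(1 for text, expected in dataset if model(text) == expected)
    return correct / len(dataset)


def compare_models(models: Dict[str, Model], dataset: Dataset) -> Dict[str, float]:
    """Score every model on the same dataset so the results are comparable."""
    return {name: accuracy(fn, dataset) for name, fn in models.items()}


# Two stand-in "models": plain functions, not real LLM calls.
def last_token_model(text: str) -> str:
    return text.split()[-1]


def first_token_model(text: str) -> str:
    return text.split()[0]


dataset = [("Invoice total: 11.9", "11.9"), ("Amount due 42.0", "42.0")]
scores = compare_models(
    {"last_token": last_token_model, "first_token": first_token_model},
    dataset,
)
print(scores)
```

Exact-match accuracy is only one possible metric; for numeric answers a tolerance-based comparison may be a better fit.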

Parsee Cloud

Easily create and share extraction templates on Parsee Cloud, run extraction jobs, and share the results with your team members. No technical knowledge is required. Billing is usage-based and fully transparent, and you start with $5 of free credits.

Extraction Templates Example

Create type-safe extraction templates with ease and save them to Parsee Cloud. Alternatively, you can create the templates in the visual editor on Parsee Cloud and load them locally, too.


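# Import paths are assumed from the Parsee tutorials; check them against your installed version
from parsee.templates.helpers import StructuringItem, MetaItem, create_template
from parsee.utils.enums import OutputType
from parsee.cloud.api import ParseeCloud

# Define questions in free text form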
question = "What is the invoice total?"

# Define output types
output_type = OutputType.NUMERIC

# Define meta items: information that is associated with the main question we are asking, such as time periods, currencies, units etc.
meta_currency = "What is the currency?"
meta_currency_output_type = OutputType.LIST
meta_item = MetaItem(meta_currency, meta_currency_output_type, list_values=["USD", "EUR", "Other"])

invoice_total = StructuringItem(question, output_type, meta_info=[meta_item])

job_template = create_template([invoice_total])

# Optional: save template to Parsee Cloud
cloud = ParseeCloud("YOUR_KEY")
template_id = cloud.save_template(job_template)

# Or load any template from Parsee Cloud (templates are shareable in your organisation)
template = cloud.get_template(template_id)

Find the template defined in this example on Parsee Cloud
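As a rough illustration of what a type-safe template buys you, the sketch below coerces a raw model reply into the declared output type and rejects anything that does not fit. It is not Parsee's implementation: `AnswerType` and `parse_answer` are hypothetical names, and the number parsing is deliberately naive (it treats commas as thousands separators).

```python
import re
from enum import Enum
from typing import List, Optional, Union


class AnswerType(Enum):
    """Hypothetical stand-in for an output type; not Parsee's OutputType."""
    NUMERIC = "numeric"
    LIST = "list"


def parse_answer(
    raw: str,
    answer_type: AnswerType,
    list_values: Optional[List[str]] = None,
) -> Union[float, str]:
    """Coerce a raw model reply into the declared type or raise ValueError."""
    text = raw.strip()
    if answer_type is AnswerType.NUMERIC:
        # Naive assumption: commas are thousands separators and can be dropped.
        match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        if match is None:
            raise ValueError(f"no number found in {raw!r}")
        return float(match.group())
    if answer_type is AnswerType.LIST:
        # Only values from the allowed list are accepted (case-insensitive).
        for value in list_values or []:
            if value.lower() == text.lower():
                return value
        raise ValueError(f"{raw!r} is not one of {list_values}")
    raise ValueError(f"unsupported type {answer_type}")


total = parse_answer("Total: $1,011.90", AnswerType.NUMERIC)
currency = parse_answer(" usd ", AnswerType.LIST, ["USD", "EUR", "Other"])
print(total, currency)
```

Raising on a mismatch, instead of returning the raw string, is the design choice that keeps downstream code from ever seeing a value of the wrong type.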

Load documents and run jobs locally or in Parsee Cloud

Extract structured data easily with the Parsee Document Loaders and a wide selection of pre-defined models:


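# Import paths are assumed from the Parsee tutorials; check them against your installed version
import os

from parsee.converters.main import load_document
from parsee.extraction.models.helpers import replicate_config
from parsee.extraction.run import run_job_with_single_model

# Use the default loader to determine the document type automatically: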
document = load_document("../tests/fixtures/Midjourney_Invoice-DBD682ED-0005.pdf")

# If your use case requires it, you can also build a custom document converter

# Define a model; for example, you can use any open source model hosted on Replicate: https://replicate.com/
replicate_api_key = os.getenv("REPLICATE_KEY")
replicate_model = replicate_config(replicate_api_key, "mistralai/mixtral-8x7b-instruct-v0.1")

# Run the extraction using an extraction template (see steps above)
_, _, answers = run_job_with_single_model(document, job_template, replicate_model)

answers[0].class_value
>> 11.9
answers[0].meta[0].class_value
>> 'USD'
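Once answers are available, they usually need to be flattened into plain records for storage or export. The sketch below shows one way this could look. It assumes only the two attributes used above, `class_value` and `meta`; the `Answer` dataclass is a hypothetical stand-in for Parsee's answer objects, and `to_record` and the field names are made up for this example.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Answer:
    """Hypothetical stand-in that mirrors only class_value and meta."""
    class_value: Any
    meta: List["Answer"] = field(default_factory=list)


def to_record(answer: Answer, name: str, meta_names: List[str]) -> Dict[str, Any]:
    """Flatten one answer and its meta items into a single dictionary."""
    record = {name: answer.class_value}
    for meta_name, meta_answer in zip(meta_names, answer.meta):
        record[meta_name] = meta_answer.class_value
    return record


# Values taken from the invoice example above.
answers = [Answer(11.9, meta=[Answer("USD")])]
record = to_record(answers[0], "invoice_total", ["currency"])
print(json.dumps(record))
```

The resulting dictionary can be written to JSON or CSV, or collected into a list to form a dataset.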

Full tutorials can be found on GitHub.