Open Source Framework for Data Extraction and Structuring
Use LLMs and other custom models to transform PDFs, HTML files, and images into fully structured data. Full support for multimodal queries.
Core Features
Parsee aims to be a simple, opinionated framework for easily extracting and structuring data from the most common sources of unstructured data.
You can start extracting data right away by using LLMs, which can serve as "universal" models. Once you have extracted some data, you can create datasets and train custom, task-specific AI models that might outperform LLMs in accuracy and cost-efficiency.
Parsee Cloud
Extraction Templates Example
Create type-safe extraction templates with ease and save them to Parsee Cloud. Alternatively, you can create the templates in the visual editor on Parsee Cloud and load them locally, too.
# Define questions in free text form
question = "What is the invoice total?"
# Define output types
output_type = OutputType.NUMERIC
# Define meta items: information that is associated with the main question we are asking, such as time periods, currencies, units etc.
meta_currency = "What is the currency?"
meta_currency_output_type = OutputType.LIST
meta_item = MetaItem(meta_currency, meta_currency_output_type, list_values=["USD", "EUR", "Other"])
invoice_total = StructuringItem(question, output_type, meta_info=[meta_item])
job_template = create_template([invoice_total])
# Optional: save template to Parsee Cloud
cloud = ParseeCloud("YOUR_KEY")
template_id = cloud.save_template(job_template)
# Or load any template from Parsee Cloud (templates are shareable in your organisation)
template = cloud.get_template(template_id)
Load documents and run jobs locally or in Parsee Cloud
Extract structured data easily with the Parsee Document Loaders and a wide selection of pre-defined models:
# Use default loader to determine the document type automatically:
document = load_document("../tests/fixtures/Midjourney_Invoice-DBD682ED-0005.pdf")
# You can also build a custom document converter if your use case needs one.
# Define a model. For example, you can use any of the open source models hosted on Replicate: https://replicate.com/
import os  # needed for os.getenv below
replicate_api_key = os.getenv("REPLICATE_KEY")
replicate_model = replicate_config(replicate_api_key, "mistralai/mixtral-8x7b-instruct-v0.1")
# run the extraction using an extraction template (see steps above)
_, _, answers = run_job_with_single_model(document, job_template, replicate_model)
answers[0].class_value
>> 11.9
answers[0].meta[0].class_value
>> 'USD'
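Once a job has run, it is often convenient to flatten each answer and its meta values into a plain record, for example before writing rows to a CSV file or a dataset. The following is a minimal sketch of that idea, not part of Parsee itself. The `Answer` and `MetaAnswer` classes are hypothetical stand-ins that mirror only the two attributes used in the example above (`class_value` and `meta`), and `answer_to_record` is an illustrative helper name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


# Hypothetical stand-ins that mirror only the attributes used above
# (`class_value` and `meta`); they are not Parsee's real classes.
@dataclass
class MetaAnswer:
    class_value: Any


@dataclass
class Answer:
    class_value: Any
    meta: List[MetaAnswer] = field(default_factory=list)


def answer_to_record(answer: Answer, name: str, meta_names: List[str]) -> Dict[str, Any]:
    """Flatten one answer and its meta values into a plain dictionary."""
    record: Dict[str, Any] = {name: answer.class_value}
    # Pair each meta value with the label supplied by the caller, in order.
    for meta_name, meta in zip(meta_names, answer.meta):
        record[meta_name] = meta.class_value
    return record


# Values taken from the example output above.
answers = [Answer(11.9, [MetaAnswer("USD")])]
record = answer_to_record(answers[0], "invoice_total", ["currency"])
print(record)
```

With real results, the same helper could be applied to each element of the `answers` list returned by `run_job_with_single_model`, assuming those objects expose `class_value` and `meta` as shown in the example above.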