Comparing Parsee Document Loader vs. Langchain Document Loaders for PDFs
With the datasets in this folder we want to test how the results of an LLM extracting structured data from invoices differ across different document loaders.
Both datasets have their own READMEs with more information about the methodology, the notebooks used to create the datasets, and the evaluation results:
1. Invoice Dataset - Langchain Loader
parsee-core version used: 0.1.3.11
This dataset was created on the basis of 15 sample invoices (PDF files).
All PDF files are publicly accessible on parsee.ai. To access them, copy the "source_identifier" (first column) and paste it into this URL (replacing '{SOURCE_IDENTIFIER}' with the actual identifier):
https://app.parsee.ai/documents/view/{SOURCE_IDENTIFIER}
So for example:
https://app.parsee.ai/documents/view/1fd7fdbd88d78aa6e80737b8757290b78570679fbb926995db362f38a0d161ea
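For convenience, here is a minimal Python sketch of the URL substitution described above (the helper function name is illustrative, not part of parsee-core):

```python
# Minimal sketch: build the public Parsee Cloud URL for a given source identifier.
BASE_URL = "https://app.parsee.ai/documents/view/{SOURCE_IDENTIFIER}"

def document_url(source_identifier: str) -> str:
    # Replace the placeholder with the actual identifier from the dataset's first column
    return BASE_URL.replace("{SOURCE_IDENTIFIER}", source_identifier)

print(document_url("1fd7fdbd88d78aa6e80737b8757290b78570679fbb926995db362f38a0d161ea"))
```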
The invoices were selected randomly and are in either German or English.
The following code was used to create the dataset: jupyter notebook
The correct answers for each row were loaded from Parsee Cloud, where they were checked by a human and corrected prior to running this code.
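To illustrate the loading path being compared against Parsee, here is a minimal sketch of how a PDF's text can be obtained with the langchain PyPDF loader; the file name is a placeholder, and this is not the exact code from the dataset-creation notebook:

```python
# Sketch: extract the text that would be passed to the LLM using the langchain PyPDF loader.
# "invoice.pdf" is a placeholder file name, not part of the dataset.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("invoice.pdf")
pages = loader.load()  # one Document per PDF page

# Concatenate the page texts into a single context string for the prompt
full_text = "\n".join(page.page_content for page in pages)
print(full_text[:500])
```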
1.1 LLM Evaluation
For the evaluation we are using the mistralai/mixtral-8x7b-instruct-v0.1 model from Replicate.
The results of the evaluation can be found here: jupyter notebook
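As a rough illustration of how the model can be queried, here is a minimal sketch using the Replicate Python client. The prompt and input parameters are placeholders and may differ from what the evaluation notebook actually uses:

```python
# Sketch: query mixtral-8x7b-instruct via the Replicate Python client.
# Requires the REPLICATE_API_TOKEN environment variable to be set.
import replicate

output = replicate.run(
    "mistralai/mixtral-8x7b-instruct-v0.1",
    input={
        "prompt": "Extract the invoice total from the following text: ...",  # placeholder prompt
        "max_new_tokens": 256,
    },
)
# The completion is returned as chunks of text (format may vary by model)
print("".join(output))
```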
1.2 Result
Even though the Parsee PDF reader was not initially designed for invoices (which often contain fragmented text pieces and tables that are difficult to structure properly), it still outperforms the langchain PyPDF reader, with a total accuracy of 88% vs. 82% for the langchain reader.
2. Revenues Dataset - Parsing Tables
This dataset consists of 15 pages from annual/quarterly reports of German companies (PDF files); the filings themselves are in English.
The goal is to evaluate two things:
How well can a state-of-the-art LLM retrieve complex structured information from the documents?
How does the Parsee.ai document loader fare against the langchain PyPDF loader for this document type?
We are using the Claude 3 Opus model for all runs here, as it was the most capable model in our prior experiments (beating GPT-4).
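For reference, a minimal sketch of how Claude 3 Opus can be queried through the Anthropic Python client; the prompt is a placeholder and this is not the exact setup used in the notebooks:

```python
# Sketch: query Claude 3 Opus via the Anthropic Python client.
# Requires the ANTHROPIC_API_KEY environment variable to be set.
import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Extract all revenue figures from the following report page: ..."}  # placeholder
    ],
)
print(message.content[0].text)
```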
2.1 Result
Explanation of the result metrics (a sketch of how they might be computed follows this list):
- Completeness: Measures how often the model returned the expected number of answers. For example, for this file there are 5 columns containing a "Revenue" figure, so we expect the model to return 5 different "answers", each with one of the revenue figures (you can see these in the "Extracted Data" tab on Parsee Cloud).
- Revenues Correct: How many times the model extracted a valid "Revenues" figure. Completely missing answers are counted here as well (so this accounts for both wrong and missing answers).
- Revenues Correct (excluding missing answers): Disregards the cases where the model did not extract the figure at all; in other words, if the model extracted a figure (matched based on the meta information), was it the correct number?
- Meta Items Correct: How many times the model extracted all the expected meta information (time periods, currencies, etc.); missing answers are counted here as well.
- Meta Items Correct (excluding missing answers): If the model found a valid revenues number, how many times was all the meta information attached to it correct? (This does not count the cases where the answer was missing entirely.)
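The following is a hypothetical Python sketch of how these metrics could be computed. The data structures, field names, and the pairing-by-position simplification are assumptions for illustration, not the actual evaluation code from the notebook:

```python
# Hypothetical sketch of the metric definitions above; not the actual Parsee evaluation code.
from dataclasses import dataclass

@dataclass
class ExtractedAnswer:
    value: float | None  # the revenue figure the model returned (None = missing)
    meta: dict           # e.g. {"period": "FY2023", "currency": "EUR"}

def evaluate(expected: list[ExtractedAnswer], predicted: list[ExtractedAnswer]) -> dict:
    # Completeness: did the model return the expected number of answers?
    completeness = len(predicted) == len(expected)

    # Pair expected and predicted answers. The real evaluation matches on meta
    # information; pairing by position is a simplification for this sketch.
    pairs = list(zip(expected, predicted))
    found = [(exp, pred) for exp, pred in pairs if pred.value is not None]

    return {
        "completeness": completeness,
        # missing answers simply never count as correct, so they lower these scores
        "revenues_correct": sum(1 for exp, pred in pairs
                                if pred.value is not None and pred.value == exp.value),
        "revenues_correct_excl_missing": sum(1 for exp, pred in found
                                             if pred.value == exp.value),
        "meta_correct": sum(1 for exp, pred in pairs
                            if pred.value is not None and pred.meta == exp.meta),
        "meta_correct_excl_missing": sum(1 for exp, pred in found
                                         if pred.meta == exp.meta),
    }
```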