finRAG Datasets & Study
By Thomas Flassbeck (Twitter | Linkedin) from Parsee.ai.
finRAG Datasets
We are publishing 3 datasets:
“Selection-text”: this dataset contains only the relevant profit & loss statement with the revenue numbers that we are looking for. It can be considered our “base-case”, as extracting the revenue numbers from this table only should be the easiest (8.5M tokens).
“RAG-text”: this dataset simulates a real-world RAG application: we chunk the original document into pieces, perform a vector search based on the question we want to answer, and present the LLMs with the most relevant chunks. We cut off all prompts at 8k tokens for this exercise; if the relevant table was not contained in the prompt, we inserted it at the “first position” to simulate a “happy path” for the vector search, as the goal of this study is not to examine how well vector search works, but rather to focus on the capabilities of the LLMs when we can guarantee that all the information required to solve the task is presented to the model (28M tokens).
“Selection-image”: this dataset is similar to the “Selection-text” dataset in the sense that we feed the models only an image of the relevant profit & loss statement, which contains all the necessary information to solve the problem (1M tokens and 1,156 images).
All data can be found both on Github and Huggingface.
Major Results:
If we feed the models just the relevant table in text form (“selection-text” dataset), which contains all the required information to solve the problem, many of the state-of-the-art LLMs we tested can solve the task with near-100% accuracy, although some struggle even with this simplest exercise (notably Databricks DBRX and Snowflake Arctic, the latter being by far the worst model we tested for our task).
RAG is still a major problem for ALL models: even at the relatively small context size of 8k tokens used for the “RAG-text” dataset, performance drops significantly for all models, by 10-30% compared to the “selection-text” dataset.
State-of-the-art vision models (Claude 3 and ChatGPT 4 Vision) perform much worse than their pure-text counterparts, achieving only around 60% accuracy on the “selection-image” dataset, a drop of almost 40% compared to the “selection-text” dataset.
Looking at the results of the “Selection-text” dataset, the smaller models in particular (e.g. Llama 3 70B, Command R Plus) show a significant drop in performance with a more complex prompt (and a more complex expected answer) compared to a more precisely posed question (and a simpler answer), whereas the more complex prompt and expected answer were not a problem for the leading proprietary models (Claude 3, Mistral Large and ChatGPT 4). So for smaller models it is still important to keep the task as “easy” as possible, and it is better to use multiple prompts rather than trying to do everything in one prompt/request.
Introduction
The market for “enterprise-grade” (M)LLMs is gaining new competitors on a weekly basis, and models are being marketed as already excellent at solving real-world problems for a variety of use-cases. We decided to put the current state-of-the-art (M)LLMs to the test and evaluate their capabilities in the domain of simple data extraction from financial reports. We imagined the use-case of a financial analyst wanting to retrieve the revenue numbers of a company, based on its most recent annual report. This task typically involves finding the profit & loss statement inside the report, finding the row with “Total Revenues” (or a similarly named row), and then returning the number from one or all columns, depending on the task (whether all numbers should be extracted or just the numbers for a single year, e.g. 2023). We also always need to return a unit (thousands, millions, billions or “none”) and a currency, as this information is also crucial to being able to work with the numbers.
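To make the target output concrete, an extracted answer for a single year could look roughly like the following (an illustrative structure with made-up field names, not the exact schema of our extraction templates):

```python
# Illustrative target output for a 2023 revenue extraction (hypothetical field names).
expected_answer = {
    "main_question": 104,        # revenue figure as reported in the profit & loss statement
    "unit": "millions",          # thousands, millions, billions or "none"
    "currency": "EUR",
    "time_period_months": 12,
    "period_ending": "2023-12-31",
}
```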
This task can be considered “easy” for a human, and anyone with basic financial knowledge will be able to solve the task with near 100% accuracy given the annual report of the company. We wanted to evaluate the models in-depth on a relatively easy task, in order to fully understand what they are capable of and where their current weaknesses lie.
Methodology
We created two versions of the data: in V1 we explored the best prompts for our task, and in V2 we only proceeded with the “winning” prompts from V1. For both versions (V1 and V2) we created 3 different datasets:
“Selection-text”: this dataset contains only the relevant profit & loss statement with the revenue numbers that we are looking for. It can be considered our “base-case”, as extracting the revenue numbers from this table only should be the easiest.
“RAG-text”: this dataset simulates a real-world RAG application: we chunk the original document into pieces, perform a vector search based on the question we want to answer, and present the LLMs with the resulting chunks. We cut off all prompts at 4k tokens for the V1 dataset and at 8k tokens for the V2 dataset; if the relevant table was not contained in the prompt, we inserted it at the “first position” to simulate a “happy path” for the vector search (see the sketch after this list), as the goal of this study is not to examine how well vector search works, but rather to focus on the capabilities of the LLMs when we can guarantee that all the information required to solve the task is presented to the model.
“Selection-image”: this dataset is similar to the “Selection-text” dataset in the sense that we feed the models only an image of the relevant profit & loss statement, which contains all the necessary information to solve the problem.
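The following is a rough sketch of how the “happy path” insertion for the RAG-text prompts can be thought of (all names are our own illustration, not the actual dataset-generation code; the token limit was 4k for V1 and 8k for V2):

```python
# Sketch of the "happy path" prompt construction for the RAG-text dataset
# (illustrative only; not the code used to generate the published datasets).
def build_rag_prompt(chunks_by_relevance: list[str], relevant_table: str,
                     count_tokens, token_limit: int = 8_000) -> str:
    """Pack the most relevant chunks into the prompt up to the token limit;
    if the relevant P&L table did not make the cut, force it into first position."""
    selected, used = [], 0
    for chunk in chunks_by_relevance:
        cost = count_tokens(chunk)
        if used + cost > token_limit:
            break
        selected.append(chunk)
        used += cost
    if relevant_table not in selected:
        # guarantee that the information needed to solve the task is present
        selected.insert(0, relevant_table)
        while len(selected) > 1 and sum(count_tokens(c) for c in selected) > token_limit:
            selected.pop()  # drop the least relevant chunk to stay under the limit
    return "\n\n".join(selected)

# Toy usage with a whitespace "tokenizer", just to illustrate the mechanics:
prompt = build_rag_prompt(["chunk about risk factors", "chunk about segments"],
                          "CONSOLIDATED STATEMENTS OF OPERATIONS ...",
                          count_tokens=lambda s: len(s.split()))
```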
For both V1 and V2 datasets, we set up separate “extraction templates”, using the parsee-core Python library (more info about the extraction templates can be found here). The parsee-core library enables us to create prompts (based on the questions we are asking, more on these below) which have been battle-tested, and it also makes the evaluation of the results easy, as all answers of the models are parsed into a format that allows for direct comparison with our manually “assigned” labels. The evaluation methods in the parsee-core library also don’t just assign a right or wrong value to an answer, but rather allow for more subtle comparisons, as each answer is scored based on the correctness of the “main” answer as well as the meta info returned (time periods, currencies, units in our case). For example, let’s say we want to extract the 2023 revenue numbers from a financial report and the revenues of the company are EUR 104M for 2023. If the model returns the 104M but says the currency is USD, while actually the currency is EUR, we still count the “main” answer as correct (as the correct number was returned), but the “meta” score will be 0 for this question. For our study, we chose a weight of ¾ for the “main” question and ¼ for the meta items, so in total the model would have a score of 75% for this question, given that the wrong currency was returned but the number was correct.
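As a rough sketch, the ¾ / ¼ weighting can be thought of like this (illustrative only, not the parsee-core implementation):

```python
# Sketch of the weighted scoring described above (illustrative, not parsee-core code).
MAIN_WEIGHT = 0.75   # weight of the "main" answer (the revenue number itself)
META_WEIGHT = 0.25   # weight of the meta items (unit, currency, time period, ...)

def question_score(main_correct: bool, meta_items_correct: list[bool]) -> float:
    """Combine the correctness of the main answer and the meta items into one score."""
    meta_score = sum(meta_items_correct) / len(meta_items_correct) if meta_items_correct else 1.0
    return MAIN_WEIGHT * float(main_correct) + META_WEIGHT * meta_score

# Example from the text: correct number (EUR 104M) but wrong currency -> 75%
print(question_score(main_correct=True, meta_items_correct=[False]))  # 0.75
```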
The full prompts contain by default an example, so this is basically a one-shot exercise.
V1 Dataset
Setup
We created a first version of the dataset where we explored which prompt format works best for our task of extracting the revenue figures from the original filings. The V1 dataset therefore contained 6 different “questions”, worded slightly differently and with different expected output values. For instance, we wanted to know whether, if we asked the model to return the number in a specific unit, it would be able to do so (adding zeros or removing decimal places as necessary).
The questions of the V1 dataset were the following:
What are the revenues for the 12 month period ending December 2023 (in thousand USD)? (ID: rev23_thousands_no_hint)
What are the revenues for the 12 month period ending December 2023 (in thousand USD)? Please double check your answer and make sure the unit of the number is in thousands, format the number if necessary (add or remove digits). (ID: rev23_thousands_hint)
What are the revenues for the 12 month period ending December 2023 (in million USD)? (ID: rev23_millions_no_hint)
What are the revenues for the 12 month period ending December 2023? [meta items: unit, currency] (ID: rev23_meta)
What are the revenues for the 12 month period ending December 2022 (in million USD)? (ID: rev22_millions_no_hint)
What are the revenues of the company? [meta items: unit, currency, time period in months, period ending date] (ID: rev_meta)
Note: the actual prompts incorporate these “questions” but are of course more complex, you can see the full prompts in the dataset files.
The full extraction template can be seen here visually also (requires free log-in).
For questions 1-3 (rev23_thousands_no_hint, rev23_thousands_hint, rev23_millions_no_hint) we are asking for the revenue numbers of a specific year (2023) and in a specific unit (thousands or millions). For question 2 we are also giving a hint that the number might have to be adjusted in order to return it in the requested format.
For question 4 we are not asking for a specific unit and currency but rather want to extract the unit and currency as meta-info (this is done with meta items in a Parsee extraction template).
For question 5 we are asking for the 2022 revenues instead of the 2023 numbers; as all reports are from 2023, this number should be a little “harder” to find, since it is usually not in the first column.
For question 6 (rev_meta) we did not specify a unit or specific year, but rather want to extract all the available revenue figures and their units, time periods etc.
Results
We ran the extraction using the parsee-core library for a selection of 50 annual reports and with a token limit of 4k for all datasets. The 4k tokens are only relevant for the RAG-text dataset, as the other 2 datasets should never touch the 4k token limit anyway.
For the evaluation of the results we also used the parsee-core library, but added a custom function for comparing the results, which scales the predicted number by the unit chosen by the model and checks that the model prediction and the correct value do not differ by more than 0.1% (full code is on Github). So for example, if a model replies (simplified): {“unit”: “millions”, “main_question”: 1400}
and the “assigned” (correct) value is {“unit”: “thousands”, “main_question”: 1399000}, this would still evaluate as “true”, as we compare 1,400,000,000 with 1,399,000,000 and the difference is only 0.07%. As discussed before, the total scores always also consider the “meta items” if present (see the beginning of the “Methodology” section).
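To make the comparison logic concrete, here is a minimal sketch of such a compare function (assumed names and structure; the actual implementation is in the linked GitHub code):

```python
# Minimal sketch of the tolerance-based number comparison (assumed names and structure).
UNIT_MULTIPLIERS = {
    "none": 1, "thousands": 1_000, "millions": 1_000_000, "billions": 1_000_000_000,
}

def numbers_match(predicted: float, predicted_unit: str,
                  correct: float, correct_unit: str,
                  tolerance: float = 0.001) -> bool:
    """Scale both values by their stated unit and accept differences of at most 0.1%."""
    pred_abs = predicted * UNIT_MULTIPLIERS[predicted_unit]
    corr_abs = correct * UNIT_MULTIPLIERS[correct_unit]
    if corr_abs == 0:
        return pred_abs == 0
    return abs(pred_abs - corr_abs) / abs(corr_abs) <= tolerance

# Example from the text: 1400 (millions) vs. 1,399,000 (thousands) differ by ~0.07% -> True
print(numbers_match(1400, "millions", 1_399_000, "thousands"))
```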
The results are summarized in a Jupyter Notebook.
Summary of results by dataset and task:
Looking at the “Selection-text” dataset only (in the table), we can clearly see that the question with the highest accuracy for almost all models is “rev23_meta”. Only for Llama 3 70B is rev23_meta not the highest-ranking question. Surprisingly, Llama 3 also beats Claude 3 and ChatGPT 4 on the rev22_millions_no_hint question. As Llama 3 is the only “outlier” here, we decided to proceed with the rev23_meta question, which does not ask for a specific unit but rather requires the model to extract the correct unit and return it as part of the answer. It also seems that even state-of-the-art models like Claude 3 or ChatGPT 4 struggle to return the right unit if the requested unit differs from the one present in the document. For most filings the unit is millions, so asking for the answer in thousands (e.g. rev23_thousands_no_hint) leads to an accuracy of only ~80% for Claude 3 and ChatGPT 4, almost 20% lower than for the rev23_meta question, which does not ask for a specific unit.
We are not going to discuss the results of the V1 run further, as we adjusted the extraction template to only use questions in the format of rev23_meta, as opposed to asking for a specific unit. There are also some smaller edge cases in the data for the V1 run (basically some pro-forma restatements which can lead to two possible numbers for the same time period), which led to overall lower performance than on the V2 dataset, especially for the Selection-text dataset. In the V2 run we made sure that no such cases are present.
V2 Dataset
Setup
The V2 dataset is the one we uploaded in full also to Github and Huggingface (with 1k+ annual reports).
In the V2 dataset we are asking the following three questions; two are taken from V1 and one new one was added to improve the comparability of the results:
What are the revenues of the company? [meta items: unit, currency, time period in months, period ending date] (ID: rev_meta)
What are the revenues for the 12 month period ending December 2023? [meta items: unit, currency] (ID: rev23_meta)
What are the revenues for the 12 month period ending December 2022? [meta items: unit, currency] (ID: rev22_meta) [NEW]
Note: the actual prompts incorporate these “questions” but are of course more complex, you can see the full prompts in the dataset files.
The full extraction template can be seen here visually also (requires free log-in).
Models Used
For the V2 run we compared the following models:
Claude 3 Opus (id: claude-3-opus-20240229)
ChatGPT 4 (id: gpt-4-1106-preview)
Llama 3 70B via replicate (id: meta/meta-llama-3-70b-instruct)
Mixtral 8x22B Instruct via together.ai (id: mistralai/Mixtral-8x22B-Instruct-v0.1)
Mistral Large (id: mistral-large-latest)
Databricks DBRX Instruct via together.ai (id: databricks/dbrx-instruct)
Cohere Command R Plus via cohere.com (id: command-r-plus)
Snowflake Arctic Instruct (id: Snowflake/snowflake-arctic-instruct)
Results
We ran the extraction using the parsee-core library for a selection of 100 annual reports and with a token limit of 8k for all datasets. The 8k tokens are only relevant for the RAG-text dataset, as the other 2 datasets should never touch the 8k token limit anyway.
For the evaluation of the results we again use the custom compare function, as discussed in the V1 results.
The results are summarized in a Jupyter Notebook.
Summary of results by dataset and task:
Selection-text Dataset
We can see that 3 of the models can solve the exercise in theory with almost 100% accuracy (green bars in the chart). These models are: Claude 3 Opus, Mistral Large and ChatGPT 4. Next comes Llama 3 as the best model with open and available weights, with more than 80% accuracy. From there, accuracy gradually declines, reaching a low of ~30% for Snowflake Arctic.
RAG-text Dataset
For all models we see a significant drop in performance when the 8k context is fully used. The drop is less noticeable for some models (around 10% for Claude 3 and Command R Plus), but other models deteriorate heavily when additional content is added to the prompt. Mistral Large’s accuracy drops by almost 20%, and Llama 3’s accuracy drops by more than 30%.
Selection-image Dataset
For the Selection-image dataset we can also see that the vision models are not up to par with the pure-text models, as the accuracy for both Claude 3 and ChatGPT 4 is almost halved compared to the pure-text case.
Conclusion
No model can achieve a total accuracy of over 90% on this “relatively simple” task (on the RAG-text dataset, as this is the closest to the real-world use-case we wanted to test), with Claude 3 Opus being the leader at 87%, followed by ChatGPT 4 at 82%. The only other model that managed to achieve relatively good results is Mistral Large, with an overall score of 76%. Command R Plus is barely suited for our task, with an overall score of 67% and high variance depending on how the question is asked. Llama 3 gives good results when data from a single column is supposed to be retrieved, but scores the lowest of all models in extracting all columns correctly (only 13% accuracy), which drags down its overall score to 51%. DBRX Instruct is clearly not suited for RAG in the case we tested, with an overall score of 25%. Snowflake Arctic, though, is by far the worst model we tested: even on the easiest dataset (Selection-text), where state-of-the-art models score close to 100%, it only achieved a score of 28% (vs. 53% for DBRX Instruct, the second-worst model in our tests), making it basically unusable for our test-case.
Does this mean that LLMs are not ready for the enterprise yet? Well, it depends. For our task (precise data extraction), training any model (not limited to LLMs) to 90+% accuracy is not easy at all, so having a model that can almost achieve 90% accuracy out of the box is already quite a feat. Then again, the task can be solved by an intern with very basic financial knowledge with 100% accuracy, so we are also still far away from AGI levels. There is also the problem of “finding” the necessary piece of information, as the 90% accuracy does not factor in cases where the vector search (or any other search method) does not retrieve the correct chunks. This can of course partly be mitigated by larger model contexts, but as we can see by comparing the RAG-text results with the Selection-text results, increasing the context still leads to a significant drop in performance for all models. It would be interesting for further research to compare the results depending on the number of tokens used in more detail (e.g. 8k vs. 16k vs. 100k vs. 200k). In a future study we will also examine the costs associated with running the models and incorporate that into our results.