Data Extraction
finRAG Datasets & Study
We wanted to investigate how well current state-of-the-art (M)LLMs solve the relatively simple problem of extracting revenue figures from publicly available financial reports. To test this, we created three datasets, all based on the same random sample of 1,156 annual reports for 2023 from publicly listed US companies.
The resulting datasets contain a combined total of 10,404 rows, 37,536,847 tokens and 1,156 images.
For our study, we evaluate eight state-of-the-art (M)LLMs on a subset of 100 of these reports.
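As a rough illustration of this setup, the sketch below draws a reproducible 100-report subset from 1,156 report IDs and scores extracted revenue figures against reference values. It is not the study's actual code. The ID format, the seed, the use of uniform random sampling, and the exact-match metric are all assumptions made for the example, and the helper names `sample_report_ids` and `exact_match_accuracy` are hypothetical.

```python
import random


def sample_report_ids(report_ids, k=100, seed=42):
    """Draw a reproducible random subset of report IDs (assumed uniform sampling)."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return sorted(rng.sample(list(report_ids), k))


def exact_match_accuracy(predicted, expected):
    """Share of reports whose extracted revenue equals the reference value.

    A report with no prediction counts as a miss. Exact match is an assumed
    metric for this sketch, not necessarily the one used in the study.
    """
    hits = sum(1 for rid, value in expected.items() if predicted.get(rid) == value)
    return hits / len(expected)


# Hypothetical IDs standing in for the 1,156 annual reports.
all_ids = [f"report_{i:04d}" for i in range(1156)]
subset = sample_report_ids(all_ids, k=100)

# Toy scoring example with made-up revenue values.
expected = {"report_a": 100, "report_b": 200}
predicted = {"report_a": 100, "report_b": 250}
accuracy = exact_match_accuracy(predicted, expected)  # 0.5
```

Fixing the seed means every model is evaluated on the same 100 reports, which keeps the per-model scores comparable.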