LLM Datasets for RAG

Our financial datasets were compiled from data we collected at SimFin (parent company of parsee.ai) over the last 8 years. They are based on more than 300k quarterly and annual financial reports that are openly accessible (and therefore not subject to any copyright limitations). We extracted the data with our custom models, which were created in an iterative, multi-year process of labeling and correcting data. We have now transformed this data using Parsee and are making the results available to companies interested in a very large, high-quality dataset for RAG applications, specifically for cases where an LLM has to understand a source document such as a PDF and extract structured data from it.

We have 4 different financial datasets available (covering different tasks), all ready for use in training custom LLMs: each row contains the full prompt and the expected (correct) answer the LLM should give.

The datasets can be delivered with the full prompts (containing essentially all the data of the source document) or with prompts capped at a token limit, e.g. 4k tokens (or any other number).

To cap the prompts, we can either apply a vector search for the most relevant elements or include a random number of fragments "before" and "after" the true source that is required to answer the question. A vector search combined with a token limit does not guarantee that the information needed to answer the question is contained in the prompt. If the prompts are capped via vector search, we therefore filter out rows where the correct "source" (the passage of text inside the document that is vital for answering the question) got cut out, making sure the LLM is only trained on samples where all the necessary information is in the text.
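
The two capping strategies and the filtering step can be sketched roughly as follows. This is a minimal illustration and not the actual Parsee implementation: the function names, the whitespace-based token count, and the assumption that the vector search result arrives as a precomputed list of ranked fragment indices are all hypothetical.

```python
import random
from typing import List, Optional


def count_tokens(text: str) -> int:
    # Assumption: a whitespace word count stands in for a real tokenizer.
    return len(text.split())


def cap_with_window(fragments: List[str], source_idx: int, max_tokens: int,
                    rng: Optional[random.Random] = None) -> List[int]:
    """Keep the source fragment plus a random number of neighbours
    before and after it, shrinking the window until it fits the budget."""
    rng = rng or random.Random()
    lo = source_idx - rng.randint(0, source_idx)
    hi = source_idx + rng.randint(0, len(fragments) - 1 - source_idx)

    def total(a: int, b: int) -> int:
        return sum(count_tokens(f) for f in fragments[a:b + 1])

    # Trim from the longer side first; the source itself is never dropped,
    # even if it alone exceeds the budget.
    while total(lo, hi) > max_tokens and (lo < source_idx or hi > source_idx):
        if hi - source_idx >= source_idx - lo:
            hi -= 1
        else:
            lo += 1
    return list(range(lo, hi + 1))


def cap_by_ranking(ranked_indices: List[int], fragments: List[str],
                   max_tokens: int) -> List[int]:
    """Take fragments in ranked order (e.g. from a vector search) until
    the token budget is exhausted; return them in document order."""
    kept, used = [], 0
    for i in ranked_indices:
        cost = count_tokens(fragments[i])
        if used + cost > max_tokens:
            break
        kept.append(i)
        used += cost
    return sorted(kept)


def keep_row(kept_indices: List[int], source_idx: int) -> bool:
    # A row is only usable for training if the source fragment survived.
    return source_idx in kept_indices
```

With the window strategy the source is always retained, so no filtering is needed; with the ranking strategy, `keep_row` is the check that discards samples whose source was cut out by the token limit.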

Instead of prompts, we can also deliver images of the source document for multimodal models.

Dataset samples can be found on Hugging Face: https://huggingface.co/parsee-ai

To see our thorough approach to dataset creation, validation and testing, you can check out our dedicated GitHub repository: https://github.com/parsee-ai/parsee-datasets

In total, the datasets have several million rows and are several terabytes in size. Pricing is per row, so you can start with a smaller fraction and get more data on request. For a detailed offer, please contact sales@parsee.ai