Data Extraction

Extraction Templates

March 12, 2024 - 5 min
The core functionality of the Parsee Extraction Templates explained.

In the following, we will focus only on the "general questions" items of the extraction templates. The logic for table detection/structuring items is quite similar, and we will add more explanations for them in the future.

Base Logic

In the most basic sense, every question you define under the "general questions" category can have exactly one answer.

If no answer can be found for a question (or meta item), the answer can always be "n/a", meaning that parsing of the values was not successful or the model did not have an answer. In that sense, all outputs are "nullable", but null values are represented by the string "n/a".
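As a minimal sketch (plain Python, not the Parsee API), the nullable-output convention can be pictured like this: a missing value is never a true null, but the literal string "n/a".

```python
# Illustrative only: null answers are represented by the string "n/a",
# not by None, so every output has a printable value.

def format_answer(value):
    """Return the parsed value, or the string "n/a" if parsing failed."""
    return "n/a" if value is None else value

print(format_answer(21.5))  # 21.5
print(format_answer(None))  # n/a
```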

A question can have more than one answer only when there is a meta item defined, which will create an "axis" along which the model can give different answers to the same question.

Example

For the question: What is the invoice total? (output type numeric)

If there is no meta item defined, the model can answer this question only with one number (because of the numeric output type) or with "n/a".

e.g.

  • Invoice total: 10.0

OR

  • Invoice total: 21.5

OR

  • Invoice total: n/a
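The three possible outcomes above can be sketched in plain Python (this is an illustration, not the Parsee API): a numeric question either parses to a single number or falls back to "n/a".

```python
# Illustrative sketch: a question with a numeric output type yields
# exactly one number, or "n/a" when no number can be parsed.

def parse_numeric_answer(raw):
    """Parse a raw model answer as a float; fall back to "n/a"."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return "n/a"

print(parse_numeric_answer("10.0"))      # 10.0
print(parse_numeric_answer("no total"))  # n/a
```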

If there can be several different valid answers to a question for a single document, you can add a meta item, which helps to structure the answer further.

If we define a meta item "invoice date" and attach it to the invoice total, the model can now theoretically give several answers for the same document, differentiated by their meta ID:

e.g.

  • (first answer) Invoice total: 10.0 as per 2022-03-01

AND

  • (second answer) Invoice total: 24.0 as per 2022-06-01

So you can think of meta items as a sort of "key": as long as the meta values differ for two items, their keys will be different. All output values can be thought of as key-value pairs.
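A minimal sketch of this key-value view (plain Python, not the Parsee API — the names `add_answer` and `answers` are illustrative): two answers to the same question coexist because their meta values differ.

```python
# Illustrative sketch: meta items act like keys, so answers to the same
# question are distinct as long as their meta values differ.

answers = {}

def add_answer(question, meta, value):
    """Store an answer keyed by (question, meta values)."""
    key = (question, tuple(sorted(meta.items())))
    answers[key] = value

add_answer("invoice_total", {"invoice_date": "2022-03-01"}, 10.0)
add_answer("invoice_total", {"invoice_date": "2022-06-01"}, 24.0)

print(len(answers))  # 2 distinct answers for the same question
```

Note that adding another answer with the same meta values would overwrite the existing entry rather than create a new one, which is exactly the "key" behavior described above.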
