Data Extraction

Extraction Templates vs. Prompt Templates

February 29, 2024 - 5 min read
Exploring the advantages of Parsee extraction templates over simple prompt templates.

Why not just prompt templates?

While prompt templates (such as "Classify the following text into the following categories…") can get the job done for simple extraction tasks, they fall short in a few crucial areas:

Limited to LLMs

The idea behind Parsee's extraction templates is to define a format that is not limited to use with LLMs, but can also be used to train other model types and run predictions with them.

Messy for More Complex Extraction/Structuring Tasks

While prompt templates can work well for simple tasks such as "What is the invoice total?", they quickly become messy when you also want to extract currencies, units, time periods, etc. along with the "main" piece of information.
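To illustrate the difference, here is a minimal sketch of what a structured extraction template might look like, compared to packing everything into one prompt string. The class and field names are purely illustrative assumptions, not Parsee's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch (not Parsee's real schema): a structured template
# bundles the "main" question with typed meta fields, instead of one
# ever-growing prompt string.

@dataclass
class MetaField:
    name: str        # e.g. "currency", "unit", "time_period"
    data_type: str   # expected output type for this field

@dataclass
class ExtractionTemplate:
    question: str              # the "main" piece of information
    data_type: str             # expected type of the main answer
    meta: list = field(default_factory=list)  # extracted alongside it

# The invoice example from above, with currency and time period
# captured as separate, typed fields rather than prompt prose:
invoice_total = ExtractionTemplate(
    question="What is the invoice total?",
    data_type="numeric",
    meta=[
        MetaField("currency", "entity"),
        MetaField("time_period", "date"),
    ],
)
```

Because each meta field is its own typed unit, adding or removing one does not require rewording a monolithic prompt.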

Type Safety

While you can tell an LLM to return data in a specific format, there is no guarantee that it will. We try to parse the data according to the defined data type, and if parsing fails, we return a "null" value, ensuring that the output data is exactly in the right format.
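The parse-or-null idea can be sketched in a few lines. This is a simplified illustration of the principle, not Parsee's actual implementation:

```python
from typing import Optional

def parse_numeric(raw: str) -> Optional[float]:
    """Try to parse an LLM's raw answer as a number.

    If the answer does not conform to the declared data type,
    return None (a "null" value) instead of passing malformed
    output downstream.
    """
    cleaned = raw.strip().replace(",", "")  # tolerate thousands separators
    try:
        return float(cleaned)
    except ValueError:
        return None

parse_numeric("1,234.50")             # -> 1234.5
parse_numeric("approx. one thousand") # -> None: fails the type check
```

The consumer of the output then only ever sees values of the declared type, or an explicit null.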
