GPT-4o Benchmark Results Showing that it is Truly a Next-Generation Model
May 15, 2024 - 5 min
We tested the performance of GPT-4 Omni (model name: gpt-4o) on our finRAG dataset. The results suggest that this is truly a next-generation model: it does not seem to have some of the common issues of previous-generation models, which possibly makes it the first model suitable for reliable enterprise use.
By Thomas Flassbeck from Parsee.ai.
Results
The methodology is the same as in our previously published study; we only added the results for GPT-4 Omni (gpt-4o). GPT-4o achieves new high scores on all evaluations and is the first model whose vision capabilities seem to be on par with its text-based processing capabilities.
Related posts
- Data Extraction: finRAG Datasets & Study. We wanted to investigate how good the current state-of-the-art (M)LLMs are at solving the relatively simple problem of extracting revenue figures from publicly available financial reports. To test this, we created 3 different datasets, all based on the same selection of 1,156 randomly selected annual reports for the year 2023 of publicly listed US companies. The resulting datasets contain a combined total of 10,404 rows, 37,536,847 tokens and 1,156 images. For our study, we are evaluating 8 state-of-the-art (M)LLMs on a subset of 100 reports.
- Data Extraction: Comparing Parsee Document Loader vs. Langchain Document Loaders for PDFs. In the following, we compare the results of the Parsee Document Loader with those of the PyPDF Langchain Document Loader on various datasets. All datasets used here can be found on Huggingface (links below), so the results are all reproducible.