We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R]

Our take

In our recent benchmarking of 18 large language models (LLMs) for optical character recognition (OCR), we discovered that many teams are overpaying for advanced models while legacy solutions often perform just as well at a fraction of the cost. By testing 42 standard documents across 7,560 calls, we found that smaller and older models can achieve premium accuracy without the hefty price tag. Our findings, along with an open-source framework and free testing tool, are available to help you optimize your OCR workflows.

TL;DR: We were overpaying for OCR, so we compared flagship models with cheaper and older models. New mini-bench + leaderboard. Free tool to test your own documents. Open source.

We’ve been looking at OCR / document extraction workflows and kept seeing the same pattern:

Too many teams are either stuck in legacy OCR pipelines or overpaying badly for LLM calls by defaulting to the newest/biggest model.

We put together a curated set of 42 standard documents and ran each of the 18 models on every document 10 times under identical conditions: 18 × 42 × 10 = 7,560 total calls. Main takeaway: for standard OCR, smaller and older models often match premium accuracy at a fraction of the cost.

We track pass^n (reliability at scale), cost-per-success, latency, and critical field accuracy.
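
For anyone who wants to sanity-check numbers like these on their own runs, here is a minimal sketch of how two of the metrics could be computed. It rests on assumptions: pass^n is taken to mean "all n runs on a document succeed", estimated with the standard combinatorial estimator, and cost-per-success is taken to be total spend divided by successful calls. The function names and example figures are hypothetical, and the repo's actual definitions may differ.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k runs drawn from `trials` all succeed.

    Assumed definition (not taken from the repo): C(successes, k) / C(trials, k).
    With k == trials this is 1.0 only if every single run succeeded.
    """
    if not 0 <= successes <= trials or not 1 <= k <= trials:
        raise ValueError("need 0 <= successes <= trials and 1 <= k <= trials")
    return comb(successes, k) / comb(trials, k)


def cost_per_success(total_cost_usd: float, successes: int) -> float:
    """Assumed definition: total spend divided by the number of successful calls."""
    return total_cost_usd / successes if successes else float("inf")


# Hypothetical figures for one model on one document (10 runs, as in the post).
runs, ok_runs, spend_usd = 10, 9, 0.045

print("pass^1 :", pass_hat_k(ok_runs, runs, 1))      # plain per-call success rate
print("pass^10:", pass_hat_k(ok_runs, runs, runs))   # a single failure drives this to zero
print("cost per success (USD):", cost_per_success(spend_usd, ok_runs))
```

The point of the stricter pass^n reading is that a model with a 90% per-call success rate still fails the "every run succeeds" bar, which matters once you push thousands of documents through it.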

Everything is open source: https://github.com/ArbitrHq/ocr-mini-bench

Leaderboard: https://arbitrhq.ai/leaderboards/

Curious whether this matches what others here are seeing.

submitted by /u/TimoKerre