April 23, 2026•1 min read•from Machine Learning

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R]

Our take

In our recent benchmark of 18 large language models (LLMs) for optical character recognition (OCR), we found that many teams overpay for flagship models when cheaper and older models often perform just as well at a fraction of the cost. Across 42 standard documents and 7,560 calls, smaller and older models matched premium accuracy on standard OCR without the hefty price tag. Our findings, along with an open-source framework and a free testing tool, are available to help you optimize your OCR workflows.

TL;DR: We were overpaying for OCR, so we compared flagship models with cheaper and older models. New mini-bench + leaderboard. Free tool to test your own documents. Open source.

We’ve been looking at OCR / document extraction workflows and kept seeing the same pattern:

Too many teams are either stuck in legacy OCR pipelines or overpaying badly for LLM calls by defaulting to the newest/biggest model.

We put together a curated set of 42 standard documents and ran each of the 18 models on every document 10 times under identical conditions, for 7,560 total calls. Main takeaway: for standard OCR, smaller and older models match premium accuracy at a fraction of the cost.
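As a quick sanity check, the call count follows directly from the setup (18 models, 42 documents, 10 runs each). The variable names below are illustrative only, not taken from the repo:

```python
# Benchmark dimensions as stated in the post
models = 18          # LLMs compared
documents = 42       # curated standard documents
runs_per_pair = 10   # repeated runs per (model, document) pair

total_calls = models * documents * runs_per_pair
print(total_calls)  # 7560
```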

We track pass^n (reliability at scale), cost-per-success, latency, and critical field accuracy.
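For anyone wanting to reproduce two of these metrics on their own logs, here is a minimal sketch. It assumes pass^n means "share of documents where all n repeated runs succeeded" and cost-per-success means "total spend divided by successful calls"; the repo may define them differently. The `summarize` function and the tuple layout are hypothetical, not the framework's API.

```python
from collections import defaultdict

def summarize(calls):
    """Summarize one model's runs.

    calls: iterable of (doc_id, passed, cost_usd) tuples (hypothetical layout).
    Returns (pass_n, cost_per_success) under the assumed definitions.
    """
    runs_by_doc = defaultdict(list)
    total_cost = 0.0
    successes = 0
    for doc_id, passed, cost_usd in calls:
        runs_by_doc[doc_id].append(bool(passed))
        total_cost += cost_usd
        successes += int(bool(passed))

    if not runs_by_doc:
        raise ValueError("no calls to summarize")

    # Assumed pass^n: a document counts only if every one of its runs passed.
    pass_n = sum(all(runs) for runs in runs_by_doc.values()) / len(runs_by_doc)

    # Assumed cost-per-success: failed calls still cost money, so they
    # inflate the price of each successful extraction.
    cost_per_success = total_cost / successes if successes else float("inf")
    return pass_n, cost_per_success


# Toy example: document "a" passes both runs, document "b" fails one.
example = [
    ("a", True, 0.01), ("a", True, 0.01),
    ("b", True, 0.01), ("b", False, 0.01),
]
pass_n, cost_per_success = summarize(example)
```

Under these definitions a model that is right most of the time but flaky on individual documents scores well on plain accuracy and poorly on pass^n, which is presumably why it is described as reliability at scale.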

Everything is open source: https://github.com/ArbitrHq/ocr-mini-bench

Leaderboard: https://arbitrhq.ai/leaderboards/

Curious whether this matches what others here are seeing.

submitted by /u/TimoKerre


DharmaOCR: Open-Source Specialized SLM (3B) + Cost–Performance Benchmark against LLMs and other open-sourced models [R]

Hey everyone, we just open-sourced DharmaOCR on Hugging Face. Models and datasets are all public, free to use and experiment with. We also published the paper documenting all the experimentation behind it, for those who want to dig into the methodology.

We fine-tuned open-source SLMs (3B and 7B parameters) using SFT + DPO and ran them against GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6, Google Document AI, and open-source alternatives like OlmOCR, Deepseek-OCR, GLMOCR, and Qwen3.

- The specialized models came out on top: 0.925 (7B) and 0.911 (3B).
- DPO using the model's own degenerate outputs as rejected examples cut the failure rate by 87.6%.
- AWQ quantization drops per-page inference cost ~22%, with insignificant effect on performance.

Models & datasets: https://huggingface.co/Dharma-AI

Full paper: https://arxiv.org/abs/2604.14314

Paper summary: https://gist.science/paper/2604.14314

submitted by /u/augusto_camargo3

