Research Notable

Vision Models vs. OCR on 30 Dense PDFs: What a 171-Question Benchmark Reveals

May 24, 2026 2 min read

What happens when you skip the OCR pipeline and hand a PDF directly to a vision model? A developer ran 171 questions across 30 long, image-heavy documents from the MMLongBench-Doc benchmark to test exactly this, using Claude Sonnet 4.5 as the reasoning model in both approaches.

Two Pipelines, One Hard Problem

The "attach and ask" approach is the path of least resistance. Upload the PDF, the model processes each page as an image, you ask questions. No preprocessing, no pipeline to maintain, no brittle extraction step that might break on unusual document layouts. The downside: reading page images is computationally heavier than reading structured text, and complex layouts can still trip up capable vision models.

OCR-based pipelines go the other direction. OCR (optical character recognition) software converts each page into machine-readable text first, then passes that text to the language model. OCR handles clean, text-heavy documents well - it's fast, reliable, and cheap to run. The problem is what happens to visual content: a detailed bar chart becomes a handful of stripped numbers with no surrounding context, a flowchart becomes nothing at all, a table with merged cells becomes garbled. The model working downstream never sees the visual structure - only whatever the OCR managed to extract.

For mostly-text documents, OCR has historically been the better choice. The interesting question is what happens when documents are genuinely visual - when charts, diagrams, and complex tables aren't decoration but actually contain the information being queried.

Why This Benchmark Stresses Both Approaches

MMLongBench-Doc is designed to push both pipelines hard. Documents in the dataset average well over 100 pages, and the questions require synthesizing information from multiple locations throughout - not finding a relevant paragraph, but connecting data scattered across dozens of pages. Results are reported as post-retry accuracy, meaning the model had multiple chances to answer each question correctly. That reflects real production systems better than single-shot evaluation, where one failed attempt counts as a failure regardless of whether a retry would have succeeded.

171 questions across 30 PDFs is a sample large enough to show consistent patterns without being impractical to reproduce.

What This Means for Document Pipelines

For teams processing PDFs professionally - legal, financial, research, technical documentation - the architecture choice has downstream consequences. If vision models reliably match or beat OCR on complex documents, the argument for simpler pipelines is real: fewer moving parts, no preprocessing failures, no degradation when documents don't fit OCR's assumptions about text layout.

If OCR still wins on text-heavy content, the better approach is routing by document type - OCR for clean prose-heavy PDFs, vision models for image-heavy ones. Some teams already do this, but it adds overhead that a single reliable pipeline would eliminate.

The full results, including breakdown by document type and question category, are in the MMLongBench-Doc repository. The dataset is publicly available for anyone running comparisons across other models.

Two Pipelines, One Hard Problem

Why This Benchmark Stresses Both Approaches

What This Means for Document Pipelines

Related Tools

More from today

The Multi-Agent Memory Problem: Why Long Projects Degrade Over Weeks

Claude Code Cache Misses Cost 12.5x More Than Hits - Here's the Math

The Case Against Uncensored Local LLMs for Most Builders

Cookie Preferences