Where State of the Art Fails
By mrkiouak@gmail.com on 2026-05-15

Where State of the Art Fails: Automating data extraction and analysis of PDFs in 2026.
It's 2026. Nvidia, semiconductor, and DRAM-producer stocks have risen by hundreds of percent, and I naïvely assumed that state-of-the-art frontier LLMs from the major providers could accurately read a PDF. I was wrong: none of Gemini, Claude, or ChatGPT came close to 70% accuracy on a ground-truth eval, and all of them frequently got details wrong across a wide sample of PDFs.
Vermont town reports — what the document is
Vermont publishes its municipal finances in the open. Each town prints an annual town report ahead of Town Meeting Day — held the first Tuesday in March — and residents vote the operating budget, plus a slate of special articles, on the floor of that meeting. The town report is the document that informs those votes. A typical 60–140-page report contains:
- The warning: a numbered list of articles to be voted, including the operating-budget article that fixes the next fiscal year's appropriation.
- A proposed budget broken out by fund (General Government, Highway, Library, etc.) and by line item, alongside prior-year actuals.
- Tax-rate components: municipal rate per $100 of assessed value, state-set education rates, and the Common Level of Appraisal when it has been computed.
- Audit, treasurer, and department narratives with prior-year actuals, fund balances, and selectboard commentary.
Layout varies — vector PDFs from Word or InDesign for roughly 80% of reports, scans for the remainder.
The first attempt: ask Gemini for the whole thing
I started with the obvious approach. Bind a Pydantic schema describing the entire budget document — funds, line items, tax rates, warrant articles, narrative reports — to a single Gemini 3.1 Pro call over the native PDF, set the thinking budget and output cap to Vertex's maximums, and ask the model to fill it in.
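For concreteness, here is a trimmed sketch of that single-call setup using the google-genai SDK. The schema below is cut down to funds and line items, and the model id, thinking budget, and output cap are illustrative placeholders; the real run bound the full document schema and set both limits to Vertex's maximums.

```python
from google import genai
from google.genai import types
from pydantic import BaseModel


class LineItem(BaseModel):
    name: str
    fiscal_year: str
    amount: float


class Fund(BaseModel):
    name: str
    line_items: list[LineItem]
    printed_total: float | None = None


class BudgetDocument(BaseModel):      # the real schema also covered tax rates,
    funds: list[Fund]                 # warrant articles, and narrative reports


client = genai.Client(vertexai=True)  # reads GOOGLE_CLOUD_PROJECT / GOOGLE_CLOUD_LOCATION

with open("warren_fy2026.pdf", "rb") as f:
    pdf_bytes = f.read()

resp = client.models.generate_content(
    model="gemini-3.1-pro",            # illustrative model id
    contents=[
        types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
        "Extract every fund and every budget line item from this town report.",
    ],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=BudgetDocument,
        max_output_tokens=65_535,                                      # placeholder caps;
        thinking_config=types.ThinkingConfig(thinking_budget=32_768),  # real run used the maximums
    ),
)

doc = BudgetDocument.model_validate_json(resp.text)
print(len(doc.funds), sum(len(f.line_items) for f in doc.funds))
```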
On Warren FY2026 — 78 pages, 28 funds, 248 line items in a hand-curated ground truth — Gemini returned six funds and 80 line items. Recall against ground truth: 31.1%. Output looked plausible at first glance: real fund names, real dollar amounts, the right schema. Roughly three-quarters of the document was missing from the output, silently.
Claude Opus 4.7 with the 1M-context beta failed the same task differently: it returned 33 fund records and then hit Anthropic's 32k output-token cap before producing any line items. GPT-5 wasn't tested.
The second attempt: locate sections, then extract them
The second attempt split the work in two. One Gemini Pro call classified each page — budget table, warrant article, departmental narrative, tax-rate disclosure — and emitted a section index. A second batch of per-section Pro calls then extracted line items inside each budget section.
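In outline, the split looked roughly like the sketch below, with hypothetical helpers (classify_pages, slice_pdf, extract_line_items) standing in for the two Gemini Pro calls and the page-range plumbing:

```python
from pydantic import BaseModel


class Section(BaseModel):
    kind: str        # "budget_table" | "warrant_article" | "narrative" | "tax_rate"
    title: str
    start_page: int
    end_page: int


class SectionIndex(BaseModel):
    sections: list[Section]


def extract_document(pdf_bytes: bytes) -> list[dict]:
    # Pass 1: one Pro call over the whole PDF emits the section index.
    index: SectionIndex = classify_pages(pdf_bytes)  # hypothetical wrapper around Gemini Pro

    # Pass 2: a batch of per-section Pro calls, each seeing only its own page range.
    line_items: list[dict] = []
    for section in index.sections:
        if section.kind != "budget_table":
            continue
        pages = slice_pdf(pdf_bytes, section.start_page, section.end_page)   # hypothetical
        line_items.extend(extract_line_items(pages, section.title))          # hypothetical
    return line_items
```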
I ran this across a wide range of Vermont town reports. Surface output often looked correct on inspection, which delayed discovering how flawed the underlying extraction actually was. On Warren, scored against the same 830-fact ground truth: 28 funds, 248 line items in the JSON, 66.9% recall, and a 50% reconciliation pass rate — half the funds had a sum of extracted line items disagreeing with the fund's own printed total by more than 2%. End-to-end cost ran $30–50 per town for a full Pro-with-thinking sweep, and debug iterations on a single problem town added another $50–100.
The recall was usable. The dollars were not.
Pause: read what the literature said
By this point I had spent enough on Vertex credits to justify checking whether anyone who actually studies document understanding had a better idea.1 The 2025–2026 literature converges on three findings:
- Document-specialized small VLMs have leapfrogged frontier general-purpose models on parsing accuracy. Sub-2B-parameter open-weights models — GLM-OCR (0.9B, OmniDocBench v1.5 composite 94.62), PaddleOCR-VL-1.5 (0.9B, 94.5), MinerU 2.5 (1.2B, top quartile), dots.ocr (1.7B, 87.5 EN) — now outscore Gemini 3 Pro (~90.3) and GPT-5.2 (~85.4) on OmniDocBench v1.5, and run on a single consumer GPU.2 OmniDocBench v1.5 is approaching saturation; olmOCR-Bench (1,400 pages, ~7,000 binary unit-test facts) is the more discriminating 2026 successor.3
- The field has converged on a "decoupled-VLM" architecture. MinerU 2.5, MonkeyOCR, PaddleOCR-VL, and Chandra all run (a) low-resolution layout and structure detection, then (b) native-resolution per-region recognition by a specialized small VLM, then (c) reading-order and relation prediction. The pattern dominates monolithic full-page VLMs (token explosion on dense pages) and classic CV → OCR → table-structure pipelines (stage-to-stage error propagation).4
- Tables are the hardest universal sub-problem, and the place commercial APIs diverge. Reducto's RD-TableBench shows real complex tables separate commercial parsers by 25 points: Reducto 90.2%, Azure Document Intelligence 82.7%, AWS Textract 80.9%, Google Document AI 64.6%.5 Applied AI's 17-parser study (June 2026, 800+ real-world PDFs) found no single parser exceeds 88% edit similarity end-to-end, and parser accuracy varies 55+ points by domain. Hybrid stacks — specialized parser for tables, frontier VLM for figures and forms, vision retrieval for charts — consistently beat any single approach.6
The 2026 recommendation for mixed digital + scanned PDFs is therefore: per-page routing (embedded text layer for digital pages, rasterize + VLM for scans); a layout-aware parser emitting structured Markdown or HTML with preserved tables; structure-preserving chunking; and ColPali-style vision retrieval for figure-heavy corpora.
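A sketch of the per-page routing step, assuming pypdf and a crude character-count heuristic (neither of which the surveys prescribe):

```python
from pypdf import PdfReader


def route_pages(path: str, min_chars: int = 50) -> dict[str, list[int]]:
    """Split a mixed PDF into pages with a usable text layer and pages to rasterize for a VLM."""
    routes: dict[str, list[int]] = {"text_layer": [], "rasterize_for_vlm": []}
    reader = PdfReader(path)
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        # Heuristic: pages with almost no extractable text are treated as scans.
        if len(text.strip()) >= min_chars:
            routes["text_layer"].append(i)
        else:
            routes["rasterize_for_vlm"].append(i)
    return routes
```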
Reproducing the hybrid
The literature's recipe — specialized parser for tables, frontier VLM for figures and forms, retrieval separated — maps to a three-stage pipeline for this use case:
- Stage 1 — layout-aware OCR. A commercial agentic-OCR API produces a per-page cell grid: deterministic table-structure recognition, reading order, and per-cell coordinates. Two backends were tested. Mistral OCR 3 charges $0.002 per page on the synchronous endpoint. Reducto charges $0.015 per credit, billing roughly 1.4 credits per page on Vermont budget PDFs, and is free under 15,000 credits per month — about 138 Warren-sized reports a month at zero marginal cost. Reducto beat Mistral in a follow-up A/B at 84.4% reconciliation pass rate against ground truth vs Mistral's 78.1%, consistent with the RD-TableBench gap.
- Stage 2 — frontier-VLM recipe call. One Gemini 3.1 Pro vision call over the OCR'd document produces what the literature calls the layout/structure output: page-kind classification, fund identification, and a column-header-to-fiscal-year-kind mapping. This is the "frontier VLM for figures and forms" piece from the Applied AI hybrid recommendation, scoped narrowly to layout and structure rather than to row extraction.
- Stage 3 — per-fund classifier. A per-fund Gemini 3.1 Flash-Lite classifier walks the cells inside each fund's recipe-defined page range and turns them into line items. A reconciliation retry loop re-runs the classifier if its summed line items don't match the fund's own printed total within ±2%, taking advantage of the fact that municipal budget tables print their own subtotals (a sketch of this loop follows the list).
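A minimal sketch of the Stage 3 reconciliation loop. The `classify` callable stands in for the Flash-Lite call, and the "amount"/"flow" field names are invented for illustration:

```python
def reconciles(line_items: list[dict], printed_total: float, tol: float = 0.02) -> bool:
    """±2% agreement between summed expenditure line items and the fund's printed total."""
    total = sum(li["amount"] for li in line_items if li.get("flow") == "expenditure")
    if printed_total == 0:
        return total == 0
    return abs(total - printed_total) / abs(printed_total) <= tol


def extract_fund(fund, cells, classify, max_retries: int = 2):
    """Run the per-fund classifier, retrying with feedback when reconciliation fails."""
    feedback = None
    line_items: list[dict] = []
    for _ in range(max_retries + 1):
        line_items = classify(fund, cells, feedback=feedback)  # hypothetical Flash-Lite wrapper
        if reconciles(line_items, fund.printed_total):
            return line_items, True
        extracted = sum(li["amount"] for li in line_items)
        feedback = (f"Your line items sum to {extracted:,.2f}, but the table prints "
                    f"{fund.printed_total:,.2f}. Re-read the cells and correct the rows.")
    return line_items, False
```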
The architecture is precisely the decoupled-VLM pattern the literature names — layout detection (the Pro recipe call) followed by per-region recognition (the per-fund classifier) — with an arithmetic-reconciliation loop bolted on. Result on Warren FY2026, scored against the 830-fact ground truth:
| Pipeline | Recall | Amount accuracy | Reconciliation pass | Lines | Cost per run | Wall |
|---|---|---|---|---|---|---|
| e2e-pro (attempt 1: single Gemini call) | 31.1% | — | 15.6% | 80 | $1.01 | 115s |
| orig (attempt 2: section locator + per-section reads) | 66.9% | 65.9% | 50.0% | 248 | $30–50 | 30–90 min |
| mid-docai-layout (DocAI Layout Parser → one Pro call) | 84.8% | — | 62.5% | 308 | $0.40 | 532s |
| mid-docai-gemini (DocAI Gemini-3 Layout Parser → one Pro call) | 91.1% | — | 71.9% | 310 | $2.74 | 1072s |
| hyb-llamaparse-pro-flash (LlamaParse cells → hybrid downstream) | 0%⁷ | — | 0% | 0 | $1.23 | 105s |
| hyb-mistral-pro-flash (Mistral OCR + Pro recipe + per-fund classifier) | 96.9% | 96.0% | 78.1% (25/32) | 292 | $1.16 | 494s |
| hyb-reducto-pro-flash (Reducto OCR + Pro recipe + per-fund classifier) | 95.5% | 94.7% | 84.4% (27/32) | 385⁸ | $2.62 | 1509s |
Eval code and the Warren fixture are bundled at https://github.com/Rkiouak/mixed-pdf-extraction-eval. With GOOGLE_CLOUD_PROJECT and MISTRAL_API_KEY set, python -m eval.run_eval --towns warren --architectures hyb-mistral-pro-flash reproduces the headline row against the same 830-fact ground truth. Signup links for every other architecture's API are in the repo's README.
Takeaways
2026 definitely won't be the year of general artificial intelligence. As a software developer, I found it frankly shocking that the same tool that can convert my natural-language instructions into Pulumi-orchestrated AWS and GCP infrastructure hosting largely LLM-generated code can't also read a budget table from a PDF. There are cases where LLMs have been very, very good at extracting narrative or bulleted semi-structured natural-language and numeric data for me; on these financial documents, though, they were an utter failure, and the current academic literature backs this up.
Miscellanea
Two further results worth flagging. Google Document AI's Layout Parser feeding a single Pro call (mid-docai-layout) reaches 84.8% recall at $0.40 per run — cheap enough to justify on cost-sensitive backfills. The Gemini-3-backed DocAI parser (mid-docai-gemini) climbs to 91.1% but at $2.74 per run. The LlamaParse-fronted hybrid (hyb-llamaparse-pro-flash) emitted zero line items: LlamaParse's cell output composed poorly with the downstream Pro structuring call, the integration-boundary failure Applied AI's paper specifically warns about when single-vendor pipelines exceed their accuracy budget.
The recall numbers track the literature's prediction with no qualitative surprises. The headline gap — 96.9% on the SOTA-pattern hybrid versus 31.1% on a naive single-call versus 66.9% on a section-locator-plus-extractor pipeline — is the gap the 2026 surveys predicted for any single-vendor pipeline operating without a specialized table parser ahead of the VLM.
Metric definitions:
- Recall — atomic facts matched ÷ 830 ground-truth facts.
- Amount accuracy — matched facts where the candidate value is within ±$1 of ground truth ÷ 830.
- Reconciliation pass — % of funds where the sum of expenditure-flow line items at the target fiscal year equals the printed fund total within ±2%.
- Cost — actual API spend per Warren run, computed from each provider's billed usage rather than estimated.
- Wall — end-to-end seconds.
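This is not the repo's eval code, but a minimal illustration of the first two definitions, assuming each atomic fact is keyed by something like (fund, line item, fiscal year) with a dollar value:

```python
def score(candidate: dict[tuple, float], ground_truth: dict[tuple, float], tol: float = 1.0):
    """Recall and amount accuracy against an atomic-fact ground truth (830 facts for Warren)."""
    matched = {k: v for k, v in candidate.items() if k in ground_truth}
    recall = len(matched) / len(ground_truth)
    within_tol = sum(1 for k, v in matched.items() if abs(v - ground_truth[k]) <= tol)
    amount_accuracy = within_tol / len(ground_truth)
    return recall, amount_accuracy
```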
Footnotes
1. The State of Document Understanding for Mixed PDFs in 2026 — a 2025–2026 survey covering benchmarks, end-to-end VLMs, modular pipeline systems, commercial APIs, and the architecture patterns now considered SOTA across RAG, semantic search, Q&A, and structured information extraction.
2. OmniDocBench v1.5 (CVPR 2025; 1,355 pages, 9 document types): https://arxiv.org/abs/2412.07626, https://github.com/opendatalab/OmniDocBench. Composite scores: GLM-OCR (Zhipu/Z.ai, 0.9B, MIT) https://huggingface.co/zai-org/GLM-OCR = 94.62; PaddleOCR-VL-1.5 (0.9B, Apache-2.0) https://arxiv.org/abs/2601.21957 = 94.5; FireRed-OCR ≈ 92.94; MinerU 2.5 (1.2B, Apache-2.0) https://arxiv.org/abs/2509.22186 = top quartile; dots.ocr (1.7B, Apache-2.0) https://github.com/rednote-hilab/dots.ocr = 87.5 EN / 84.0 ZH; Gemini 3 Pro ≈ 90.33; GPT-5.2 ≈ 85.4.
3. LlamaIndex, Feb 2026: OmniDocBench is saturated — what's next for OCR benchmarks? https://www.llamaindex.ai/blog/omnidocbench-is-saturated-what-s-next-for-ocr-benchmarks. olmOCR-Bench (Poznanski et al. 2025): https://github.com/allenai/olmocr/tree/main/olmocr/bench, paper https://arxiv.org/abs/2510.19817.
4. MonkeyOCR SRR triplet: https://arxiv.org/abs/2506.05218. MinerU 2.5: https://arxiv.org/abs/2509.22186. PaddleOCR-VL: https://arxiv.org/abs/2601.21957. Chandra OCR 2 (Datalab): https://github.com/datalab-to/chandra.
5. Reducto RD-TableBench dataset: https://huggingface.co/datasets/reducto/rd-tablebench. Comparison numbers per Reducto's published methodology https://reducto.ai/. TableFormer (TEDS 98.5 simple / 95.0 complex on PubTabNet): https://arxiv.org/abs/2203.01017.
6. Applied AI, PDF Parsing Benchmark, June 2026: https://www.applied-ai.com/briefings/pdf-parsing-benchmark/. 17 parsers tested across 800+ real-world PDFs; Gemini 3 Pro topped the field at 88% edit similarity; no parser exceeded 88%; LlamaParse rated best price/quality at $0.003/page; parser accuracy varied 55+ points by domain.
7. The same Pro+CompactDoc structuring call works correctly when fed by Mistral OCR 3 or Reducto, so the failure sits at the LlamaParse-to-Pro integration boundary rather than in either component alone.
8. Reducto's higher line count includes ~5 funds detected on the library and PTO operating pages that the v1 Warren ground truth explicitly excludes from scope. Precision on the in-scope subset is comparable to Mistral.