PDF parsing remains unsolved: LlamaParse, Reducto, Unstructured, Marker
Two years into the production RAG era, the single biggest blocker for most enterprise deployments isn't the LLM, the vector store, or the retrieval algorithm. It's whether your PDF parser got the table right. The vendors have multiplied; the problem hasn't been solved.
The dirty secret of production RAG, two years after every vendor and their cousin shipped a “RAG-in-a-box” product, is that the single biggest blocker for most enterprise deployments is not the LLM, not the embedder, not the vector store, and not the retrieval algorithm. It is whether your PDF parser correctly extracted the numbers from the table on page 47 of a 10-K filing.
When that table parses wrong — when a number gets dropped, or a column header gets associated with the wrong row, or two adjacent columns get interleaved into a single mess — every downstream component is operating on garbage. The reranker still works. The embeddings are still dense. The LLM is still smart. And the answer to “what was the goodwill impairment in Q3?” is confidently wrong.
This post is the field report on the PDF parsing landscape in 2026. The vendors have multiplied. The benchmarks have improved. The problem is, in honest terms, still not solved — and the right question is which trade-off matches your workload, not which vendor won.
Why PDFs are this hard
PDFs are a print-layout format pretending to be a content format. When a human reads a multi-column page with a footnote and a sidebar table, the reading order is obvious. When a machine reads the same page, the underlying PDF content stream has no notion of “this paragraph continues in the next column” or “this footnote attaches to this sentence.” It’s a soup of (text “blah” at x, y) operators, often in arbitrary order. Sometimes there is no text at all: fonts converted to outlines leave only vector glyph paths, and scanned pages leave only an image.
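To make the reading-order problem concrete, here is a minimal sketch in Python. The `Fragment` type, the coordinates, and the fixed `gutter_x` split are all illustrative assumptions, not how any particular parser works; real layout detection has to find the columns before it can order them.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge of the fragment (assumed units: points)
    y: float  # distance from the top of the page (assumed convention)


def naive_order(fragments):
    """Sort top-to-bottom, then left-to-right. This interleaves columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_order(fragments, gutter_x):
    """Read everything left of the gutter first, then everything right of it."""
    left = [f for f in fragments if f.x < gutter_x]
    right = [f for f in fragments if f.x >= gutter_x]
    return [f.text for f in sorted(left, key=lambda f: f.y)
            + sorted(right, key=lambda f: f.y)]


# Hypothetical two-column page whose fragments arrive in arbitrary order.
page = [
    Fragment("B1", x=320, y=100),
    Fragment("A2", x=50, y=120),
    Fragment("A1", x=50, y=100),
    Fragment("B2", x=320, y=120),
]

print(naive_order(page))           # ['A1', 'B1', 'A2', 'B2']
print(column_order(page, 300.0))   # ['A1', 'A2', 'B1', 'B2']
```

Sorting by position alone alternates between the columns. Splitting at the gutter first recovers column A followed by column B, but only because the gutter position was supplied by hand.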
The failure modes that production RAG teams see, in roughly the order they cause pain:
- Tables get interleaved. A two-column table comes out as cells from the left and right columns alternating, flattened into prose.
- Multi-column text reads out of order. A two-column journal page comes back with lines from column 1 and column 2 alternating, instead of all of column 1 followed by all of column 2.
- Footnotes attach to the wrong sentence. The footnote at the bottom of the page lands in the middle of the prose chunk where it was first referenced.
- Headers and footers contaminate every chunk. “Page 47 of ACME Corp 10-K 2024” becomes a token in every embedding.
- Scanned PDFs need OCR, and OCR introduces its own errors. Handwriting, low-quality scans, and unusual fonts all degrade recognition.
- Charts and images get lost entirely. A bar chart’s numbers live in the chart, not in the text — and most parsers ignore it.
Each of these failure modes has a different fix, and the vendors differ on which they prioritize.
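As one example of a fix, header and footer contamination can often be reduced with a frequency heuristic before chunking. The sketch below is a rough illustration under stated assumptions: pages are already split into lines, the helper names and the `min_fraction` threshold are hypothetical, and collapsing digits is a crude way to make “Page 47” and “Page 48” look identical.

```python
import re
from collections import Counter


def normalize(line):
    """Collapse digits so 'Page 47 of ...' and 'Page 48 of ...' match."""
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines whose normalized form appears on at least min_fraction of pages.

    `pages` is a list of pages, each a list of text lines.
    """
    counts = Counter()
    for page in pages:
        # Count each normalized line at most once per page.
        counts.update({normalize(line) for line in page if line.strip()})
    threshold = min_fraction * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}
    return [[line for line in page if normalize(line) not in repeated]
            for page in pages]


# Hypothetical three-page extract with a running footer.
pages = [
    ["Page 1 of ACME Corp 10-K 2024", "Revenue grew."],
    ["Page 2 of ACME Corp 10-K 2024", "Goodwill was impaired."],
    ["Page 3 of ACME Corp 10-K 2024", "Risk factors follow."],
]
cleaned = strip_repeated_lines(pages)
```

A heuristic like this will also remove legitimately repeated body lines, so it is worth checking what it drops on a sample before running it over a whole corpus.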
The vendor landscape, honestly
Five names dominate production conversations in 2026. The differences between them are real, but smaller than the marketing suggests:
LlamaParse — the convenience option
LlamaIndex’s LlamaParse is the parser most RAG teams reach for first. It’s well-integrated into the LlamaIndex ecosystem, has structured output options (markdown, JSON), supports custom parsing instructions, and is fast — community benchmarks consistently report ~6 seconds per page on simple documents. The pricing is per-page with a generous free tier, and the “premium mode” handles tables and figures respectably on standard layouts.
Where it gets weak: multi-column layouts can interleave, and financial-document table extraction is competitive but not class-leading on the hardest cases. Reducto’s public benchmark claims LlamaParse leaves accuracy on the table on dense financial content.
Reducto — vision-first, accuracy-claimed
Reducto is the most aggressive accuracy play in the market. Vision-first pipeline, fine-tuned models for table extraction, layout-aware reading-order detection. Their benchmark reports a ~20% accuracy lead over LlamaParse on real-world enterprise documents, particularly on financial filings and dense tables.
The catch is cost and throughput. Reducto is more expensive per page and slower per document than LlamaParse, and the gap matters on workloads with millions of pages. Reducto’s positioning is explicitly “if you’re doing compliance or financial-document RAG and accuracy is non-negotiable, the price is justified.”
Unstructured — the open-source baseline
Unstructured is the most-installed PDF parser in the ecosystem and the default starting point for anyone who isn’t paying for a hosted parser. It handles a wide range of document types, has a permissive license, and is well-integrated into LangChain and LlamaIndex. The hosted-API version adds OCR, layout detection, and chunking strategies that the open-source version exposes more bluntly.
What you give up: top-tier accuracy on dense tables and complex layouts. Unstructured’s own benchmark positions it as competitive with hosted services for standard documents and weaker on the difficult edge cases. For most teams this is the right starting point — the question is whether your documents are difficult enough to require an upgrade.
Marker — vision, open-source
Marker is the open-source vision-first parser that’s mostly closed the gap to the hosted services. It uses a layout model plus a small VLM to identify and parse complex regions, runs locally on a GPU, and produces clean markdown output.
The pitch is accuracy approaching commercial parsers without the per-page bill, which is genuinely useful for high-volume workloads where the per-page cost of hosted parsers (~$0.01-0.10 per page, depending on the vendor) adds up. The trade-off is the GPU infrastructure you need to run it at scale.
Mistral OCR — the commodity disrupter
Mistral OCR, launched in early 2025 and now in its third generation, is the price-disruptor play. At $2 per 1,000 pages — roughly an order of magnitude cheaper than the premium hosted parsers — and with reported 98.96% accuracy on scanned documents, it has reset baseline expectations for what commodity-tier OCR should cost. The most recent version also claims 88.9% accuracy on handwriting and 96.6% on tables, which is competitive with the premium hosted vendors on those specific document types.
The Mistral pitch — quoting from their own launch — is “SOTA document parsing at commodity pricing.” Whether that holds up against Reducto on the hardest tables remains contested, but on the bulk of enterprise document workloads, Mistral OCR is the new floor for what parsing should cost.
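The price gap is easier to judge with the arithmetic written out. This sketch uses only the prices quoted above ($2 per 1,000 pages for Mistral OCR, roughly $0.01 to $0.10 per page for hosted parsers); the one-million-page corpus is a hypothetical figure, and self-hosted GPU costs are left out because the article gives no number for them.

```python
def parsing_cost_usd(pages, usd_per_1000_pages):
    """Total parsing cost in dollars for a corpus of `pages` pages."""
    return pages * usd_per_1000_pages / 1000


CORPUS_PAGES = 1_000_000  # hypothetical corpus size

mistral_cost = parsing_cost_usd(CORPUS_PAGES, 2)        # $2 per 1,000 pages
hosted_low_cost = parsing_cost_usd(CORPUS_PAGES, 10)    # $0.01 per page
hosted_high_cost = parsing_cost_usd(CORPUS_PAGES, 100)  # $0.10 per page

print(f"Mistral OCR: ${mistral_cost:,.0f}")
print(f"Hosted range: ${hosted_low_cost:,.0f} to ${hosted_high_cost:,.0f}")
```

At that volume the quoted prices work out to $2,000 against a range of $10,000 to $100,000, which is why the per-page rate dominates the decision on bulk workloads.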
Anthropic and OpenAI native PDF support — useful, not a parser
Anthropic’s PDF support API and OpenAI’s equivalent let you pass a PDF directly to the model and get reasoning over its content. This is genuinely useful for one-off document analysis, but it’s not a production parsing solution — Anthropic’s API caps documents at 100 pages and 32MB, and you’re paying full input token costs per page. The right mental model is “PDF analysis at the model boundary,” not “PDF parser.” For the production indexing pipeline, you still need one of the dedicated parsers above.
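A pipeline that uses native PDF input for ad hoc analysis usually needs a guard for those limits. The following is a minimal sketch, assuming the 100-page and 32MB caps quoted above (which may change, and 32MB is treated here as 32 MiB); the function names are hypothetical and nothing here calls a real API.

```python
MAX_PAGES = 100
MAX_BYTES = 32 * 1024 * 1024  # assumption: "32MB" interpreted as 32 MiB


def fits_native_pdf_limits(page_count, size_bytes):
    """True if the document is within the quoted page and size caps."""
    return page_count <= MAX_PAGES and size_bytes <= MAX_BYTES


def page_batches(total_pages, max_pages=MAX_PAGES):
    """Split a document into 1-indexed, inclusive page ranges under the cap."""
    return [(start, min(start + max_pages - 1, total_pages))
            for start in range(1, total_pages + 1, max_pages)]


print(fits_native_pdf_limits(47, 5_000_000))   # True
print(fits_native_pdf_limits(250, 5_000_000))  # False
print(page_batches(250))  # [(1, 100), (101, 200), (201, 250)]
```

Batching keeps each request under the page cap, but every batch still pays full input-token cost and loses cross-batch context, which is the reason to treat this as analysis tooling rather than an indexing pipeline.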
What the public benchmarks actually show
The cleanest public benchmark of this entire landscape is OmniDocBench, published at CVPR 2025 by OpenDataLab. It evaluates parsers across 1,651 PDF pages spanning 10 document types — academic papers, financial reports, newspapers, slides, scanned PDFs, handwritten notes — with detailed annotations and per-element scoring.
A summary of the headline OmniDocBench findings, useful for calibrating expectations:
- Pipeline tools (MinerU, Mathpix) outperform vision-LLM parsers (raw GPT-4o, Gemini direct PDF) on English text and formula extraction.
- VLMs outperform pipeline tools on slides and handwritten documents, where the failure modes are different in kind.
- All models degrade noticeably on complex multi-column layouts and dense tables.
- No single parser wins every category. This is the part that matters most for procurement: you cannot pick a parser by reading one benchmark.
Reducto’s own published comparison on a curated fintech-document set claims a similar pattern with Reducto on top — predictably, since they made the benchmark — but the broader takeaway holds: the right parser depends on what kind of documents you actually have.
The honest truth about regulated content
For regulated financial documents — 10-Ks, prospectuses, clinical trial reports, regulatory filings — every team I’ve seen ship a production RAG system still has humans in the loop reviewing the parsed tables and the answers the system generates from them. The parsers are good enough to make the human review tractable; they are not good enough to remove the human entirely.
This isn’t a marketing claim from any vendor; it’s the operational reality. When the cost of a wrong answer is regulatory exposure or material misstatement, the workflow looks like:
- Parser extracts the document.
- RAG system answers a question, citing specific table cells.
- Human reviewer checks the cited cells against the original PDF.
- Discrepancies route to a remediation queue.
The parsers help by making step 3 fast — when the citation is precise and the underlying table extraction is mostly right, the human review takes seconds per question instead of minutes. But the human is not yet removable, no matter what any vendor’s marketing implies.
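The review loop above can be sketched in a few lines of Python. Everything here is an illustrative assumption: the `Citation` fields, the idea that the reviewer records the value they read from the original PDF, and the simple normalization of commas and dollar signs. A real system would need its own schema and stricter number handling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    page: int
    row: str
    column: str
    extracted_value: str  # what the parser produced for this table cell


def normalize(value):
    """Ignore formatting differences that do not change the number."""
    return value.replace(",", "").replace("$", "").strip()


def review(citations, reviewer_values):
    """Split citations into approved ones and ones needing remediation.

    `reviewer_values` maps (page, row, column) to the value the human
    reviewer read from the original PDF. A missing entry counts as a
    discrepancy, so unreviewed cells are never silently approved.
    """
    approved, remediation_queue = [], []
    for c in citations:
        seen = reviewer_values.get((c.page, c.row, c.column))
        if seen is not None and normalize(seen) == normalize(c.extracted_value):
            approved.append(c)
        else:
            remediation_queue.append(c)
    return approved, remediation_queue


# Hypothetical values, for illustration only.
citations = [
    Citation(47, "Goodwill impairment", "Q3", "1,204"),
    Citation(47, "Revenue", "Q3", "9,810"),
]
reviewer_values = {
    (47, "Goodwill impairment", "Q3"): "1204",
    (47, "Revenue", "Q3"): "9,870",
}
approved, remediation_queue = review(citations, reviewer_values)
```

The design choice that matters is the default: anything the reviewer has not confirmed goes to the remediation queue rather than through to the user.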
What to take away
PDF parsing is the part of the RAG stack that most teams underestimate, and the part where vendor selection genuinely matters more than algorithm selection. Three takeaways:
- Benchmark on your own documents, not on someone else’s benchmark. Pick three documents that represent the hard cases in your corpus — the densest table, the worst multi-column page, the lowest-quality scan — and run every vendor on those before signing a contract. The vendor leaderboards do not generalize to your mix.
- Mistral OCR has reset the price floor. $2 per 1,000 pages with respectable accuracy is the new bulk-workload baseline. Pay more only when you have a specific reason — dense tables, complex layouts, regulatory accuracy requirements.
- Humans stay in the loop for regulated content. Every honest vendor will tell you this off the record. Plan the review workflow into your RAG product from day one, not as a bolt-on later.
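For the first takeaway, a benchmark on your own documents can start very small. The sketch below assumes you have hand-built a ground-truth table for one hard page and have each parser's extraction of that table as a list of rows. The metric (exact match per cell position), the function names, and the placeholder parser names and values are all hypothetical; it is a starting point, not an established scoring method.

```python
def cell_accuracy(expected, extracted):
    """Fraction of ground-truth cells reproduced at the same row and column."""
    total = sum(len(row) for row in expected)
    if total == 0:
        raise ValueError("ground-truth table has no cells")
    correct = 0
    for r, row in enumerate(expected):
        for c, cell in enumerate(row):
            in_bounds = r < len(extracted) and c < len(extracted[r])
            if in_bounds and extracted[r][c].strip() == cell.strip():
                correct += 1
    return correct / total


def rank_parsers(expected, outputs):
    """Return (parser_name, accuracy) pairs, best first."""
    scores = {name: cell_accuracy(expected, table)
              for name, table in outputs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hand-labeled ground truth for one table (hypothetical values).
expected = [
    ["Item", "Q3"],
    ["Goodwill impairment", "12.5"],
]
# Placeholder outputs; parser_b swapped the two cells in the data row.
outputs = {
    "parser_b": [["Item", "Q3"], ["12.5", "Goodwill impairment"]],
    "parser_a": [["Item", "Q3"], ["Goodwill impairment", "12.5"]],
}
ranking = rank_parsers(expected, outputs)
```

Position-sensitive scoring is deliberate: a parser that extracts every number but attaches it to the wrong header is exactly the failure that produces confidently wrong answers downstream.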
The PDF problem is not glamorous, and the post-2024 wave of RAG infrastructure largely tried to look past it. The teams shipping real production RAG didn’t. The blunt take: spend more engineering effort on your parser than on your reranker.
Further reading: LlamaParse docs, the Reducto comparison page, Unstructured’s benchmarks, the Marker GitHub, Mistral OCR launch, and the OmniDocBench paper.