datarekha

LlamaParse — document parsing

Naive PDF text extraction destroys tables, columns, and layout — and poisons your RAG. LlamaParse uses vision models to turn messy documents into clean, LLM-ready markdown.

6 min read Beginner Agentic AI Lesson 23 of 42

What you'll learn

  • Why naive PDF extraction wrecks tables, columns, and reading order
  • How VLM-based parsing produces clean, structured markdown
  • Why parsing quality is the silent ceiling on RAG quality

Before you start

Here’s a RAG failure mode nobody warns you about: your retrieval is tuned, your prompts are great, and the answers are still garbage — because the documents were mangled the moment you loaded them. Real-world PDFs are full of tables, multiple columns, headers, and scanned pages, and naive text extraction turns all of that into scrambled soup. LlamaParse exists to fix the very first step.

Why naive extraction fails

A basic PDF text extractor reads the raw text stream, which has no idea about visual structure. The result:

  • Tables collapse — rows and columns flatten into a meaningless run of numbers with no association between a label and its value.
  • Columns interleave — a two-column page gets read straight across, splicing unrelated sentences together.
  • Reading order breaks — headers, footnotes, and captions land wherever they happen to sit in the byte stream.

Garbage in, garbage chunks, garbage retrieval.
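To make the column-interleaving failure concrete, here is a toy sketch. It is not how any particular extractor or LlamaParse works internally; the `Word` type, the coordinates, and the `gutter_x` split point are all made up for illustration. It only shows how reading "straight across" splices two columns together, while a layout-aware pass keeps each column intact.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word on the page
    y: float  # vertical position of its line (smaller = higher on the page)


def naive_order(words):
    """Read straight across the page: line by line, left to right."""
    return " ".join(w.text for w in sorted(words, key=lambda w: (w.y, w.x)))


def column_aware_order(words, gutter_x):
    """Read the whole left column first, then the whole right column."""
    left = sorted((w for w in words if w.x < gutter_x), key=lambda w: (w.y, w.x))
    right = sorted((w for w in words if w.x >= gutter_x), key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in left + right)


# A hypothetical two-column page: left column at x < 200, right column at x >= 200.
page = [
    Word("Refunds", 0, 0), Word("are", 60, 0),
    Word("Pricing", 300, 0), Word("starts", 360, 0),
    Word("issued", 0, 10), Word("monthly.", 60, 10),
    Word("at", 300, 10), Word("$99.", 330, 10),
]

naive = naive_order(page)
fixed = column_aware_order(page, gutter_x=200)
print(naive)  # Refunds are Pricing starts issued monthly. at $99.
print(fixed)  # Refunds are issued monthly. Pricing starts at $99.
```

The naive output contains every word, yet no sentence in it is true to the page. That is the kind of text that ends up embedded and retrieved.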

Naive extraction:

    Plan Price Refund
    Enterprise Starter 99
    499 30 days no
    refund monthly…

Which price goes with which plan? Lost.

LlamaParse (VLM):

    | Plan       | Price | Refund  |
    |------------|-------|---------|
    | Enterprise | $499  | 30 days |
    | Starter    | $99   | no      |

Structure preserved: label↔value intact.
Same table: naive extraction scrambles label↔value pairs; VLM parsing preserves them as clean markdown.

How LlamaParse is different

LlamaParse treats each page as an image and uses a vision-language model to read it the way a person would — seeing the table grid, the column boundaries, the heading hierarchy — and emits clean markdown (with real markdown tables). That markdown is what you then split into Nodes and index. Because the structure survives, a chunk like the Enterprise row keeps its price and its refund window together, so retrieval can actually answer “what’s the Enterprise refund window?”

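from llama_cloud_services import LlamaParse
from llama_index.core import VectorStoreIndex

# LlamaParse is a hosted service: it needs an API key, read from the
# LLAMA_CLOUD_API_KEY environment variable (or passed as api_key=...).
# Parse a gnarly PDF into clean markdown documents
docs = LlamaParse(result_type="markdown").load_data("pricing.pdf")

# Then the usual LlamaIndex pipeline, but now on clean, structured text.
# With default settings this step calls OpenAI, so OPENAI_API_KEY must be set.
index = VectorStoreIndex.from_documents(docs)
answer = index.as_query_engine().query("What's the Enterprise refund window?")
print(answer)   # should report the 30-day window, because the table survived parsing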
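Why does a preserved table help retrieval? One way to see it is to turn each markdown row into a self-describing chunk. The sketch below is plain Python, not a LlamaIndex or LlamaParse API; `table_rows_as_chunks` is a hypothetical helper, and it assumes a simple pipe table with one header row and one separator row.

```python
def table_rows_as_chunks(markdown_table: str) -> list[str]:
    """Turn each data row of a simple markdown table into 'Header: value' text."""
    lines = [ln.strip() for ln in markdown_table.strip().splitlines() if ln.strip()]

    def cells(line: str) -> list[str]:
        # Drop the outer pipes, split on the inner ones, trim the padding.
        return [c.strip() for c in line.strip("|").split("|")]

    header = cells(lines[0])
    chunks = []
    for line in lines[2:]:  # lines[1] is the |---|---| separator row
        values = cells(line)
        chunks.append("; ".join(f"{h}: {v}" for h, v in zip(header, values)))
    return chunks


table = """
| Plan       | Price | Refund  |
|------------|-------|---------|
| Enterprise | $499  | 30 days |
| Starter    | $99   | no      |
"""

chunks = table_rows_as_chunks(table)
for chunk in chunks:
    print(chunk)
# Plan: Enterprise; Price: $499; Refund: 30 days
# Plan: Starter; Price: $99; Refund: no
```

Each chunk now carries its own labels, so the Enterprise price and refund window cannot drift apart. The naive output offers no reliable way to rebuild those pairs.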

Quick check

Q1. Why does naive PDF text extraction hurt RAG quality?
Q2. How does LlamaParse produce clean output?
Q3. If your RAG answers are poor and your documents are table-heavy PDFs, what should you fix first?

Next

That completes the LlamaIndex cluster. Next, a different production framework: the OpenAI Agents SDK and its handoffs + guardrails model.


Practice this in an interview

In LlamaIndex, what are nodes and query engines, and how is RAG exposed as a tool to an agent?

In LlamaIndex a Node is a chunk of a source document with metadata and relationships, indexed for retrieval; a query engine wraps an index to take a natural-language query, retrieve relevant nodes, and synthesize an answer. RAG-as-a-tool wraps a query engine in a QueryEngineTool so an agent can call it like any other tool, deciding when to retrieve from that knowledge source as part of its reasoning loop.

What is Retrieval-Augmented Generation (RAG) and how does a basic RAG pipeline work?

RAG augments an LLM by retrieving relevant documents from an external knowledge store at query time and feeding them into the prompt as grounding context. A basic pipeline chunks and embeds documents into a vector store, retrieves the top-k most similar chunks for a query, and the LLM generates an answer conditioned on them, reducing hallucination and keeping knowledge current.
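A minimal sketch of that pipeline, with loud simplifications: word counts stand in for a real embedding model, a Python list stands in for the vector store, and the final LLM call is left out (the code stops at the prompt). All function names here are illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9$]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Put the retrieved chunks into the prompt as grounding context."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


chunks = [
    "Plan: Enterprise; Price: $499; Refund: 30 days",
    "Plan: Starter; Price: $99; Refund: no",
    "Support is available by email on weekdays.",
]
query = "What is the Enterprise refund window?"

top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)  # this string is what would be sent to the LLM
```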

What is Retrieval-Augmented Generation (RAG) and why is it used?

RAG couples a retrieval step — fetching relevant documents from an external store — with a generative model so the LLM can answer questions about knowledge it was never trained on. It solves the stale-knowledge and hallucination problems without retraining. The pattern is preferred when the knowledge base changes frequently or contains proprietary data.

How do you evaluate the quality of an LLM or RAG system?

Evaluation splits into retrieval quality (did we fetch the right chunks?) and generation quality (did the model use them correctly?). Key metrics are context precision/recall for retrieval and faithfulness plus answer relevance for generation. Frameworks like RAGAS automate LLM-as-judge scoring; human annotation anchors the ground truth.
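For the retrieval half, the simplest form of these metrics is set-based precision and recall over hand-labeled chunk IDs, sketched below. This is an illustration of the idea only: frameworks such as RAGAS define their context metrics in their own way (for example with an LLM judge), so their numbers may differ. The chunk IDs are made up.

```python
def context_precision_recall(retrieved: list[str], relevant: list[str]) -> tuple[float, float]:
    """Set-based precision and recall of retrieved chunk IDs against labeled relevant IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: 4 chunks retrieved, 3 chunks labeled relevant, 2 in common.
precision, recall = context_precision_recall(
    retrieved=["c1", "c4", "c7", "c9"],
    relevant=["c1", "c2", "c4"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Low precision means the prompt is padded with irrelevant chunks; low recall means the chunk that holds the answer never reached the model.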
