What is Retrieval-Augmented Generation (RAG) and how does a basic RAG pipeline work?

RAG augments an LLM by retrieving relevant documents from an external knowledge store at query time and feeding them into the prompt as grounding context. A basic pipeline chunks and embeds documents into a vector store, retrieves the top-k most similar chunks for a query, and the LLM generates an answer conditioned on them, reducing hallucination and keeping knowledge current.

In LlamaIndex, what are nodes and query engines, and how is RAG exposed as a tool to an agent?

In LlamaIndex a Node is a chunk of a source document with metadata and relationships, indexed for retrieval; a query engine wraps an index to take a natural-language query, retrieve relevant nodes, and synthesize an answer. RAG-as-a-tool wraps a query engine in a QueryEngineTool so an agent can call it like any other tool, deciding when to retrieve from that knowledge source as part of its reasoning loop.

Compare RAG and fine-tuning. When would you use each?

RAG injects external knowledge at inference time and is best when information changes often, must be cited, or is too large to bake into weights. Fine-tuning changes model behavior, style, or format and is best for teaching new skills or domain tone; the two are complementary and often combined.

What is Retrieval-Augmented Generation (RAG) and why is it used?

RAG couples a retrieval step — fetching relevant documents from an external store — with a generative model so the LLM can answer questions about knowledge it was never trained on. It solves the stale-knowledge and hallucination problems without retraining. The pattern is preferred when the knowledge base changes frequently or contains proprietary data.

Firecrawl — web data for LLMs & RAG — Agentic AI

Most of what an LLM needs to know exists on the web. The problem is that the web was designed for browsers, not language models. A typical news article weighs in at 50 KB of HTML, but only 2 KB of that is the article text. The rest is navigation, tracking pixels, <script> blocks, and structured markup that means nothing to a model. Feed that raw HTML into your context window and you waste tokens on noise, confuse the model with markup syntax, and make retrieval less precise. Clean Markdown — just prose, code blocks, and headings — is what you actually want.

There is a second, harder problem: modern web apps render in the browser. A simple HTTP GET to a React or Next.js page returns a near-empty HTML skeleton; the real content arrives only after JavaScript runs. A naive scraper captures the skeleton. You need a headless browser, but running one reliably at scale, across thousands of URLs, is operationally expensive.

Firecrawl is a hosted service (with a self-hostable open-source version) that takes a URL and gives you back clean Markdown or structured JSON. It handles JS rendering, pagination, boilerplate removal, and — if you need it — following links across an entire site.

Why clean Markdown beats raw HTML

What lands in your context	Token cost	LLM signal
Raw HTML of a docs page	~8 000	Low — tags, attributes, scripts dominate
Markdown of the same page	~1 500	High — prose and code only

Smaller, higher-signal context means cheaper inference, more focused retrieval, and fewer hallucinations caused by confusing markup. It also means you can fit more chunks into a single embedding pass.

The three core operations

Firecrawl exposes three conceptual operations. The Python SDK wraps them as methods on a client object. (The answer to the prompt above: a plain GET captured the empty pre-JavaScript skeleton — scrape renders the page in a headless browser first, so you get the real text.)

Scrape — one page

Scrape fetches a single URL, renders JavaScript if needed, strips navigation and boilerplate, and returns the page content as Markdown (and optionally the raw HTML or page metadata). This is the unit operation; the others build on it.

import os
from firecrawl import FirecrawlApp  # pip install firecrawl-py

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

result = app.scrape_url(
    "https://docs.example.com/getting-started",
    formats=["markdown"],          # or ["markdown", "html", "links"]
    # see current docs for the full options object
)

markdown_text = result.markdown
print(markdown_text[:500])

result.markdown is a plain string — ready to chunk, embed, and index.

Crawl — an entire site

Crawl starts from a seed URL, follows internal links up to a configurable depth and page limit, and scrapes each page. The result is a list of Markdown documents, one per page. This is how you ingest a documentation site or product blog into a vector store in one call.

crawl_result = app.crawl_url(
    "https://docs.example.com",
    limit=50,          # max pages to visit
    max_depth=3,       # link-following depth from the seed
    # see current docs for allow_patterns, exclude_patterns, etc.
)

# crawl_result.data is a list of page objects, each with .markdown
for page in crawl_result.data:
    print(page.url, len(page.markdown), "chars")

Extract — structured data from a page

Extract goes beyond Markdown: you give Firecrawl a Pydantic schema or a natural-language prompt describing the fields you want, and it returns structured JSON. This is useful when you need to pull product prices, publication dates, or author names from a set of pages — rather than ingesting the full prose.

from pydantic import BaseModel

class ArticleInfo(BaseModel):
    title: str
    author: str | None
    published_date: str | None
    summary: str

# The extract API shape varies — check current Firecrawl docs for the
# exact method name and options (the SDK is actively developed).
extract_result = app.extract(
    ["https://blog.example.com/post-1", "https://blog.example.com/post-2"],
    schema=ArticleInfo,
)

for item in extract_result.data:
    print(item.title, item.published_date)

Where Firecrawl fits in your system

The diagram below shows the two main integration points: RAG ingestion (offline, batch) and agent tool use (online, per-request).

Firecrawl sits between the raw web and your LLM system, handling rendering and cleanup.

RAG ingestion pattern

The most common use case is a one-time or scheduled batch job: crawl your docs site, chunk the Markdown, embed it, and push chunks into a vector store.

import os
from firecrawl import FirecrawlApp
from openai import OpenAI  # swap for any embeddings provider

crawl_app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
oai = OpenAI()

pages = crawl_app.crawl_url(
    "https://docs.myproduct.com",
    limit=100,
    max_depth=2,
).data

documents = []
for page in pages:
    # crude chunking — use a proper splitter (e.g. RecursiveCharacterTextSplitter)
    # in production; this illustrates the shape
    chunks = [page.markdown[i:i+1500] for i in range(0, len(page.markdown), 1500)]
    for chunk in chunks:
        embedding = oai.embeddings.create(
            model="text-embedding-3-small",
            input=chunk,
        ).data[0].embedding
        documents.append({"text": chunk, "url": page.url, "embedding": embedding})

# → upsert `documents` into your vector store (Pinecone, pgvector, Qdrant, etc.)

Agent browsing tool pattern

For agents that need to read arbitrary URLs at query time — a research assistant, a competitive-intelligence agent, a support bot that needs to look up a live changelog — wrap scrape_url as a tool the model can call.

def browse_url(url: str) -> str:
    """Fetch a web page and return its content as clean Markdown.
    Use this to read any public URL when you need current information."""
    result = crawl_app.scrape_url(url, formats=["markdown"])
    # Truncate to avoid flooding the context; tune to your model's window
    return (result.markdown or "")[:8000]

# Register `browse_url` as a tool in your agent framework of choice.
# The model calls it with a URL; it gets back prose it can reason over.

Scrape vs. crawl vs. extract — a quick map

Scrape: one URL, one Markdown document. Use for targeted lookups — an agent reading a specific page, or ingesting a single article.
Crawl: one seed URL, many pages. Use for bulk ingestion of a docs site, blog, or wiki into a vector store.
Extract: one or more URLs, structured JSON fields. Use when you need specific data points rather than full prose — prices, dates, author names, specs.

All three share the same transport layer: Firecrawl’s managed infrastructure renders the page headlessly, follows redirects, handles auth challenges where configured, and strips boilerplate before your code sees a single byte.

In one breath

The web is built for browsers, not LLMs — raw HTML is mostly nav, ads, and scripts; feeding it wastes tokens, confuses the model, and hurts retrieval. You want clean Markdown.
Firecrawl solves two hard problems: boilerplate removal (≈8k HTML tokens → ≈1.5k Markdown) and JS rendering (a plain GET on a React/Next page returns an empty skeleton; a headless browser gets the real content).
Three operations: scrape (one URL → one Markdown doc), crawl (seed → many pages, capped by limit/max_depth), extract (URLs + schema → structured JSON).
Two integration points: RAG ingestion (offline batch — crawl → chunk → embed → store) and an agent browsing tool (online — wrap scrape_url, trim to ~4–8k chars).
Crawls are metered and legally bounded — cap limit/max_depth, respect robots.txt and ToS, and verify retrieved facts before acting (web content is untrusted and can be stale).

Quick check

0/3

Q1Why does Firecrawl return Markdown instead of raw HTML?

Q2Your agent needs to read a competitor's pricing page that is rendered by React. Which Firecrawl operation should you use, and why?

Q3You are building a nightly job that refreshes a vector store from your company's 200-page documentation site. Which Firecrawl operation fits, and what guard should you set?

A subtle point hides in that warning: scraped web content is untrusted input. The moment your agent reads a fetched page, you are back in agent security territory — a page can carry a prompt injection, so treat scraped Markdown as data, never as instructions.

Firecrawl — web data for LLMs & RAG

What you'll learn

Before you start

Why clean Markdown beats raw HTML

The three core operations

Scrape — one page

Crawl — an entire site

Extract — structured data from a page

Where Firecrawl fits in your system

RAG ingestion pattern

Agent browsing tool pattern

Scrape vs. crawl vs. extract — a quick map

In one breath

Quick check

Quick check

Next

Sign in to track your progress

Practice this in an interview

Related lessons

Explore further