datarekha

Firecrawl — web data for LLMs & RAG

Turn any URL into clean, LLM-ready Markdown. Firecrawl handles JS rendering, boilerplate removal, pagination, and full-site crawls so your RAG pipeline and agents get signal — not noise.

7 min read Intermediate Agentic AI Lesson 42 of 42

What you'll learn

  • Why raw HTML is a poor fit for LLMs — and what clean Markdown buys you
  • The three Firecrawl operations: scrape, crawl, and structured extract
  • How to wire Firecrawl into a RAG ingestion pipeline or an agent tool

Before you start

Most of what an LLM needs to know exists on the web. The problem is that the web was designed for browsers, not language models. A typical news article can weigh in at around 50 KB of HTML, of which only about 2 KB is the article text. The rest is navigation, tracking pixels, <script> blocks, and structured markup that means nothing to a model. Feed that raw HTML into your context window and you waste tokens on noise, confuse the model with markup syntax, and make retrieval less precise. Clean Markdown (just prose, code blocks, and headings) is what you actually want.
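
To make the signal-to-noise point concrete, here is a minimal sketch that measures how much of an HTML document is visible text. It uses only the standard library; the sample page and the names VisibleTextExtractor and visible_text_ratio are invented for illustration. It only skips <script> and <style> contents, so navigation text still counts as "visible" and the real share of useful text is lower than the number it reports.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text_ratio(html: str) -> float:
    """Return len(visible text) / len(raw HTML), a value between 0 and 1."""
    if not html:
        return 0.0
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return len(text) / len(html)


# Hypothetical sample page, invented for illustration.
SAMPLE_HTML = (
    "<html><head><script>var tracking = {id: 12345, events: []};</script>"
    "<style>nav { display: flex; }</style></head>"
    '<body><nav><a href="/">Home</a><a href="/news">News</a></nav>'
    "<article><h1>Title</h1><p>The article text.</p></article></body></html>"
)

print(f"{visible_text_ratio(SAMPLE_HTML):.0%} of the characters are visible text")
```

Running the same function over a page you actually plan to ingest is a quick way to see how much of your context budget raw HTML would waste.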

There is a second, harder problem: modern web apps render in the browser. A simple HTTP GET to a React or Next.js page returns a near-empty HTML skeleton; the real content arrives only after JavaScript runs. A naive scraper captures the skeleton. You need a headless browser, but running one reliably at scale, across thousands of URLs, is operationally expensive.

Firecrawl is a hosted service (with a self-hostable open-source version) that takes a URL and gives you back clean Markdown or structured JSON. It handles JS rendering, pagination, boilerplate removal, and — if you need it — following links across an entire site.

Why clean Markdown beats raw HTML

What lands in your context | Token cost | LLM signal
Raw HTML of a docs page | ~8,000 tokens | Low: tags, attributes, scripts dominate
Markdown of the same page | ~1,500 tokens | High: prose and code only

Smaller, higher-signal context means cheaper inference, more focused retrieval, and fewer hallucinations caused by confusing markup. It also means you can fit more chunks into a single embedding pass.
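
The arithmetic behind that claim is easy to check. The sketch below takes the approximate per-page figures from the table and a hypothetical 32,000-token context budget (an assumption, not a property of any particular model) and works out how many pages fit in each format.

```python
def pages_that_fit(context_budget_tokens: int, tokens_per_page: int) -> int:
    """How many whole pages fit into a fixed token budget."""
    if tokens_per_page <= 0:
        raise ValueError("tokens_per_page must be positive")
    return context_budget_tokens // tokens_per_page


RAW_HTML_TOKENS = 8_000   # approximate figure from the table above
MARKDOWN_TOKENS = 1_500   # approximate figure from the table above
CONTEXT_BUDGET = 32_000   # hypothetical budget, chosen for illustration

html_pages = pages_that_fit(CONTEXT_BUDGET, RAW_HTML_TOKENS)
markdown_pages = pages_that_fit(CONTEXT_BUDGET, MARKDOWN_TOKENS)
token_savings = 1 - MARKDOWN_TOKENS / RAW_HTML_TOKENS

print(f"Raw HTML pages that fit: {html_pages}")
print(f"Markdown pages that fit: {markdown_pages}")
print(f"Tokens saved per page:   {token_savings:.0%}")
```

With these inputs the budget holds 4 raw HTML pages or 21 Markdown pages, which is where the cost and retrieval-focus benefits come from.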

The three core operations

Firecrawl exposes three conceptual operations. The Python SDK wraps them as methods on a client object.

Scrape — one page

Scrape fetches a single URL, renders JavaScript if needed, strips navigation and boilerplate, and returns the page content as Markdown (and optionally the raw HTML or page metadata). This is the unit operation; the others build on it.

import os
from firecrawl import FirecrawlApp  # pip install firecrawl-py

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

result = app.scrape_url(
    "https://docs.example.com/getting-started",
    formats=["markdown"],          # or ["markdown", "html", "links"]
    # see current docs for the full options object
)

markdown_text = result.markdown
print(markdown_text[:500])

result.markdown is a plain string — ready to chunk, embed, and index.
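
One reason a Markdown string is convenient to chunk is that its headings mark natural section boundaries. The following is a minimal sketch of a heading-aware splitter, written with the standard library only; split_markdown_by_heading and its max_chars default are illustrative choices, not part of Firecrawl. It starts a new chunk at each heading, ignores "#" lines inside fenced code blocks, and hard-wraps any section that is still too long.

```python
import re

HEADING_RE = re.compile(r"^#{1,6}\s")
FENCE = "`" * 3  # the Markdown code-fence marker


def split_markdown_by_heading(markdown: str, max_chars: int = 1500) -> list[str]:
    """Split Markdown into chunks that begin at headings where possible."""
    sections: list[str] = []
    current: list[str] = []
    in_fence = False

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # A "# comment" inside a code fence is not a heading, so skip it.
        if not in_fence and HEADING_RE.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks: list[str] = []
    for section in sections:
        if not section:
            continue
        # Hard-wrap sections longer than max_chars; shorter ones pass through.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```

Calling split_markdown_by_heading(result.markdown or "") would give you a list of strings to embed. A production splitter would also add overlap between chunks and count tokens rather than characters.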

Crawl — an entire site

Crawl starts from a seed URL, follows internal links up to a configurable depth and page limit, and scrapes each page. The result is a list of Markdown documents, one per page. This is how you ingest a documentation site or product blog into a vector store in one call.

crawl_result = app.crawl_url(
    "https://docs.example.com",
    limit=50,          # max pages to visit
    max_depth=3,       # link-following depth from the seed
    # see current docs for include_paths, exclude_paths, etc.
)

# crawl_result.data is a list of page objects, each with .markdown
for page in crawl_result.data:
    # .markdown can be None if a page failed to render, so guard the len() call
    print(page.url, len(page.markdown or ""), "chars")
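
To build intuition for how the page limit and depth settings interact, here is a conceptual model of a bounded crawl as a breadth-first traversal over an in-memory link graph. This is a sketch for illustration only: the traversal order, the plan_crawl helper, and the SITE graph are assumptions and do not describe how Firecrawl schedules requests internally.

```python
from collections import deque


def plan_crawl(seed: str, links: dict[str, list[str]], limit: int, max_depth: int) -> list[str]:
    """Return pages in the order a breadth-first crawl would visit them.

    `links` maps each page to the internal links found on it.
    """
    if limit <= 0:
        return []
    order: list[str] = []
    seen = {seed}
    queue = deque([(seed, 0)])

    while queue and len(order) < limit:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # visit this page, but do not follow its links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site structure, invented for illustration.
SITE = {
    "/": ["/guide", "/api"],
    "/guide": ["/guide/install", "/"],
    "/api": ["/api/auth"],
    "/guide/install": ["/guide/install/linux"],
}

print(plan_crawl("/", SITE, limit=50, max_depth=1))
print(plan_crawl("/", SITE, limit=4, max_depth=2))
```

The first call stops at the seed's direct links because of the depth bound; the second stops after four pages because of the page limit. Whichever bound is hit first ends the crawl, which is why it is worth setting both.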

Extract — structured data from a page

Extract goes beyond Markdown: you give Firecrawl a Pydantic schema or a natural-language prompt describing the fields you want, and it returns structured JSON. This is useful when you need to pull product prices, publication dates, or author names from a set of pages — rather than ingesting the full prose.

from pydantic import BaseModel

class ArticleInfo(BaseModel):
    title: str
    author: str | None
    published_date: str | None
    summary: str

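# The extract API shape varies between SDK versions, so check the current
# Firecrawl docs for the exact method name and options. Some versions expect a
# JSON Schema dict (ArticleInfo.model_json_schema()) rather than the model class.
extract_result = app.extract(
    ["https://blog.example.com/post-1", "https://blog.example.com/post-2"],
    schema=ArticleInfo,
)

# Assumption: .data is a list of objects, one per URL. If your SDK version
# returns a single dict shaped like the schema, read its keys instead.
for item in extract_result.data:
    print(item.title, item.published_date)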
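
Whatever shape the SDK returns, extracted fields come from a model and are worth validating before they reach your database. The sketch below assumes each extracted item arrives as a plain dict with the four keys from ArticleInfo; ArticleRecord and parse_article are hypothetical helpers written with the standard library, not part of Firecrawl or Pydantic.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ArticleRecord:
    title: str
    summary: str
    author: Optional[str] = None
    published_date: Optional[date] = None


def parse_article(raw: dict) -> ArticleRecord:
    """Validate one extracted payload.

    Raises ValueError when a required field (title, summary) is missing.
    Optional fields that are empty or unparseable become None.
    """
    title = (raw.get("title") or "").strip()
    summary = (raw.get("summary") or "").strip()
    if not title or not summary:
        raise ValueError("title and summary are required")

    author = (raw.get("author") or "").strip() or None

    published = None
    raw_date = raw.get("published_date")
    if raw_date:
        try:
            # Accept "2024-05-01" and ISO timestamps such as "2024-05-01T10:00:00Z".
            published = date.fromisoformat(str(raw_date)[:10])
        except ValueError:
            published = None  # keep the record, drop the unparseable date

    return ArticleRecord(title=title, summary=summary, author=author, published_date=published)
```

Rejecting records without a title or summary while tolerating a missing author or date mirrors the required and optional fields in the schema above.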

Where Firecrawl fits in your system

The diagram below shows the two main integration points: RAG ingestion (offline, batch) and agent tool use (online, per-request).

[Diagram] URL(s) from the raw web → Firecrawl (1. JS render in a headless browser, 2. boilerplate removal, 3. Markdown / structured extract) → clean output (Markdown string or structured JSON) → RAG pipeline (chunk → embed → store) or agent tool (live browsing)

Firecrawl sits between the raw web and your LLM system, handling rendering and cleanup.

RAG ingestion pattern

The most common use case is a one-time or scheduled batch job: crawl your docs site, chunk the Markdown, embed it, and push chunks into a vector store.

import os
from firecrawl import FirecrawlApp
from openai import OpenAI  # swap for any embeddings provider

crawl_app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
oai = OpenAI()

pages = crawl_app.crawl_url(
    "https://docs.myproduct.com",
    limit=100,
    max_depth=2,
).data

documents = []
for page in pages:
    text = page.markdown or ""  # markdown can be missing for failed or empty pages
    # crude chunking: use a proper splitter (e.g. RecursiveCharacterTextSplitter)
    # in production; this illustrates the shape
    chunks = [text[i:i+1500] for i in range(0, len(text), 1500)]
    if not chunks:
        continue
    # one request per page: the embeddings endpoint accepts a list of inputs
    response = oai.embeddings.create(
        model="text-embedding-3-small",
        input=chunks,
    )
    for chunk, item in zip(chunks, response.data):
        documents.append({"text": chunk, "url": page.url, "embedding": item.embedding})

# → upsert `documents` into your vector store (Pinecone, pgvector, Qdrant, etc.)
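
Once the chunks are stored, the query side of the pipeline ranks them by similarity to the embedded question. As a minimal sketch of what a vector store does for you, the code below runs a brute-force cosine-similarity search over an in-memory list with the same dict shape as documents. The two-dimensional toy vectors and the helper names are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding: list[float], documents: list[dict], k: int = 3) -> list[dict]:
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy data with hypothetical 2-dimensional embeddings.
toy_documents = [
    {"text": "install guide", "url": "https://docs.example.com/install", "embedding": [1.0, 0.0]},
    {"text": "billing FAQ", "url": "https://docs.example.com/billing", "embedding": [0.0, 1.0]},
    {"text": "quickstart", "url": "https://docs.example.com/quickstart", "embedding": [0.7, 0.7]},
]

for doc in top_k([1.0, 0.1], toy_documents, k=2):
    print(doc["url"])
```

Keeping the source URL next to each chunk, as the ingestion loop does, is what lets the final answer cite the page it came from.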

Agent browsing tool pattern

For agents that need to read arbitrary URLs at query time — a research assistant, a competitive-intelligence agent, a support bot that needs to look up a live changelog — wrap scrape_url as a tool the model can call.

def browse_url(url: str) -> str:
    """Fetch a web page and return its content as clean Markdown.
    Use this to read any public URL when you need current information."""
    result = crawl_app.scrape_url(url, formats=["markdown"])
    # Truncate to avoid flooding the context; tune to your model's window
    return (result.markdown or "")[:8000]

# Register `browse_url` as a tool in your agent framework of choice.
# The model calls it with a URL; it gets back prose it can reason over.
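
Because the model chooses the URL, a tool like this benefits from a few guards around the fetch. The sketch below wraps any fetch function (for example, a thin wrapper around the scrape call above) with URL validation, error handling, and truncation. make_browse_tool and its error-message format are illustrative assumptions rather than part of any agent framework.

```python
from typing import Callable, Optional
from urllib.parse import urlparse


def make_browse_tool(
    fetch_markdown: Callable[[str], Optional[str]],
    max_chars: int = 8000,
) -> Callable[[str], str]:
    """Wrap a fetch function with validation, error handling, and truncation."""

    def browse_url(url: str) -> str:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            return "Error: only absolute http(s) URLs are supported."
        try:
            text = fetch_markdown(url) or ""
        except Exception as exc:
            # Return the failure as text so the model can retry or move on.
            return f"Error: could not fetch {url} ({type(exc).__name__})."
        if len(text) > max_chars:
            return text[:max_chars] + "\n\n[truncated]"
        return text

    return browse_url


# Usage with a stub fetcher; swap in a real scrape call in practice.
demo_tool = make_browse_tool(lambda url: "# Changelog\n\nv2.1 adds retries.", max_chars=8000)
print(demo_tool("https://example.com/changelog"))
print(demo_tool("ftp://example.com/file"))
```

Returning errors as strings instead of raising keeps the agent loop alive, and the explicit truncation marker tells the model that it has not seen the whole page.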

Scrape vs. crawl vs. extract — a quick map

  • Scrape: one URL, one Markdown document. Use for targeted lookups — an agent reading a specific page, or ingesting a single article.
  • Crawl: one seed URL, many pages. Use for bulk ingestion of a docs site, blog, or wiki into a vector store.
  • Extract: one or more URLs, structured JSON fields. Use when you need specific data points rather than full prose — prices, dates, author names, specs.

All three share the same rendering pipeline: Firecrawl’s managed infrastructure renders the page in a headless browser, follows redirects, handles auth challenges where configured, and strips boilerplate before your code sees any of the content.


Quick check

Q1. Why does Firecrawl return Markdown instead of raw HTML?
Q2. Your agent needs to read a competitor's pricing page that is rendered by React. Which Firecrawl operation should you use, and why?
Q3. You are building a nightly job that refreshes a vector store from your company's 200-page documentation site. Which Firecrawl operation fits, and what guard should you set?

Practice this in an interview

What is Retrieval-Augmented Generation (RAG) and how does a basic RAG pipeline work?

RAG augments an LLM by retrieving relevant documents from an external knowledge store at query time and feeding them into the prompt as grounding context. A basic pipeline chunks and embeds documents into a vector store, retrieves the top-k most similar chunks for a query, and the LLM generates an answer conditioned on them, reducing hallucination and keeping knowledge current.

In LlamaIndex, what are nodes and query engines, and how is RAG exposed as a tool to an agent?

In LlamaIndex a Node is a chunk of a source document with metadata and relationships, indexed for retrieval; a query engine wraps an index to take a natural-language query, retrieve relevant nodes, and synthesize an answer. RAG-as-a-tool wraps a query engine in a QueryEngineTool so an agent can call it like any other tool, deciding when to retrieve from that knowledge source as part of its reasoning loop.

What is Retrieval-Augmented Generation (RAG) and why is it used?

RAG couples a retrieval step — fetching relevant documents from an external store — with a generative model so the LLM can answer questions about knowledge it was never trained on. It solves the stale-knowledge and hallucination problems without retraining. The pattern is preferred when the knowledge base changes frequently or contains proprietary data.

Compare RAG and fine-tuning. When would you use each?

RAG injects external knowledge at inference time and is best when information changes often, must be cited, or is too large to bake into weights. Fine-tuning changes model behavior, style, or format and is best for teaching new skills or domain tone; the two are complementary and often combined.
