Firecrawl — web data for LLMs & RAG
Turn any URL into clean, LLM-ready Markdown. Firecrawl handles JS rendering, boilerplate removal, pagination, and full-site crawls so your RAG pipeline and agents get signal — not noise.
What you'll learn
- Why raw HTML is a poor fit for LLMs — and what clean Markdown buys you
- The three Firecrawl operations: scrape, crawl, and structured extract
- How to wire Firecrawl into a RAG ingestion pipeline or an agent tool
Before you start
Most of what an LLM needs to know exists on the web. The problem is that
the web was designed for browsers, not language models. A news article
can easily weigh in at 50 KB of HTML, of which only about 2 KB is the
article text. The rest is navigation, tracking pixels, <script> blocks,
and structured markup that means nothing to a model. Feed that raw HTML
into your context window and you waste tokens on noise, confuse the model
with markup syntax, and make retrieval less precise. Clean Markdown (just
prose, code blocks, and headings) is what you actually want.
There is a second, harder problem: modern web apps render in the browser. A simple HTTP GET to a React or Next.js page returns a near-empty HTML skeleton; the real content arrives only after JavaScript runs. A naive scraper captures the skeleton. You need a headless browser, but running one reliably at scale, across thousands of URLs, is operationally expensive.
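To make the skeleton problem concrete, here is a minimal sketch that measures how much visible text a raw HTML response carries. The helper names and the 200-character threshold are illustrative assumptions, not part of Firecrawl; the point is only that a plain GET on a client-rendered page leaves almost nothing to index.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that is not inside <script>, <style>, or <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text_length(html: str) -> int:
    """Number of characters of human-readable text in an HTML string."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.parts))


def looks_like_js_skeleton(html: str, min_chars: int = 200) -> bool:
    """Heuristic (assumed threshold): very little visible text suggests
    the real content is rendered client-side."""
    return visible_text_length(html) < min_chars


# What a plain GET might return for a client-rendered app (made-up markup).
skeleton = (
    '<html><head><title>App</title></head><body>'
    '<div id="root"></div><script src="/main.js"></script></body></html>'
)
# What the same page might look like after rendering (made-up markup).
rendered = (
    "<html><body><article>"
    + "<p>Real article text for the reader.</p>" * 20
    + "</article></body></html>"
)

print(looks_like_js_skeleton(skeleton))
print(looks_like_js_skeleton(rendered))
```

A check like this can flag pages that need a headless browser, but it does not render anything itself; that is the part a service like Firecrawl takes on.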
Firecrawl is a hosted service (with a self-hostable open-source version) that takes a URL and gives you back clean Markdown or structured JSON. It handles JS rendering, pagination, boilerplate removal, and — if you need it — following links across an entire site.
Why clean Markdown beats raw HTML
| What lands in your context | Token cost | LLM signal |
|---|---|---|
| Raw HTML of a docs page | ~8 000 | Low — tags, attributes, scripts dominate |
| Markdown of the same page | ~1 500 | High — prose and code only |
Smaller, higher-signal context means cheaper inference, more focused retrieval, and fewer hallucinations caused by confusing markup. It also means you can fit more chunks into a single embedding pass.
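As a rough worked example of the table above, the sketch below computes how many pages fit in a fixed token budget at each size. The 32 000-token budget is a hypothetical figure chosen for illustration, and the per-page counts are the approximate values from the table, not measurements.

```python
def pages_that_fit(context_budget_tokens: int, tokens_per_page: int) -> int:
    """How many whole pages of a given size fit in a token budget."""
    return context_budget_tokens // tokens_per_page


RAW_HTML_TOKENS = 8_000   # approximate value from the table above
MARKDOWN_TOKENS = 1_500   # approximate value from the table above
CONTEXT_BUDGET = 32_000   # hypothetical budget, for illustration only

raw_pages = pages_that_fit(CONTEXT_BUDGET, RAW_HTML_TOKENS)
markdown_pages = pages_that_fit(CONTEXT_BUDGET, MARKDOWN_TOKENS)
savings = 1 - MARKDOWN_TOKENS / RAW_HTML_TOKENS

print(f"raw HTML pages that fit: {raw_pages}")
print(f"Markdown pages that fit: {markdown_pages}")
print(f"token reduction per page: {savings:.0%}")
```

With these illustrative figures, 4 raw pages fit in the budget versus 21 Markdown pages.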
The three core operations
Firecrawl exposes three conceptual operations. The Python SDK wraps them as methods on a client object.
Scrape — one page
Scrape fetches a single URL, renders JavaScript if needed, strips navigation and boilerplate, and returns the page content as Markdown (and optionally the raw HTML or page metadata). This is the unit operation; the others build on it.
import os
from firecrawl import FirecrawlApp  # pip install firecrawl-py
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
result = app.scrape_url(
    "https://docs.example.com/getting-started",
    formats=["markdown"],  # or ["markdown", "html", "links"]
    # see current docs for the full options object
)
markdown_text = result.markdown
print(markdown_text[:500])
result.markdown is a plain string — ready to chunk, embed, and index.
Crawl — an entire site
Crawl starts from a seed URL, follows internal links up to a configurable depth and page limit, and scrapes each page. The result is a list of Markdown documents, one per page. This is how you ingest a documentation site or product blog into a vector store in one call.
crawl_result = app.crawl_url(
    "https://docs.example.com",
    limit=50,      # max pages to visit
    max_depth=3,   # link-following depth from the seed
    # see current docs for path include/exclude filters and other options
)
# crawl_result.data is a list of page objects, each with .markdown
for page in crawl_result.data:
    # .markdown may be empty for some pages, so fall back to ""
    print(page.url, len(page.markdown or ""), "chars")
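The `limit` and `max_depth` options above can be pictured as bounds on a breadth-first walk over a site's internal links. The sketch below runs that walk over a small in-memory link graph, with no network access. The `crawl_order` function and the `SITE` graph are hypothetical, and Firecrawl's actual traversal order and depth semantics may differ, so treat this as a mental model rather than a description of its implementation.

```python
from collections import deque


def crawl_order(link_graph: dict, seed: str, limit: int, max_depth: int) -> list:
    """Return pages in breadth-first order, bounded by limit and max_depth.

    link_graph maps each page to the internal links found on it.
    """
    if limit <= 0:
        return []
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth bound
        for nxt in link_graph.get(url, []):
            if nxt in seen:
                continue  # each page is visited at most once
            if len(visited) >= limit:
                return visited  # page budget exhausted
            seen.add(nxt)
            visited.append(nxt)
            queue.append((nxt, depth + 1))
    return visited


# Hypothetical site structure for illustration.
SITE = {
    "/": ["/guide", "/api"],
    "/guide": ["/guide/install", "/"],
    "/api": ["/api/auth"],
    "/guide/install": ["/guide/install/linux"],
}

print(crawl_order(SITE, "/", limit=50, max_depth=1))
print(crawl_order(SITE, "/", limit=50, max_depth=2))
print(crawl_order(SITE, "/", limit=4, max_depth=3))
```

A low `max_depth` keeps the crawl near the seed, while `limit` caps total cost regardless of how the site is linked.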
Extract — structured data from a page
Extract goes beyond Markdown: you give Firecrawl a Pydantic schema or a natural-language prompt describing the fields you want, and it returns structured JSON. This is useful when you need to pull product prices, publication dates, or author names from a set of pages — rather than ingesting the full prose.
from pydantic import BaseModel
class ArticleInfo(BaseModel):
    title: str
    author: str | None = None          # optional: default to None when absent
    published_date: str | None = None  # optional: default to None when absent
    summary: str
# The extract API shape varies: check current Firecrawl docs for the
# exact method name, options, and the shape of the returned data
# (the SDK is actively developed).
extract_result = app.extract(
    ["https://blog.example.com/post-1", "https://blog.example.com/post-2"],
    schema=ArticleInfo,
)
for item in extract_result.data:
    print(item.title, item.published_date)
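Because optional fields such as `author` and `published_date` may come back missing or loosely formatted, it can help to normalize extracted records before storing them. The sketch below assumes each record arrives as a plain dict with the field names from `ArticleInfo` and that dates, when present, are ISO-formatted strings. Both are assumptions for illustration, and `ArticleRecord` and `parse_article` are hypothetical helpers, not part of the Firecrawl SDK.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ArticleRecord:
    title: str
    summary: str
    author: Optional[str] = None
    published_date: Optional[date] = None


def parse_article(raw: dict) -> ArticleRecord:
    """Validate required fields and parse the date if it is ISO-formatted."""
    missing = [key for key in ("title", "summary") if not raw.get(key)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    published = raw.get("published_date")
    try:
        parsed = date.fromisoformat(published) if published else None
    except (ValueError, TypeError):
        parsed = None  # keep the record, drop the unparseable date
    return ArticleRecord(
        title=raw["title"],
        summary=raw["summary"],
        author=raw.get("author"),
        published_date=parsed,
    )


# Made-up records standing in for extraction output.
good = parse_article(
    {"title": "Post 1", "summary": "Intro.", "published_date": "2024-03-05"}
)
fuzzy = parse_article(
    {"title": "Post 2", "summary": "More.", "published_date": "last Tuesday"}
)
print(good.published_date, fuzzy.published_date)
```

Rejecting records without a title or summary while tolerating a bad date is one possible policy; stricter pipelines might reject both.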
Where Firecrawl fits in your system
The diagram below shows the two main integration points: RAG ingestion (offline, batch) and agent tool use (online, per-request).
[Diagram: Firecrawl sits between the raw web and your LLM system, handling rendering and cleanup.]
RAG ingestion pattern
The most common use case is a one-time or scheduled batch job: crawl your docs site, chunk the Markdown, embed it, and push chunks into a vector store.
import os
from firecrawl import FirecrawlApp
from openai import OpenAI # swap for any embeddings provider
crawl_app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
oai = OpenAI()
pages = crawl_app.crawl_url(
    "https://docs.myproduct.com",
    limit=100,
    max_depth=2,
).data
documents = []
for page in pages:
    text = page.markdown or ""  # guard against pages with no Markdown
    # crude chunking: use a proper splitter (e.g. RecursiveCharacterTextSplitter)
    # in production; this illustrates the shape
    chunks = [text[i:i+1500] for i in range(0, len(text), 1500)]
    for chunk in chunks:
        embedding = oai.embeddings.create(
            model="text-embedding-3-small",
            input=chunk,
        ).data[0].embedding
        documents.append({"text": chunk, "url": page.url, "embedding": embedding})
# → upsert `documents` into your vector store (Pinecone, pgvector, Qdrant, etc.)
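The fixed-width slicing above can cut sentences and code blocks in half. One possible step up, sketched below with only the standard library, is to split on Markdown headings first and then pack whole paragraphs into chunks up to a size budget. The `chunk_markdown` helper is hypothetical and deliberately simple: it treats any line starting with `#` and a space as a heading, so `#` comments inside code blocks would be misread as headings.

```python
import re


def chunk_markdown(text: str, max_chars: int = 1500) -> list[str]:
    """Split Markdown into chunks that respect headings and paragraphs."""
    # Zero-width split before each heading line keeps sections together.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        current = ""
        for para in re.split(r"\n\s*\n", section):
            para = para.strip()
            if not para:
                continue
            # Hard-split any single paragraph longer than the budget.
            pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
            for piece in pieces:
                if current and len(current) + 2 + len(piece) > max_chars:
                    chunks.append(current)
                    current = piece
                else:
                    current = f"{current}\n\n{piece}" if current else piece
        if current:
            chunks.append(current)
    return chunks


sample = (
    "# Intro\n\nFirst paragraph.\n\nSecond paragraph.\n\n"
    "## Setup\n\nInstall it.\n"
)
chunks = chunk_markdown(sample, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

Chunks never span two sections, so each one can carry its heading as retrieval context; a dedicated splitter library will handle more edge cases than this sketch.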
Agent browsing tool pattern
For agents that need to read arbitrary URLs at query time — a research
assistant, a competitive-intelligence agent, a support bot that needs to
look up a live changelog — wrap scrape_url as a tool the model can
call.
def browse_url(url: str) -> str:
    """Fetch a web page and return its content as clean Markdown.
    Use this to read any public URL when you need current information."""
    result = crawl_app.scrape_url(url, formats=["markdown"])
    # Truncate to avoid flooding the context; tune to your model's window
    return (result.markdown or "")[:8000]
# Register `browse_url` as a tool in your agent framework of choice.
# The model calls it with a URL; it gets back prose it can reason over.
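Registration details depend on the agent framework, but many accept a JSON-schema style tool description plus a dispatcher that runs the call. The sketch below shows one such shape. The spec layout follows a common function-calling format and may need adapting, `make_dispatcher` is a hypothetical helper, and `fake_fetch` stands in for `browse_url` so the example runs without network access or an API key.

```python
import json
from typing import Callable
from urllib.parse import urlparse

# Tool description in a common JSON-schema style (adapt to your framework).
BROWSE_TOOL_SPEC = {
    "type": "function",
    "function": {
        "name": "browse_url",
        "description": "Fetch a public web page and return its content as clean Markdown.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Absolute http(s) URL to read."}
            },
            "required": ["url"],
        },
    },
}


def make_dispatcher(fetch: Callable[[str], str], max_chars: int = 8000):
    """Build a function that validates a tool call and runs `fetch`."""

    def dispatch(tool_name: str, arguments_json: str) -> str:
        if tool_name != "browse_url":
            return f"error: unknown tool {tool_name!r}"
        try:
            url = json.loads(arguments_json)["url"]
        except (json.JSONDecodeError, KeyError, TypeError):
            return "error: arguments must be a JSON object with a 'url' field"
        if not isinstance(url, str):
            return "error: 'url' must be a string"
        parsed = urlparse(url)
        # Only allow web URLs; model-supplied input should not reach file:// etc.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            return "error: only absolute http(s) URLs are allowed"
        return fetch(url)[:max_chars]

    return dispatch


def fake_fetch(url: str) -> str:
    """Stand-in for browse_url, returning synthetic Markdown."""
    return f"# Page at {url}\n\n" + "body " * 5000


dispatch = make_dispatcher(fake_fetch)
print(dispatch("browse_url", '{"url": "https://example.com"}')[:40])
print(dispatch("browse_url", '{"url": "file:///etc/passwd"}'))
```

Returning error strings instead of raising lets the model see what went wrong and retry with corrected arguments; whether that suits your agent loop is a design choice.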
Scrape vs. crawl vs. extract — a quick map
- Scrape: one URL, one Markdown document. Use for targeted lookups — an agent reading a specific page, or ingesting a single article.
- Crawl: one seed URL, many pages. Use for bulk ingestion of a docs site, blog, or wiki into a vector store.
- Extract: one or more URLs, structured JSON fields. Use when you need specific data points rather than full prose — prices, dates, author names, specs.
All three share the same transport layer: Firecrawl’s managed infrastructure renders the page headlessly, follows redirects, handles auth challenges where configured, and strips boilerplate before your code sees a single byte.