What chunking strategies exist for RAG and how do you choose between them?

For: AI / LLM Engineer, ML Engineer, Data Scientist

The short answer

Chunking splits source documents into retrievable units before embedding. The right strategy depends on document structure, query style, and the model's context window. Fixed-size chunks are simple but break mid-sentence; semantic or structural chunking preserves coherence; hierarchical chunking enables parent-document retrieval for richer context.

How to think about it

Chunking is one of the highest-leverage decisions in a RAG pipeline. The retriever returns chunks, not full documents, so each chunk must be independently meaningful yet small enough that irrelevant text does not dilute the signal.

Common strategies

Strategy	Description	Best for
Fixed-size (token)	Split every N tokens, overlap M	Fast baseline; homogeneous text
Sentence / paragraph	Split on natural boundaries	Prose documents, FAQs
Recursive character	LangChain's general-purpose splitter; tries each separator in order (paragraph, then line, then word, then character) until chunks fit	General-purpose
Semantic chunking	Split where cosine similarity between adjacent sentences drops	Long, heterogeneous docs
Structural (Markdown / HTML)	Split on headers, sections	Technical docs, wikis
Hierarchical (parent-child)	Index small child chunks, retrieve parent for context	Dense knowledge bases
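To make the semantic chunking row concrete, here is a minimal sketch of splitting where similarity between adjacent sentences drops. It assumes a toy bag-of-words vector in place of a real embedding model, and the names `bow_vector`, `semantic_chunks`, and the `threshold` value of 0.2 are illustrative choices, not part of any library.

```python
import math
import re
from collections import Counter


def bow_vector(sentence: str) -> Counter:
    """Toy stand-in for an embedding: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(bow_vector(prev), bow_vector(curr)) < threshold:
            chunks.append([curr])      # similarity dropped: topic boundary
        else:
            chunks[-1].append(curr)    # same topic: extend current chunk
    return chunks


sentences = [
    "The retriever returns chunks of text.",
    "Each chunk of text is embedded by the retriever.",
    "Bananas are rich in potassium.",
    "Potassium in bananas supports muscle function.",
]
grouped = semantic_chunks(sentences)
for group in grouped:
    print(" ".join(group))
```

In practice the bag-of-words vectors would be swapped for sentence embeddings, and the threshold would likely need tuning per corpus.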

Choosing chunk size

Too large: retrieval returns verbose, noisy chunks; irrelevant sentences hurt answer quality.
Too small: the chunk lacks sufficient context for the model to synthesize a good answer.
Rule of thumb: target 256–512 tokens with 10–20% overlap for general prose.
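The size and overlap arithmetic can be sketched as a sliding window whose stride is the chunk size minus the overlap. This is an illustrative baseline, assuming the text has already been tokenized; `fixed_size_chunks` is a hypothetical helper name, and the placeholder tokens stand in for real tokenizer output.

```python
def fixed_size_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens; a real pipeline would use its embedding model's tokenizer.
tokens = [f"tok{i}" for i in range(1000)]
chunks = fixed_size_chunks(tokens, size=400, overlap=60)  # 60 / 400 = 15% overlap
print([len(c) for c in chunks])
```

With 1,000 tokens, a size of 400, and an overlap of 60, the stride is 340, so windows start at tokens 0, 340, and 680, and the last chunk is shorter than the others.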

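from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder input; substitute your own document text.
document_text = "First paragraph about chunking.\n\nSecond paragraph about retrieval."

# chunk_size and chunk_overlap are measured in characters by default
# (length_function=len), not tokens. For token-based sizes, consider
# RecursiveCharacterTextSplitter.from_tiktoken_encoder (requires tiktoken).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,        # characters
    chunk_overlap=60,      # characters (15% of chunk_size)
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, coarsest first
)
chunks = splitter.split_text(document_text)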

Hierarchical retrieval (parent-document)

Index small child chunks (128 tokens) for precise matching; when a child is retrieved, look up its parent chunk (512 tokens) and send the parent to the LLM. This keeps retrieval precise while giving the model enough context to reason.

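# Pseudo-code: split(text, size) stands for any token-based splitter
parent_chunks = split(doc, size=512)
child_to_parent = {}  # mapping: child_id -> parent_id
for parent_id, parent in enumerate(parent_chunks):
    # Split each parent (not the raw doc) so no child straddles two parents
    for j, child in enumerate(split(parent, size=128)):
        child_to_parent[(parent_id, j)] = parent_id
# At inference: retrieve child, return parent_chunks[child_to_parent[child_id]]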
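A runnable version of the parent-child idea might look like the sketch below. It assumes whitespace-separated words as a proxy for tokens and a toy word-overlap score in place of vector search; `split_words`, `build_index`, and `retrieve_parent` are hypothetical helper names, and the sizes are shrunk so the example is easy to inspect.

```python
import re


def split_words(text: str, size: int) -> list[str]:
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(doc: str, parent_size: int = 512, child_size: int = 128):
    """Return parent chunks plus (child_text, parent_id) pairs."""
    parents = split_words(doc, parent_size)
    children = []
    for parent_id, parent in enumerate(parents):
        # Children are cut from their parent, so each maps to exactly one parent.
        for child in split_words(parent, child_size):
            children.append((child, parent_id))
    return parents, children


def retrieve_parent(query: str, parents: list[str], children: list[tuple[str, int]]) -> str:
    """Match the query against children, then return the matching child's parent."""
    query_terms = set(re.findall(r"\w+", query.lower()))

    def score(item: tuple[str, int]) -> int:
        # Toy relevance score: number of shared words (stand-in for vector search).
        return len(query_terms & set(re.findall(r"\w+", item[0].lower())))

    _, parent_id = max(children, key=score)
    return parents[parent_id]


doc = ("Chunk overlap reduces boundary loss. " * 4
       + "Parent retrieval returns wider context. " * 4)
parents, children = build_index(doc, parent_size=20, child_size=5)
context = retrieve_parent("what does parent retrieval return?", parents, children)
print(context)
```

The match happens on a five-word child, but the text handed to the LLM is the full twenty-word parent, which is the trade-off the section describes.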

