What chunking strategies exist for RAG and how do you choose between them?
Chunking splits source documents into retrievable units before embedding. The right strategy depends on document structure, query style, and the model's context window. Fixed-size chunks are simple but can cut sentences in half; semantic or structural chunking preserves coherence; hierarchical chunking matches on small chunks and returns their larger parents for richer context.
How to think about it
Chunking is one of the highest-leverage decisions in a RAG pipeline. The retriever returns chunks, not full documents, so each chunk must be independently meaningful yet small enough that irrelevant text does not dilute the signal.
Common strategies
| Strategy | Description | Best for |
|---|---|---|
| Fixed-size (token) | Split every N tokens, with M tokens of overlap between neighboring chunks | Fast baseline; homogeneous text |
| Sentence / paragraph | Split on natural boundaries | Prose documents, FAQs |
| Recursive character | Tries progressively finer separators (paragraph, line, word) until chunks fit the size limit; LangChain's recommended general-purpose splitter | General-purpose |
| Semantic chunking | Split where embedding cosine similarity between adjacent sentences drops below a threshold | Long, heterogeneous docs |
| Structural (Markdown / HTML) | Split on headers, sections | Technical docs, wikis |
| Hierarchical (parent-child) | Index small child chunks, retrieve parent for context | Dense knowledge bases |
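
To make the semantic row concrete, here is a minimal sketch of splitting where adjacent-sentence similarity drops. It is an illustration under stated assumptions, not a production method: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the `threshold` default of 0.1 is an arbitrary value chosen for this example.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.1) -> list[str]:
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append([curr])      # similarity dropped: open a new chunk
        else:
            chunks[-1].append(curr)    # still on topic: extend the current chunk
    return [" ".join(c) for c in chunks]


text = (
    "Cats are small pets. Cats sleep most of the day. "
    "Rust is a systems language. Rust has a borrow checker."
)
result = semantic_chunks(text)
# result holds two chunks: the two cat sentences, then the two Rust sentences
```

With a real embedding model the structure stays the same; only `embed` and the threshold change, and the threshold usually needs tuning per corpus.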
Choosing chunk size
- Too large: retrieval returns verbose, noisy chunks; irrelevant sentences hurt answer quality.
- Too small: the chunk lacks sufficient context for the model to synthesize a good answer.
- Rule of thumb: target 256–512 tokens with a 10–20% overlap for general prose.
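```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# document_text is assumed to hold the raw document as a single string.
# Sizes are counted in characters by default (length_function=len), not tokens;
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) counts tokens instead.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,    # characters
    chunk_overlap=60,  # 15% of chunk_size
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)
```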
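
For comparison, the size and overlap arithmetic behind the rule of thumb can be sketched without any library. This is a minimal illustration: the function name `fixed_size_chunks` is hypothetical, and plain list items stand in for model tokens.

```python
def fixed_size_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end; skip a redundant tail chunk
    return chunks


# Placeholder items stand in for real tokens (an assumption for illustration).
tokens = [f"w{i}" for i in range(1000)]
chunks = fixed_size_chunks(tokens, size=400, overlap=60)
# The window advances 340 tokens per step, so 1000 tokens yield 3 chunks
# of 400, 400, and 320 tokens.
```

Each chunk repeats the last 60 tokens of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.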
Hierarchical retrieval (parent-document)
Index small child chunks (128 tokens) for precise matching; when a child is retrieved, look up its parent chunk (512 tokens) and send the parent to the LLM. This keeps retrieval precise while giving the model enough context to reason.
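```python
def split(doc, size):  # whitespace-separated words stand in for tokens in this sketch
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(doc):
    parent_chunks = split(doc, size=512)
    # Store mapping: each child chunk is paired with the id of the parent it came from
    child_chunks = [(c, pid) for pid, p in enumerate(parent_chunks) for c in split(p, size=128)]
    return parent_chunks, child_chunks
# At inference: retrieve a child, then return parent_chunks[pid] for its stored pid
```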
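
The lookup at inference time can be sketched as follows. Everything here is illustrative: the tiny hand-written index, the `score` function (word overlap standing in for vector similarity), and the name `retrieve_parent` are assumptions, not part of any particular library.

```python
import re

# Hypothetical pre-built index: parent texts by id, and child chunks paired
# with the id of the parent they were cut from.
parents = {
    0: "Alpha service handles login and session tokens daily.",
    1: "Beta service stores invoices and payment records safely.",
}
children = [
    ("Alpha service handles login", 0),
    ("and session tokens daily.", 0),
    ("Beta service stores invoices", 1),
    ("and payment records safely.", 1),
]


def score(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct words shared by query and chunk."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c)


def retrieve_parent(query: str) -> str:
    """Match against the small child chunks, then return the larger parent text."""
    _, parent_id = max(children, key=lambda item: score(query, item[0]))
    return parents[parent_id]


context = retrieve_parent("where are invoices stored?")
# context is the full text of parent 1, not just the matching child
```

In a real pipeline the child chunks would live in a vector store and `score` would be an embedding similarity search; the child-to-parent lookup stays a plain id mapping.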