What is Retrieval-Augmented Generation (RAG) and why is it used?
RAG couples a retrieval step, which fetches relevant documents from an external store, with a generative model, so the LLM can answer questions about knowledge it was never trained on. It mitigates stale knowledge and reduces hallucination without retraining, because answers are grounded in retrieved text rather than in model weights alone. The pattern is preferred when the knowledge base changes frequently or contains proprietary data.
How to think about it
RAG = Retrieval-Augmented Generation. Instead of baking knowledge into model weights, you retrieve it on demand and hand it to the model as context.
Why it matters
LLMs are frozen at training time. A GPT-4-class model trained in mid-2024 knows nothing about your internal documentation, last week’s earnings call, or a regulation published yesterday. Fine-tuning can patch some of this, but it is expensive and still cannot keep the model current in real time. RAG lets you swap the knowledge base without touching the model.
The pipeline
- Embed the user query into a dense vector.
- Retrieve the top-k most similar document chunks from a vector store.
- Augment the LLM prompt with those chunks as context.
- Generate a grounded answer.
    from openai import OpenAI
    import numpy as np

    client = OpenAI()  # reads OPENAI_API_KEY from the environment


    def retrieve(query: str, index: np.ndarray, chunks: list[str], top_k: int = 4) -> list[str]:
        # index holds one embedding per row; chunks[i] is the text behind row i
        q_vec = client.embeddings.create(
            model="text-embedding-3-small", input=query
        ).data[0].embedding
        # Dot product equals cosine similarity, assuming index rows and the query are unit length
        scores = index @ np.array(q_vec)
        top_idx = np.argsort(scores)[-top_k:][::-1]  # best match first
        return [chunks[i] for i in top_idx]


    def rag_answer(query: str, index: np.ndarray, chunks: list[str]) -> str:
        # Join the retrieved chunks into one context block for the prompt
        context = "\n\n".join(retrieve(query, index, chunks))
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
            ],
        )
        return response.choices[0].message.content or ""
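
The code above takes `index` as given and relies on it being pre-normalised. A minimal sketch of how such an index might be built and ranked against, runnable offline, could look like the following. The helper names `build_index` and `top_k_indices` are hypothetical (they belong to no library), and the 2-D vectors are toy values standing in for real embeddings.

```python
import numpy as np


def build_index(embeddings: list[list[float]]) -> np.ndarray:
    """Stack chunk embeddings into a matrix and L2-normalise each row."""
    matrix = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Guard against all-zero rows so the division never produces NaN.
    return matrix / np.clip(norms, 1e-12, None)


def top_k_indices(index: np.ndarray, q_vec: list[float], top_k: int = 4) -> list[int]:
    """Return the row numbers of the top_k most similar rows, best first."""
    q = np.asarray(q_vec, dtype=np.float64)
    q = q / max(float(np.linalg.norm(q)), 1e-12)  # normalise the query as well
    scores = index @ q  # cosine similarity: rows and query are unit length
    return [int(i) for i in np.argsort(scores)[::-1][:top_k]]


# Toy 2-D vectors standing in for real embeddings (illustrative values only).
toy_index = build_index([[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
ranking = top_k_indices(toy_index, [1.0, 0.2], top_k=2)
print(ranking)
```

Normalising once at build time means each query costs a single matrix-vector product. For a large corpus, a dedicated vector store would typically replace the brute-force `argsort`.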