datarekha
NLP & LLMs · Easy · Asked at OpenAI, Google, Microsoft, Databricks

What is Retrieval-Augmented Generation (RAG) and why is it used?

The short answer

RAG couples a retrieval step, which fetches relevant documents from an external store, with a generative model so the LLM can answer questions about knowledge it was never trained on. It mitigates stale knowledge and reduces hallucination without retraining, because answers are grounded in retrieved text rather than in model weights alone. The pattern is preferred when the knowledge base changes frequently or contains proprietary data.

How to think about it

RAG = Retrieval-Augmented Generation. Instead of baking knowledge into model weights, you retrieve it on demand and hand it to the model as context.

Why it matters

LLMs are frozen at training time. A GPT-4-class model trained in mid-2024 knows nothing about your internal documentation, last week’s earnings call, or a regulation published yesterday. Fine-tuning can patch some of this, but retraining is expensive and you still cannot update it in real time. RAG lets you swap the knowledge base without touching the model.
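To make the "swap the knowledge base" point concrete, here is a minimal offline sketch. It assumes a toy word-overlap ranking in place of real vector search, and the names `retrieve`, `build_prompt`, and `knowledge_base` are illustrative only. The prompt-building code stands in for everything on the model side and never changes; only the stored documents do.

```python
def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Rank docs by word overlap with the query (toy stand-in for vector search)."""
    q_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, docs: list[str]) -> str:
    """Put the retrieved text in front of the question, as a RAG prompt would."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"


knowledge_base = [
    "The refund window is 14 days.",
    "Shipping takes 3 to 5 business days.",
]
question = "What is the refund window?"
before = build_prompt(question, knowledge_base)

# The policy changes: edit the store, leave the model and prompt code untouched.
knowledge_base[0] = "The refund window is 30 days."
after = build_prompt(question, knowledge_base)
```

After the update, the same question produces a prompt carrying the new policy text, with no retraining step anywhere in the flow.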

The pipeline

RAG pipeline: user query → embed query → vector search → augment prompt → LLM generate → answer
  1. Embed the user query into a dense vector.
  2. Retrieve the top-k most similar document chunks from a vector store.
  3. Augment the LLM prompt with those chunks as context.
  4. Generate a grounded answer.
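from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def retrieve(query: str, index: np.ndarray, chunks: list[str], top_k: int = 4) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    # Embed the query (assumed to be the same model that embedded the chunks).
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    # index holds one unit-length row per chunk, so the dot product is cosine similarity.
    scores = index @ np.array(q_vec)
    top_idx = np.argsort(scores)[-top_k:][::-1]  # highest score first
    return [chunks[i] for i in top_idx]

def rag_answer(query: str, index: np.ndarray, chunks: list[str]) -> str:
    """Retrieve context for the query and ask the model to answer from it."""
    context = "\n\n".join(retrieve(query, index, chunks))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content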
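
The retrieval code above relies on a pre-normalised `index`: a matrix with one unit-length embedding per chunk, so that a plain dot product gives cosine similarity. The following is a minimal sketch of how such an index could be built and queried. To stay runnable offline it assumes toy word-count vectors instead of a real embedding model; `build_vocab`, `toy_embed`, `normalise_rows`, and `top_k` are hypothetical helpers, not part of any library.

```python
import numpy as np


def build_vocab(texts: list[str]) -> dict[str, int]:
    """Assign each distinct word its own vector dimension."""
    vocab: dict[str, int] = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def toy_embed(texts: list[str], vocab: dict[str, int]) -> np.ndarray:
    """Toy stand-in for an embedding model: per-word counts, unknown words ignored."""
    vecs = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for word in text.lower().split():
            if word in vocab:
                vecs[row, vocab[word]] += 1.0
    return vecs


def normalise_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale every row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)  # guard against all-zero rows


def top_k(query_vec: np.ndarray, index: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-scoring rows, best first."""
    scores = index @ query_vec
    return np.argsort(scores)[-k:][::-1]


chunks = [
    "rag retrieves documents before generation",
    "fine-tuning updates model weights",
    "vector stores hold chunk embeddings",
]
vocab = build_vocab(chunks)
index = normalise_rows(toy_embed(chunks, vocab))

query = "which documents does rag retrieve"
query_vec = normalise_rows(toy_embed([query], vocab))[0]
best = top_k(query_vec, index, k=1)
```

In a real pipeline the toy embedder would be replaced by calls to an embedding model, but the normalise-once, dot-product-at-query-time structure stays the same.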
