Question 1

What is an AI agent, and how does it differ from a single LLM call?

Accepted Answer

An agent is an LLM placed in a loop where it reasons, chooses and calls tools or actions, observes the results, and repeats until a goal is met, rather than producing one response and stopping. The key differences are autonomy, tool use, memory and state, and multi-step control flow driven by the model's own decisions.
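
The loop described above can be sketched in a few lines. This is a minimal illustration, not a real framework: `fake_model`, the `TOOLS` registry, and the `CALL`/`FINAL`/`OBSERVATION` text protocol are all hypothetical stand-ins for an actual LLM call and tool-calling format.

```python
from typing import Callable

# Hypothetical tool registry: the name and function are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}

def fake_model(history: list[str]) -> str:
    """Stand-in for an LLM call: chooses the next step from the history."""
    observations = [line for line in history if line.startswith("OBSERVATION")]
    if not observations:
        return "CALL add 2+3"
    return "FINAL " + observations[-1].split(" ", 1)[1]

def run_agent(goal: str, max_steps: int = 5) -> str:
    history = [f"GOAL {goal}"]
    for _ in range(max_steps):
        decision = fake_model(history)            # reason and choose
        if decision.startswith("FINAL "):
            return decision[len("FINAL "):]       # goal met: stop looping
        _, tool, arg = decision.split(" ", 2)     # parse the tool call
        result = TOOLS[tool](arg)                 # act
        history.append(f"OBSERVATION {result}")   # observe, then repeat
    return "stopped: step budget exhausted"

print(run_agent("what is 2+3?"))
```

The step budget matters in practice: a single LLM call always terminates, whereas a loop driven by the model's own decisions needs an explicit stop condition.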

Question 2

What is chain-of-thought prompting and when does it help?

Accepted Answer

Chain-of-thought (CoT) prompting instructs the model to write out intermediate reasoning steps before producing a final answer, which improves accuracy on multi-step arithmetic, logic puzzles, and compositional questions. In the original experiments the gains appeared mainly in large models (on the order of 100B parameters), with small models showing little or no benefit; it helps most on tasks where the answer space is large enough that guessing is hard.

Question 3

What is Chain-of-Thought prompting and how does it aid reasoning?

Accepted Answer

Chain-of-Thought prompting asks the model to generate intermediate reasoning steps before its final answer, either via worked examples or via instructions like "think step by step". Producing intermediate steps lets the model decompose multi-step problems and conditions the final answer on its own reasoning, improving accuracy on arithmetic, logic, and multi-hop tasks.

Question 4

What is a context window in an LLM and why does its size matter?

Accepted Answer

The context window is the maximum number of tokens an LLM can attend to in a single forward pass — both the input prompt and the model's own generated output count toward this limit. Its size determines how much prior text influences each prediction, sets a hard ceiling on document length and conversation history, and drives memory and compute costs that scale quadratically with sequence length under standard attention.

Question 5

Why is cosine similarity preferred over Euclidean distance for comparing text vectors?

Accepted Answer

Cosine similarity measures the angle between two vectors, making it invariant to vector magnitude — so a short document and a long document on the same topic score high regardless of length differences. Euclidean distance conflates directional difference with scale difference, which is misleading for sparse or length-varying text.
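
A small numeric sketch of the point above, using made-up three-dimensional vectors (real embeddings have hundreds of dimensions). `long_doc` points in the same direction as `short_doc` but is ten times larger in magnitude.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction as short_doc, 10x the magnitude
other_doc = [2.0, 1.0, 0.5]    # different direction, similar magnitude

print(cosine_similarity(short_doc, long_doc))    # approximately 1.0
print(euclidean_distance(short_doc, long_doc))   # large, driven by scale alone
print(euclidean_distance(short_doc, other_doc))  # smaller, despite the different direction
```

Euclidean distance ranks `other_doc` as closer to `short_doc` than `long_doc` is, purely because of scale. If vectors are normalized to unit length first, the two measures produce the same ranking.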

Question 6

What are embeddings, and how do you measure similarity between them for vector search?

Accepted Answer

Embeddings are dense vectors that map text or other data into a geometric space where semantically similar items are close together. Vector search ranks candidates by similarity, most commonly cosine similarity or dot product and sometimes Euclidean distance, retrieving the nearest vectors to a query embedding.

Question 7

How does an LLM generate text — what is next-token prediction and autoregression?

Accepted Answer

An LLM generates text one token at a time by computing a probability distribution over its entire vocabulary for the next token, sampling from that distribution, appending the result, and repeating — a process called autoregression. Each new token is conditioned on all previously generated tokens, so the output at step N is only as good as the choices made at steps 1 through N-1.
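
The generation loop can be illustrated with a toy model. The probability table below is invented and conditions only on the last token, whereas a real LLM conditions on the whole prefix; the loop structure (predict a distribution, pick a token, append, repeat) is the part being shown.

```python
import random

# Toy "model": hand-written next-token probabilities (illustrative only).
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
}

def generate(max_tokens=10, greedy=False, seed=0):
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]       # distribution for the next token
        if greedy:
            next_token = max(dist, key=dist.get)  # always take the most likely token
        else:
            next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if next_token == "</s>":                  # end-of-sequence token stops generation
            break
        tokens.append(next_token)                 # the choice becomes part of the context
    return tokens[1:]

print(generate(greedy=True))  # ['the', 'cat', 'sat']
print(generate(seed=1))       # a sampled sequence; varies with the seed
```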

Question 8

What are n-grams and when should you use them in NLP?

Accepted Answer

An n-gram is a contiguous sequence of n tokens from text — bigrams capture two-word phrases, trigrams capture three. They add local word-order context to bag-of-words models, improving tasks like language modelling, spell-checking, and text classification where short phrases are discriminative.
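
Extracting n-grams is a sliding window over the token list. A minimal sketch, assuming whitespace tokenization for simplicity:

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie is not good".split()
print(ngrams(tokens, 2))
# [('this', 'movie'), ('movie', 'is'), ('is', 'not'), ('not', 'good')]
```

The bigram `('not', 'good')` is the kind of short discriminative phrase that a plain bag-of-words model would split into two unrelated features.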

Question 9

What prompt engineering techniques should every LLM practitioner know?

Accepted Answer

The core toolkit is: system prompts (role and constraints), few-shot examples (format and tone anchoring), chain-of-thought (step-by-step reasoning), and output constraints (JSON schema, stop sequences). Used together, they cover most of the distance between a capable base model and a production-ready feature.

Question 10

What is Retrieval-Augmented Generation (RAG) and how does a basic RAG pipeline work?

Accepted Answer

RAG augments an LLM by retrieving relevant documents from an external knowledge store at query time and feeding them into the prompt as grounding context. A basic pipeline chunks and embeds documents into a vector store, retrieves the top-k most similar chunks for a query, and the LLM generates an answer conditioned on them, reducing hallucination and keeping knowledge current.
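
The retrieve-then-prompt flow can be sketched without any external services. Everything here is a simplification: `embed` is a bag-of-words stand-in for a real embedding model, the "vector store" is a plain list, and the final LLM call is omitted, so the sketch stops at the assembled prompt.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in embedding: word counts. A real pipeline would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Place the retrieved chunks in the prompt as grounding context."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "refunds are issued within 14 days of purchase",
    "shipping takes 3 to 5 business days",
    "the office is closed on public holidays",
]
print(build_prompt("when are refunds issued", chunks, k=1))
```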

Question 11

What is sequence padding and why is it necessary for batch training?

Accepted Answer

Padding adds dummy tokens to shorter sequences so all examples in a batch share the same length, which is required for tensor operations. Attention masks tell the model to ignore padded positions, preventing them from contributing to loss or attention scores.
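
A minimal sketch of right-padding a batch and building the matching attention mask. The pad id of 0 is an assumption; real tokenizers define their own pad token id.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

The mask, not the pad id itself, is what tells the model which positions to ignore, so a pad id that collides with a real token id is harmless as long as the mask is applied.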

Question 12

What is the difference between stemming and lemmatization?

Accepted Answer

Stemming strips suffixes using heuristic rules and may produce non-words ("studies" becomes "studi"), while lemmatization uses a vocabulary and morphological analysis to return the canonical dictionary form ("studies" becomes "study"). Lemmatization is slower but returns real dictionary forms for the words it knows, making it preferable when interpretability matters.

Question 13

What are stop words and when should you remove them?

Accepted Answer

Stop words are high-frequency function words — 'the', 'is', 'at', 'which' — that typically carry little discriminative content for tasks like classification or retrieval. Removing them reduces vocabulary size and noise, but for tasks like sentiment analysis or question answering, some function words can be semantically important and should be kept.

Question 14

How do you reliably get structured outputs (JSON, typed objects) from an LLM?

Accepted Answer

Modern APIs offer constrained decoding: token sampling is restricted to tokens that are valid continuations under a JSON schema. Combined with Pydantic validation in application code, this removes most of the JSON-parsing errors that plagued earlier prompt-only approaches, though truncated outputs and semantically wrong values still need handling. When constrained decoding is unavailable, few-shot examples plus output parsing with retry is the fallback.
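
The parse-with-retry fallback can be sketched with the standard library alone. This is an illustrative pattern, not any provider's API: `call_model` is a hypothetical callable that takes a prompt and returns text, and the key check is a lightweight stand-in for fuller schema validation such as Pydantic.

```python
import json

def parse_with_retry(call_model, prompt, required_keys, max_attempts=3):
    """Call the model until its output parses as a JSON object with the required keys."""
    last_error = None
    for _ in range(max_attempts):
        # On a retry, feed the previous error back so the model can correct itself.
        full_prompt = prompt if last_error is None else f"{prompt}\nPrevious output was invalid: {last_error}"
        raw = call_model(full_prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = set(required_keys) - set(data)
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
            last_error = err
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")

def make_flaky_model():
    # Hypothetical model stub: first reply is malformed, second is valid.
    replies = iter(['Sure! {"name": "Ada"', '{"name": "Ada", "age": 36}'])
    return lambda prompt: next(replies)

print(parse_with_retry(make_flaky_model(), "Return JSON with name and age.", {"name", "age"}))
```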

Question 15

How do temperature, top-k, and top-p sampling control LLM generation?

Accepted Answer

Temperature rescales the logits before softmax: low values sharpen the distribution toward greedy deterministic output and high values flatten it for more randomness. Top-k restricts sampling to the k most likely tokens, and top-p or nucleus sampling restricts it to the smallest set of tokens whose cumulative probability exceeds p, both trimming the unlikely tail.
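
The three controls can be written as small functions over a list of logits or probabilities. This is a plain-Python sketch for clarity; production inference code applies the same ideas to tensors.

```python
import math

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out every index not in keep and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:                         # add tokens until the nucleus reaches mass p
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

logits = [2.0, 1.0, 0.0]
print(softmax(logits, temperature=0.5))  # sharper: more mass on the top token
print(softmax(logits, temperature=2.0))  # flatter: mass spread more evenly
print(top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9))  # drops only the 0.05 tail token
```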

Question 16

What is TF-IDF and how does it improve on raw bag-of-words counts?

Accepted Answer

TF-IDF weights each term by how often it appears in a document (TF) scaled down by how common it is across the whole corpus (IDF), so words that are frequent everywhere — like 'the' — get low scores while distinctive terms get high scores. This makes document vectors more informative than raw counts for retrieval and classification.
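
A worked sketch of the weighting on a three-document toy corpus. It uses the plain textbook form, tf = count / document length and idf = log(N / document frequency); libraries often apply smoothed variants, so their exact numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
weights = tf_idf(docs)
print(weights[0]["the"])  # 0.0, because "the" appears in every document
print(weights[0]["mat"] > weights[0]["cat"])  # True: "mat" is rarer across the corpus
```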

Question 17

What are tokens in an LLM and why is API pricing per token rather than per word or character?

Accepted Answer

A token is the smallest unit a language model processes: typically a word, sub-word fragment, or punctuation mark produced by byte-pair encoding (BPE) or a similar algorithm. Pricing is per token because each token occupies one position in the sequence the model processes, so compute and memory cost track token count regardless of whether a token maps to a full word or a single letter.

Question 18

What do 'parameters' mean in a language model and what do they actually store?

Accepted Answer

Parameters are the learnable floating-point numbers — weights and biases — that define a neural network's behaviour. In a transformer LLM, they are distributed across token embedding matrices, multi-head attention projection matrices (Q, K, V, O), and feed-forward network layers. They encode compressed statistical associations between tokens learned during training, not explicit facts or rules.

Question 19

What is a vector database and how does it enable semantic retrieval?

Accepted Answer

A vector database stores dense numerical embeddings alongside their source documents and uses approximate nearest-neighbor (ANN) algorithms to find the most semantically similar entries for a query vector in milliseconds. Unlike a keyword index, similarity is measured in geometric space so synonyms and paraphrases match naturally. Common choices include Pinecone, Weaviate, Qdrant, and pgvector for Postgres.

Question 20

What is Retrieval-Augmented Generation (RAG) and why is it used?

Accepted Answer

RAG couples a retrieval step — fetching relevant documents from an external store — with a generative model so the LLM can answer questions about knowledge it was never trained on. It solves the stale-knowledge and hallucination problems without retraining. The pattern is preferred when the knowledge base changes frequently or contains proprietary data.

Question 21

What is tokenization in NLP and why does it matter?

Accepted Answer

Tokenization splits raw text into discrete units — words, subwords, or characters — that a model can process numerically. The strategy chosen controls vocabulary size, out-of-vocabulary rate, and how well the model handles rare or morphologically complex words.

Question 22

Why do dense word embeddings outperform one-hot vectors?

Accepted Answer

One-hot vectors are high-dimensional, sparse, and treat all words as equidistant — they carry zero semantic information. Dense embeddings place similar words close together in a low-dimensional space, enabling models to generalize from seen words to unseen but related ones.

Question 23

What is the difference between encoder models like BERT and decoder models like GPT?

Accepted Answer

BERT is an encoder-only transformer that reads all tokens bidirectionally and is trained with masked language modelling — ideal for tasks requiring a rich contextual representation of an entire sequence, like classification or NER. GPT is a decoder-only transformer that attends only to previous tokens via a causal mask and is trained with next-token prediction — ideal for text generation. Encoder-decoder models like T5 combine both for tasks that map one sequence to another.

Question 24

How does Byte-Pair Encoding (BPE) tokenization work?

Accepted Answer

BPE starts with a character-level (or byte-level) vocabulary and iteratively merges the most frequent adjacent pair of symbols until a target vocabulary size is reached. The resulting subword units handle rare and unseen words gracefully: any word can be split into known pieces, and byte-level variants avoid out-of-vocabulary tokens altogether.
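
The merge loop can be sketched on a tiny word list. This is a simplified character-level version for illustration: it omits end-of-word markers and byte-level handling, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of pair with the concatenated symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    corpus = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break                                    # nothing left to merge
        best = max(pair_counts, key=pair_counts.get) # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(encode("slow", merges))  # ('s', 'low'): an unseen word split into known pieces
```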

Question 25

What is catastrophic forgetting and how does parameter-efficient fine-tuning help avoid it?

Accepted Answer

Catastrophic forgetting is when fine-tuning on a new task overwrites weights and erases previously learned capabilities. Parameter-efficient methods like LoRA freeze the base weights and train only small added parameters, preserving the original knowledge while adapting behavior, and techniques like lower learning rates, replay data, and adapter isolation further reduce forgetting.

Question 26

What causes hallucinations in LLMs and how do you mitigate them?

Accepted Answer

Hallucinations are fluent but unsupported or false outputs, arising because LLMs predict likely text rather than retrieve verified facts and have no built-in grounding. Mitigations include retrieval-augmented grounding with citations, constraining the model to answer only from provided context, lower temperature, verification or self-check steps, and faithfulness-focused evaluation.

Question 27

What chunking strategies exist for RAG and how do you choose between them?

Accepted Answer

Chunking splits source documents into retrievable units before embedding. The right strategy depends on document structure, query style, and the model's context window. Fixed-size chunks are simple but break mid-sentence; semantic or structural chunking preserves coherence; hierarchical chunking enables parent-document retrieval for richer context.

Question 28

What are chunking strategies in RAG, and how do you choose chunk size?

Accepted Answer

Chunking splits documents into retrievable units; strategies include fixed-size windows, overlapping windows, and semantic or structure-aware splitting on sentences or sections. Smaller chunks improve retrieval precision but risk losing context, while larger chunks preserve context but dilute relevance, so chunk size and overlap are tuned to the content and the embedding model's context length.
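
Fixed-size chunking with overlap reduces to a strided window. A minimal sketch over a token list; the parameter names are illustrative, and the same logic applies to characters or sentences.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share overlap tokens with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):  # this window already reached the end
            break
    return chunks

print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing and embedding some tokens twice.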

Question 29

Compare RAG and fine-tuning. When would you use each?

Accepted Answer

RAG injects external knowledge at inference time and is best when information changes often, must be cited, or is too large to bake into weights. Fine-tuning changes model behavior, style, or format and is best for teaching new skills or domain tone; the two are complementary and often combined.

Question 30

What is constrained decoding and how does it guarantee structured outputs like valid JSON?

Accepted Answer

Constrained decoding masks the model's next-token logits at each step so only tokens permitted by a grammar or JSON schema can be sampled, guaranteeing structurally valid output without changing the model's weights. It is how structured-output and function-calling features enforce schema conformance; placing reasoning fields before answer fields lets the model think before it commits.
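
The masking step can be shown with a toy vocabulary. Everything here is hypothetical: the six-token `VOCAB`, the hand-written `ALLOWED_NEXT` table standing in for a compiled grammar, and the fixed logits that strongly prefer a chatty non-JSON token. Greedy selection is used instead of sampling to keep the example deterministic.

```python
import json
import math

VOCAB = ["{", '"ok"', ":", "true", "}", "Sure!"]

# Hypothetical grammar for the single object {"ok": true}, written as
# "tokens allowed after the previous token". None marks the start.
ALLOWED_NEXT = {None: {"{"}, "{": {'"ok"'}, '"ok"': {":"}, ":": {"true"}, "true": {"}"}}

def constrained_argmax(logits, allowed_ids):
    """Mask disallowed tokens to -inf, then take the best remaining token."""
    masked = [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

def decode(logits_per_step):
    output, previous = [], None
    for logits in logits_per_step:
        allowed = {VOCAB.index(tok) for tok in ALLOWED_NEXT[previous]}
        token = VOCAB[constrained_argmax(logits, allowed)]
        output.append(token)
        previous = token
    return "".join(output)

# The unconstrained model would emit "Sure!" at every step (index 5 has the top logit).
biased_logits = [[0.0, 0.0, 0.0, 0.0, 0.0, 5.0]] * 5
text = decode(biased_logits)
print(text)              # {"ok":true}
print(json.loads(text))  # {'ok': True}
```

The weights never change; only the set of tokens eligible at each step does, which is why the guarantee is structural rather than semantic.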

Question 31

How do you evaluate LLM outputs, and what is LLM-as-a-judge?

Accepted Answer

LLM evaluation combines reference-based metrics like BLEU and ROUGE, task benchmarks like MMLU and HumanEval, and human or model-based judgment of qualities like helpfulness and faithfulness. LLM-as-a-judge uses a strong model to score or compare outputs against a rubric, scaling human-like evaluation cheaply, but it requires care because judges show known biases (for example toward longer answers, toward the first option in pairwise comparisons, or toward their own outputs) and should be calibrated against human labels.

Question 32

How do you evaluate the quality of an LLM or RAG system?

Accepted Answer

Evaluation splits into retrieval quality (did we fetch the right chunks?) and generation quality (did the model use them correctly?). Key metrics are context precision/recall for retrieval and faithfulness plus answer relevance for generation. Frameworks like RAGAS automate LLM-as-judge scoring; human annotation anchors the ground truth.

Question 33

Explain the ReAct agent pattern and how it compares to Plan-and-Execute and Reflexion.

Accepted Answer

ReAct interleaves reasoning traces with actions step by step, deciding the next tool call based on the latest observation. Plan-and-Execute first drafts a full multi-step plan and then executes it, which is more efficient and predictable for complex tasks but less adaptive, while Reflexion adds a self-reflection step where the agent critiques past failures and retries with that feedback.

Question 34

When should you use prompt engineering versus fine-tuning to adapt an LLM?

Accepted Answer

Prompt engineering is the right starting point when the task can be described in natural language, the required knowledge already exists in the base model, and iteration speed matters — no training required. Fine-tuning is warranted when you need consistent output format at scale, domain-specific style that prompts cannot reliably impose, or when latency and token costs from long system prompts are prohibitive.

Question 35

What is GGUF, and what does a quantization tier like Q4_K_M mean?

Accepted Answer

GGUF is a single-file format for running LLMs locally, used by llama.cpp and Ollama. Unlike training-oriented formats, it packs weights, tokenizer, and metadata into one memory-mappable file optimized for inference on CPU or partial GPU. Q4_K_M describes the quantization: roughly 4 bits per weight (vs 16 for FP16), using the k-quant method, medium variant, which protects the most important tensors at higher precision. It is the community default because it keeps almost all of the model's quality at about a quarter of the FP16 size.

Question 36

How does GloVe differ from Word2Vec in learning word embeddings?

Accepted Answer

GloVe (Global Vectors) builds a global co-occurrence matrix over the entire corpus and then factorizes it, directly encoding how often pairs of words co-occur. Word2Vec uses local context windows and a prediction objective, never explicitly seeing the global statistics. GloVe tends to capture linear substructures slightly better while Word2Vec handles rare words better with negative sampling.

Question 37

What is hybrid search and why is it often better than pure vector search?

Accepted Answer

Hybrid search combines dense vector similarity with sparse keyword search such as BM25, then fuses the rankings. Dense retrieval captures semantic meaning while keyword search nails exact terms, identifiers, and rare tokens, so combining them improves recall and precision over either alone.

Question 38

What is a KV cache and how does it speed up LLM inference?

Accepted Answer

Without a cache, autoregressive generation recomputes the Keys and Values for every previous token at each step. The KV cache stores those K and V tensors so each new token computes only its own K and V and attends over the stored ones, reducing per-step cost from quadratic to linear in sequence length. The tradeoff is memory that grows in proportion to sequence length and batch size.
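
A toy single-head attention loop makes the saving concrete. This sketch counts only K and V projections, uses fixed scalar multipliers as hypothetical stand-ins for learned projection matrices, and treats each input vector as the token at that step; it is meant to show that caching changes the work done, not the result.

```python
import math

def project(x, scale):
    # Stand-in for a learned linear projection (hypothetical fixed weights).
    return [scale * v for v in x]

def attend(query, keys, values):
    """Softmax-weighted average of values, weighted by query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def generate_without_cache(inputs):
    outputs, projections = [], 0
    for t in range(len(inputs)):
        # Recompute K and V for the entire prefix at every step.
        keys = [project(x, 0.5) for x in inputs[: t + 1]]
        values = [project(x, 2.0) for x in inputs[: t + 1]]
        projections += 2 * (t + 1)
        outputs.append(attend(project(inputs[t], 1.0), keys, values))
    return outputs, projections

def generate_with_cache(inputs):
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for x in inputs:
        # Compute K and V for the new token only and append them to the cache.
        key_cache.append(project(x, 0.5))
        value_cache.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(project(x, 1.0), key_cache, value_cache))
    return outputs, projections

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
uncached_out, uncached_count = generate_without_cache(inputs)
cached_out, cached_count = generate_with_cache(inputs)
print(uncached_out == cached_out)      # True: identical outputs
print(uncached_count, cached_count)    # 20 8
```

The cache holds one K and one V vector per past token, which is the memory growth the answer describes.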

Question 39

In LlamaIndex, what are nodes and query engines, and how is RAG exposed as a tool to an agent?

Accepted Answer

In LlamaIndex a Node is a chunk of a source document with metadata and relationships, indexed for retrieval; a query engine wraps an index to take a natural-language query, retrieve relevant nodes, and synthesize an answer. RAG-as-a-tool wraps a query engine in a QueryEngineTool so an agent can call it like any other tool, deciding when to retrieve from that knowledge source as part of its reasoning loop.

Question 40

What is LoRA and how does it make fine-tuning parameter-efficient?

Accepted Answer

LoRA freezes the pretrained weights and injects small trainable low-rank matrices into selected layers, learning the weight update as their low-rank product. This trains a tiny fraction of parameters, slashing memory and storage while approximating full fine-tuning, and the adapters can be merged back at inference.
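
The low-rank update and the merge step can be checked with small hand-picked matrices. This is a plain-Python sketch: `W` is the frozen weight, `A` (r x d) and `B` (d x r) are the trainable factors, and `scale` stands in for the usual scaling factor. The dimensions in the parameter count at the end are illustrative assumptions.

```python
def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def lora_forward(x, W, A, B, scale=1.0):
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # low-rank path: project down to r, then back up
    return [b + scale * u for b, u in zip(base, update)]

def merge(W, A, B, scale=1.0):
    """Fold the adapter into the base weight: W + scale * (B @ A)."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def lora_param_count(d_in, d_out, r):
    return r * (d_in + d_out)

W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # frozen 3x3 weight
A = [[1, 0, 2]]                        # r x d, with r = 1
B = [[1], [0], [3]]                    # d x r
x = [1, 2, 3]

print(lora_forward(x, W, A, B))        # [8.0, 2.0, 24.0]
print(matvec(merge(W, A, B), x))       # same values from the merged weight

# For one hypothetical 4096 x 4096 layer at rank 8:
print(lora_param_count(4096, 4096, 8), "trainable vs", 4096 * 4096, "frozen")
```

Because the merged matrix gives the same output as the two-path forward pass, a merged adapter adds no inference latency.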

Question 41

What is the Model Context Protocol (MCP) and what problem does it solve?

Accepted Answer

MCP is an open protocol from Anthropic that standardizes how LLM applications discover and connect to external tools, data sources, and prompts through a common client-server interface. It replaces bespoke per-integration glue with a single protocol, so any MCP-compatible host can use any MCP server, and has been adopted across the broader ecosystem.

Question 42

How do multimodal vision-language models combine images and text, and what role does CLIP play?

Accepted Answer

Vision-language models encode images with a vision encoder and project those features into the language model's token space so it can reason over images and text jointly, often via a connector or projection layer. CLIP is a contrastively trained image-text model that aligns image and text embeddings in a shared space, widely used as the vision backbone or for zero-shot retrieval and grounding.

Question 43

What are out-of-vocabulary (OOV) words and how do modern NLP systems handle them?

Accepted Answer

OOV words are tokens unseen during vocabulary construction that a model cannot look up in its embedding table. Classical word-level models replace them with a generic UNK token, losing all information, while subword tokenizers (BPE, WordPiece) largely eliminate the problem by decomposing a word into known subunits; byte-level variants can encode any input, whereas character-based vocabularies can still fall back to UNK for unseen characters.

Question 44

What is the difference between pretraining, fine-tuning, instruction-tuning, and RLHF?

Accepted Answer

Pretraining teaches a model general language structure by predicting tokens across a massive corpus; fine-tuning adapts the pretrained weights to a narrower task or domain using supervised data; instruction-tuning is supervised fine-tuning specifically on (instruction, response) pairs so the model follows directives; RLHF further aligns the model to human preferences by training a reward model on ranked responses and using it as a signal for policy optimisation with PPO or a similar algorithm.

Question 45

What is prompt injection and how do you defend against it?

Accepted Answer

Prompt injection is an attack in which malicious text in retrieved documents or user input overrides the application's system instructions, redirecting the model to perform unintended actions. Defenses layer input/output validation, privilege separation, and tool-call confirmation — no single fix is sufficient.

Question 46

What is prompt injection, and what is the difference between direct and indirect injection?

Accepted Answer

Prompt injection is an attack where adversarial instructions override the system's intended behavior. Direct injection comes from the user input itself, such as "ignore previous instructions", while indirect injection hides malicious instructions in external content the model ingests, such as a web page, document, or tool output, which the model then follows.

Question 47

What are reasoning models, and what is test-time compute?

Accepted Answer

Reasoning models are trained to produce an extended chain of thought before answering, often via reinforcement learning, so they spend more computation deliberating on hard problems. Test-time compute is the idea of improving answer quality by allocating more inference-time compute, for example longer reasoning chains, sampling multiple solutions, or self-verification, rather than only scaling parameters.

Question 48

What techniques reduce LLM cost and latency in production?

Accepted Answer

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend substantially, in favorable cases by 60–90%, but verify on an evaluation set that quality does not regress.

Question 49

What is the difference between retrieval and reranking in a RAG pipeline?

Accepted Answer

Retrieval cheaply searches a large corpus and returns a candidate set, prioritizing recall. Reranking applies a more expensive query-document model to that small set and improves precision and ordering at the top. A reranker cannot recover relevant documents absent from the retrieved candidates, so evaluate first-stage recall separately.

Question 50

What is hybrid search and when should you use semantic vs keyword retrieval?

Accepted Answer

Keyword search (BM25) excels at exact term matching: product codes, proper nouns, rare abbreviations. Semantic search (dense embeddings) captures meaning and handles paraphrases. Hybrid search runs both in parallel and merges the two ranked lists, commonly with Reciprocal Rank Fusion, which combines rank positions rather than raw scores, giving the best of both worlds for most production RAG systems.
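
Reciprocal Rank Fusion is short enough to write out. Each document earns 1 / (k + rank) from every list it appears in, and the sums are sorted. The constant k = 60 is a commonly used default rather than a requirement, and the document ids below are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]   # keyword results, best first
dense_ranking = ["d3", "d1", "d4"]  # embedding results, best first

print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['d1', 'd3', 'd2', 'd4']
```

Because only rank positions enter the formula, BM25 scores and cosine similarities never need to be put on a common scale.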