Reference

Data & AI glossary

245 terms across data science, statistics, machine learning, deep learning, and AI — defined in plain English, no circular jargon.

A

A/B Testing: A controlled experiment that randomly splits users into two groups — one receiving the current version (control) and one receiving a change (treatment) — then measures whether the difference in outcome is larger than sampling noise could explain. Randomization is what lets you attribute the difference to the change rather than to pre-existing differences between users. Learn more → See also: A/B Testing for Decisions · A/B testing & experimentation
ACID: Four guarantees a database transaction must satisfy: Atomicity (the whole operation succeeds or none of it does), Consistency (data always moves from one valid state to another), Isolation (concurrent transactions do not interfere), and Durability (committed data survives crashes). Learn more → See also: Delta Lake & MERGE · Durable execution for agents
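Atomicity is the easiest of the four guarantees to see in code. The sketch below uses Python's built-in `sqlite3` module; the `accounts` table, the `CHECK` constraint, and the `transfer` helper are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two rows; either both updates apply or neither does."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Alice only has 100, so the debit violates the CHECK constraint.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The credit to `bob` runs first and succeeds, yet it is undone when the debit fails, so no half-finished transfer is ever visible.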
Activation Function: A mathematical gate applied to a neuron's summed inputs that decides how strongly the neuron fires. Without it, stacking layers would collapse to a single linear transformation; non-linear activations like ReLU or sigmoid let networks learn curves, boundaries, and hierarchical abstractions. Learn more → See also: Multi-Layer Perceptron & Activations · Perceptron & the Update Rule
Adam: An optimizer that maintains a separate adaptive step size for each weight by tracking exponentially decaying averages of both the gradient and the squared gradient. Dividing the first by the square root of the second makes it largely self-tuning across parameters of very different scales, and it (or its AdamW variant) is a common default optimizer for deep learning work. Learn more → See also: Gradient descent · Gradient Descent (One Step)
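A minimal sketch of the update for a single scalar weight, using the hyperparameter defaults commonly quoted for Adam. The `adam_step` helper and the toy objective f(w) = w² are illustrative assumptions, not taken from any particular library.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar weight and return the new state."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy problem: minimize f(w) = w**2, whose gradient is 2 * w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(w)
```

On the very first step the two corrected averages cancel, so the weight moves by almost exactly `lr` whatever the raw gradient size. That is the scale-invariance the definition describes.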
Agent: An AI system that uses a language model as its reasoning engine to plan and execute multi-step tasks, calling tools—web search, code execution, APIs—and deciding what to do next based on intermediate results. Unlike a single prompt-response exchange, an agent loops: observe, think, act, observe again, until the goal is reached. See also: What agentic AI means · AGENTS.md, Skills & Tools
Agent Trajectory: The sequence of steps an agent takes — its tool calls, arguments, results, and reasoning. Agent evaluation scores both the outcome (did it succeed) and the trajectory (right tools, valid args, no loops or unsafe actions), since a correct answer can come from a wasteful or dangerous path. See also: Evaluating agents · AGENTS.md, Skills & Tools
Aggregate Function: A function that takes a set of rows as input and returns a single summary value — such as SUM, COUNT, AVG, MIN, or MAX. Aggregates collapse many rows into one scalar per group. Learn more → See also: Aggregations & axis · Window functions
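To make "one scalar per group" concrete, here is a small sketch using Python's built-in `sqlite3` module. The `orders` table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 5.0)],
)

# Three input rows collapse to one output row per region.
rows = conn.execute(
    "SELECT region, COUNT(*), SUM(amount), AVG(amount) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 2, 40.0, 20.0), ('west', 1, 5.0, 5.0)]
```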
Airflow: An open-source workflow orchestration platform where pipelines are defined as Directed Acyclic Graphs (DAGs) in Python code. Each node in the DAG is a task; Airflow schedules, monitors, and retries tasks, providing a web UI to visualize pipeline runs. Learn more → See also: Pipeline orchestration · Kubeflow Pipelines
API: Application Programming Interface — a defined contract that lets two programs exchange data or trigger actions without either needing to know the other's internal code. In data work, 'calling an API' almost always means sending an HTTP request to a web service and receiving structured data back. The API hides complexity: you ask for weather data; the service figures out how to retrieve and format it. See also: A2A — Agent2Agent Protocol · FastAPI
apply vs vectorize: `DataFrame.apply()` runs a Python function row-by-row or column-by-column — flexible but slow because Python overhead accumulates over thousands of calls. NumPy's `np.vectorize()` is similar in spirit but still loops under the hood. For real speed, replace both with a built-in pandas or NumPy operation that is already implemented in C. Reserve `apply` for logic too complex to express as a vectorised expression. See also: Pandas UDFs · Why NumPy
ARIMA: AutoRegressive Integrated Moving Average — a classical statistical model for forecasting a univariate time series using its own past values (AR), past forecast errors (MA), and a differencing step (I) to remove trends and make the series stationary. ARIMA(p, d, q) notation specifies the number of autoregressive lags p, differencing operations d, and moving-average terms q. Learn more → See also: SARIMA (seasonal) · Autoregression (AR)
Autoencoder: A neural network trained to compress an input through a narrow bottleneck layer and then reconstruct it, forcing the network to distill the most essential information. The bottleneck layer's activations form a compact latent representation used for tasks like denoising, anomaly detection, and as the backbone of more advanced generative models. See also: VAEs from scratch · The Transformer Architecture

B

Backpropagation: The algorithm that trains neural networks by computing how much each weight contributed to the final error, then nudging every weight in the direction that reduces that error. It works by applying the chain rule of calculus backward through every layer, from the loss all the way to the first layer's weights. See also: Backpropagation foundations · Backpropagation (One Step)
Basis: A minimal set of vectors that is both independent and spans the space — a non-redundant coordinate system. The number of vectors in a basis is the dimension of the space. See also: Independence, Span, Basis & Dimension · Rank, independence & basis
Batch Normalization: A technique that standardizes each layer's activations to zero mean and unit variance across the current mini-batch, then applies a learned per-feature scale and shift. This stabilizes training, often allowing higher learning rates and making the network less sensitive to weight initialization; the original explanation was reduced internal covariate shift, though later work attributes much of the benefit to a smoother loss landscape. Learn more → See also: Batch size ↔ learning rate · Vanishing & exploding gradients
Batch Processing: A data processing model where a bounded set of records is accumulated over a period and then processed all at once as a single job. Batch jobs maximize throughput and are ideal for scheduled reporting but introduce latency equal to the collection interval. See also: Batch vs real-time inference · Queues & batch pipelines
Batch Size: The number of training examples processed together before the model's weights are updated once. Larger batches give more accurate gradient estimates but require more memory; smaller batches introduce noise that can help escape local minima and often generalize better. See also: Batch size ↔ learning rate · Distillation
Bayes' Theorem: A formula for updating a prior belief in light of new evidence: the posterior probability of a hypothesis is proportional to its prior probability times the likelihood of the evidence given that hypothesis. It formalizes rational belief revision and underpins Bayesian inference, spam filters, and probabilistic classifiers like Naive Bayes. Learn more → See also: Bayes' Theorem · Conditional & Total Probability
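A worked example helps show how strongly the prior matters. The numbers below (a 1% base rate, a test that flags 95% of true cases and 5% of non-cases) are invented for illustration, and `posterior` is a hypothetical helper.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """P(H | E) for a yes/no hypothesis H, via Bayes' theorem."""
    # Total probability of seeing the evidence at all.
    evidence = p_evidence_given_h * prior + p_evidence_given_not_h * (1 - prior)
    return p_evidence_given_h * prior / evidence


p = posterior(prior=0.01, p_evidence_given_h=0.95, p_evidence_given_not_h=0.05)
print(round(p, 3))  # 0.161
```

Even with a fairly accurate test, a positive result lifts the probability only to about 16%, because true cases are rare to begin with.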
BERT: A transformer encoder pre-trained on masked language modeling—predicting randomly hidden words using both left and right context—plus next-sentence prediction. Its bidirectional representations set a new standard on nearly every NLP benchmark when released, and fine-tuning BERT on a task-specific dataset requires far less labeled data than training from scratch. Learn more → See also: Sentence embeddings: SBERT · The Transformer Architecture
Bias-Variance Tradeoff: The fundamental tension in model design: a model with high bias is too rigid and misses real patterns (underfits), while one with high variance is too sensitive to training noise (overfits). The goal is finding the sweet spot where expected error (bias squared plus variance, on top of irreducible noise that no model can remove) is minimized. Learn more → See also: The Bias-Variance Trade-off · Bias–variance & learning curves
BPE: Byte-Pair Encoding is a tokenization algorithm that starts with individual characters and iteratively merges the most frequent adjacent pair into a new token, repeating until the vocabulary reaches a target size. The result represents common words as single tokens and rare words as sequences of subword pieces, making it efficient across many languages. See also: Tokenization & BPE · Sentence embeddings: SBERT
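The merge loop can be sketched in a few lines. This is a simplified character-level version on a made-up word-frequency table; production tokenizers add byte-level handling, pre-tokenization, and special tokens that are omitted here.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Hypothetical corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # the target vocabulary size decides how many merges to run
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three merges the frequent suffix `est` is a single token, while rarer words are still spelled out from smaller pieces.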
Broadcast Join: A join strategy in which a small table is copied in full to every worker node so that the large table's rows can be matched locally without any shuffle. It eliminates expensive network data movement when one side of the join fits comfortably in memory. See also: Joins in Spark · Joins & Division
Broadcasting: NumPy's rule for performing arithmetic between arrays of different shapes by 'stretching' the smaller array along dimensions of size 1. Subtracting a column mean from every row of a matrix — `matrix - means` — works automatically without writing a loop or manually repeating the means. Broadcasting makes data normalisation concise and efficient. Learn more → See also: Broadcasting · dot, matmul, @

C

Cardinality: The number of distinct values in a column relative to the total number of rows. High-cardinality columns like user IDs have nearly as many unique values as rows; low-cardinality columns like country codes have very few. Cardinality guides index design, join strategy selection, and partition pruning. See also: Rank, Nullity & Solution Sets · Keys & Integrity Constraints
CDC (Change Data Capture): A technique for detecting and capturing row-level inserts, updates, and deletes from a source database as they happen, typically by reading the database's write-ahead log. Downstream systems receive a continuous, ordered stream of changes rather than periodic full-table snapshots. Learn more → See also: Slowly Changing Dimensions · Normalization, Discretization, Sampling, Compression
Central Limit Theorem: A foundational result stating that the mean of a large enough sample of independent draws will be approximately normally distributed, regardless of the shape of the original population distribution, provided that distribution has finite variance. This is why so many statistical tests can use normal approximations for sample means even when the underlying data is skewed or discrete. Learn more → See also: Central Limit Theorem & Confidence Intervals · Normal & Standard Normal
Chain-of-Thought: A prompting strategy that asks a language model to write out its intermediate reasoning steps before giving a final answer, mimicking how a person 'shows their work.' The explicit reasoning trace improves accuracy on multi-step arithmetic, logic, and common-sense problems where jumping straight to an answer fails. See also: Tree of Thoughts · Few-shot & chain-of-thought
Closure: A nested function that remembers the variables from its enclosing scope even after that outer function has returned. It is how Python implements stateful callbacks without a class. Decorators rely on closures to 'wrap' a target function and keep a reference to it. See also: Decorators · Recursion & the Call Stack
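A short sketch of a closure holding state between calls; `make_counter` is a hypothetical name used only for this example.

```python
def make_counter(start=0):
    count = start  # lives in the enclosing scope

    def increment():
        nonlocal count  # rebind the enclosing variable instead of creating a local
        count += 1
        return count

    return increment


counter = make_counter()
first = counter()
second = counter()
print(first, second)  # 1 2
```

`make_counter` has already returned by the time `counter()` is called, yet `count` survives because the inner function keeps a reference to it.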
CNN: A Convolutional Neural Network applies small learned filters that slide across an input—typically an image—to detect local patterns such as edges, textures, and shapes regardless of where they appear. Stacking convolution layers followed by pooling lets the network build up from low-level features to high-level semantic concepts. See also: Convolutional neural networks · Vision Transformers (ViT)
Collaborative Filtering: A recommendation technique that predicts a user's preferences by finding other users with similar taste and recommending items those users liked. It requires no knowledge of item content — just the history of who liked what. The key weakness is the cold-start problem: new users or new items with no interaction history receive poor recommendations. Learn more → See also: User-based collaborative filtering · Item-based collaborative filtering
Columnar Storage: A physical file layout that stores all values of a single column together on disk rather than grouping columns by row. Analytical queries that touch only a few columns read far less data, and values within a column compress much more efficiently than mixed-type rows. Learn more → See also: Star vs Snowflake Schemas · Warehouse, Lake & Lakehouse
Compaction: Summarizing an agent's old conversation turns into a compact note and dropping the raw history, sharply lowering token count while preserving the gist. Reported to cut tokens ~84% over a long (100-turn) run, preventing context overflow and rot. See also: Context engineering · Code execution with MCP
Confidence Interval: A range constructed from sample data such that, if you repeated the study many times, the interval would contain the true population parameter in a specified percentage of runs (e.g., 95%). A wide interval signals high uncertainty; a narrow one signals a precise estimate. Learn more → See also: Estimation & confidence intervals · Central Limit Theorem & Confidence Intervals
Confused Deputy: A security flaw where a low-privilege agent tricks a high-privilege agent into performing an action on its behalf, so the trusted agent misuses authority it shouldn't lend. Combined with prompt injection, it's a primary multi-agent breach path; defended with per-hop identity verification and scoped delegated tokens. See also: Agent Security · MCP tool poisoning & supply-chain security
Confusion Matrix: For a binary classifier, a table that breaks predictions into four cells (true positives, true negatives, false positives, and false negatives), giving a complete picture of where the classifier succeeds and fails; with more than two classes it becomes a grid of actual versus predicted labels. Every classification metric (precision, recall, F1, accuracy) can be derived from it. Learn more → See also: Confusion Matrix, Precision, Recall, ROC · ML-specific plots
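The derivation of the common metrics from the four cells can be sketched directly. The labels below are made up, and `confusion_counts` is a hypothetical helper for the binary case with 1 as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)                          # of predicted positives, how many were right
recall = tp / (tp + fn)                             # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / len(y_true)

print(tp, fp, fn, tn)  # 2 1 2 3
```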
Constrained Decoding: Generating text under a grammar or finite-state machine that masks every illegal next token to zero probability before sampling, guaranteeing the output matches a schema (e.g. valid JSON). Engines like XGrammar make it near-zero overhead; it underlies structured-output modes and reliable tool calling. See also: Constrained decoding · Speculative Decoding
Context Engineering: Curating what is in an agent's context window — via compaction, isolation, and just-in-time retrieval — so a long-running agent stays lean, coherent, and affordable. The headline 2026 agent skill, distinct from prompt wording. See also: Context engineering · AGENTS.md, Skills & Tools
Context Window: The maximum number of tokens—prompt plus generated output combined—that a transformer model can process in one forward pass, determined at training time. Text beyond this limit is simply invisible to the model; expanding context windows is an active research area because longer contexts enable document-level reasoning and multi-turn memory. See also: Context engineering · Multi-token prediction
Continuous Batching: An LLM-serving scheduler that works at the token-step level: the instant a sequence finishes it frees its slot and a waiting request takes over, so the GPU never idles waiting for the longest request. Together with PagedAttention it gives roughly 10–20x the throughput of static batching. See also: KV cache & continuous batching · Load balancing LLM inference
Convexity: A function is convex if the chord between any two points lies on or above its graph: a single bowl in which every local minimum is also a global one. Convex losses (squared error for linear regression, log loss for logistic regression, hinge loss for SVMs) train reliably from any start. See also: Convexity · Convexity & Single-Variable Optimization
Convolution: A mathematical operation where a small filter matrix is slid across an input grid, computing a dot product at every position to produce a feature map that highlights where the filter's pattern appears. In neural networks, the filter values are learned during training rather than designed by hand. See also: Convolutional neural networks · Matrix multiplication
Correlation: Covariance normalized by the two standard deviations, landing in [−1, 1]. +1 is a perfect upward line, −1 downward, 0 no linear relationship. It measures only linear association, not causation. See also: Covariance, Correlation & Total Expectation · Covariance & correlation
Cosine Similarity: A measure of how similar two vectors are, computed as the cosine of the angle between them, ranging from −1 (opposite) through 0 (unrelated) to 1 (identical direction). In recommenders and NLP, items or documents are represented as vectors and cosine similarity finds the closest matches regardless of their magnitude, so a short and a long document about the same topic still score highly. Learn more → See also: Norms & distances · Embeddings
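A plain-Python sketch of the formula, dot(a, b) / (|a| · |b|). The two "document" vectors are invented to show that length does not affect the score.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


short_doc = [1, 2, 0]    # hypothetical term counts
long_doc = [10, 20, 0]   # same topic mix, ten times longer

same_direction = cosine_similarity(short_doc, long_doc)
orthogonal = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 2], [-1, -2])
```

Here `same_direction` is 1 up to floating-point rounding, `orthogonal` is 0, and `opposite` is −1 up to rounding.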
Covariance Matrix: A square matrix holding every pair of feature covariances — variances on the diagonal, covariances off it. Symmetric and positive semi-definite; its eigenvectors are the principal directions PCA finds. See also: Covariance & correlation · PCA & Dimensionality Reduction
Cross-Entropy: The average cost of encoding the true distribution p with a model q, −Σ p log q. Minimized when q = p; it is exactly the classification loss and equals entropy plus the KL divergence. See also: Entropy & information theory · KL Divergence
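The identity "cross-entropy equals entropy plus KL divergence" is easy to check numerically. The two distributions below are arbitrary examples, and natural logarithms are assumed (so the units are nats rather than bits).

```python
import math


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.7, 0.2, 0.1]  # "true" distribution
q = [0.5, 0.3, 0.2]  # model's distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(abs(lhs - rhs) < 1e-12)  # True
```

Because KL divergence is never negative, `cross_entropy(p, q)` can only equal `entropy(p)` when `q` matches `p`.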
Cross-Validation: A technique for estimating how well a model generalizes by splitting the data into multiple folds, training on some folds and evaluating on the held-out fold, then rotating which fold is held out. The resulting average score is far more reliable than a single train-test split because every example gets a turn as test data. Learn more → See also: Cross-Validation: k-fold, LOO, Stratified · Model selection & nested CV
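The fold rotation can be sketched without any library. `k_fold_indices` is a hypothetical helper; it assumes the data is already shuffled and ignores stratification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), test)
```

Each index lands in exactly one test fold, which is what "every example gets a turn as test data" means in practice.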
CTE (Common Table Expression): A named, temporary result set defined at the top of a query with the WITH keyword, scoped to that single statement. CTEs let you break a complex query into readable, named building blocks rather than nesting subqueries. Learn more → See also: Recursive CTEs · Subqueries
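A small sketch of a CTE run through Python's built-in `sqlite3` module. The `sales` table, the `customer_totals` name, and the 200 threshold are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("ana", 120.0), ("ana", 80.0), ("ben", 40.0), ("cy", 300.0)],
)

query = """
WITH customer_totals AS (      -- named building block, visible only in this statement
    SELECT customer, SUM(amount) AS total
    FROM sales
    GROUP BY customer
)
SELECT customer, total
FROM customer_totals
WHERE total >= 200
ORDER BY customer
"""
big_spenders = conn.execute(query).fetchall()
print(big_spenders)  # [('ana', 200.0), ('cy', 300.0)]
```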

D

Data Cleaning: The process of detecting and correcting errors, inconsistencies, and missing values in a raw dataset so it is fit for analysis. Typical tasks include standardising date formats, fixing misspelled category labels, removing duplicate rows, and handling nulls. Analysts are often said to spend 60–80% of project time here, and the effort pays off because model quality cannot exceed data quality. See also: Normalization, Discretization, Sampling, Compression · Pretraining data: curation & dedup
Data Lake: A storage repository that holds raw data in its native format — structured, semi-structured, or unstructured — at virtually unlimited scale and low cost. Schema is applied only when the data is read, giving flexibility at the expense of governance if left unmanaged. Learn more → See also: Delta Lake & MERGE · Star vs Snowflake Schemas
Data Leakage: The accidental inclusion of information from the future or from the test set into the training process, causing a model to look better than it really is. Classic examples include scaling features using test-set statistics or including a column that is only available after the prediction target is already known. Learn more → See also: Training-Serving Skew · Differential privacy for LLMs
Data Modeling: The discipline of deciding how to represent real-world entities and their relationships as database tables, columns, and constraints. A good model makes queries intuitive, prevents data anomalies, and reflects the way the business actually uses the data. Learn more → See also: ER Model & Mapping to Relations · The Relational Model
Data Pipeline: A sequence of automated steps that moves data from one or more sources through transformations to a destination. Each step's output is the next step's input, creating a reliable, repeatable flow from raw source to consumable dataset. See also: Change Data Capture (CDC) · Orchestration: Airflow & DAGs
Data Skew: An uneven distribution of data across partitions in a distributed system, where one or a few partitions hold far more rows than the rest. Skew causes some workers to finish in seconds while others process for hours, making them the bottleneck for the entire job. See also: Skew & salting · Training-Serving Skew
Data Warehouse: A central repository that integrates structured, cleansed data from multiple operational systems, organized for analytical queries rather than transactional writes. Warehouses use column-oriented storage and schemas like star or snowflake to accelerate business intelligence workloads. Learn more → See also: Star vs Snowflake Schemas · Reverse ETL
Data Wrangling: Reshaping and transforming raw data into the structure a model or analysis requires — broader than cleaning alone. Wrangling includes merging tables from multiple sources, pivoting from long to wide format, computing derived columns, and encoding categorical variables. The term emphasises the messy, iterative nature of real-world data preparation. See also: Normalization, Discretization, Sampling, Compression · pivot, melt, stack
Dataclass: A class decorated with `@dataclass` that auto-generates boilerplate methods — `__init__`, `__repr__`, `__eq__` — from the field annotations you declare. It keeps data-container code concise and readable. Dataclasses are ideal for typed record objects such as a `Transaction(amount: float, currency: str)` without writing repetitive constructor code. Learn more → See also: DataFrame basics · Classes & Instances
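The `Transaction` record mentioned above can be written out as a short sketch; the field values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float
    currency: str


a = Transaction(19.99, "USD")
b = Transaction(19.99, "USD")

print(a)       # Transaction(amount=19.99, currency='USD')
print(a == b)  # True: the generated __eq__ compares field by field
```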
DataFrame: The core pandas data structure: a two-dimensional table with labelled rows and columns, like a spreadsheet you can manipulate with code. Each column is a Series sharing a common row index. DataFrames support SQL-style queries, joins, aggregations, and direct import from CSV, Excel, JSON, and databases. Learn more → See also: DataFrame intro · Series
Decision Tree: A model shaped like a flowchart that splits data by asking a series of yes/no questions about features, assigning a prediction at each leaf. Trees are highly interpretable but prone to overfitting; in practice they are almost always combined into ensembles like random forests or gradient-boosted trees. Learn more → See also: Decision Trees · Decision Trees: Entropy, Gini & Info Gain
Decoder: The portion of a model that generates output one step at a time, attending to both previously generated tokens and (in encoder-decoder architectures) the encoder's representation of the input. GPT-style language models are decoder-only, auto-regressively predicting the next token until a stopping condition is met. Learn more → See also: Speculative Decoding · Sequence-to-sequence models
Decorator: A function that wraps another function to add behaviour — logging, timing, authentication — without touching the original code. In Python it is written as `@my_decorator` on the line above a function definition. Decorators are a clean way to separate cross-cutting concerns from core logic. Learn more → See also: Dunder Methods · FastAPI
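A minimal sketch of a decorator that counts calls. `log_calls` and `add` are hypothetical names; `functools.wraps` is the standard-library helper that copies the wrapped function's name and docstring onto the wrapper.

```python
import functools


def log_calls(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1           # cross-cutting concern lives here
        return func(*args, **kwargs)  # core logic is untouched

    wrapper.calls = 0
    return wrapper


@log_calls
def add(a, b):
    return a + b


result = add(2, 3)
print(result, add.calls, add.__name__)  # 5 1 add
```

Writing `@log_calls` above `add` is shorthand for `add = log_calls(add)`.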
Denormalization: The deliberate introduction of redundancy into a schema by merging or pre-joining tables so read queries need fewer joins. It trades write complexity and storage for faster analytical query performance. See also: Normalization, Discretization, Sampling, Compression · Lossless-Join vs Dependency-Preservation
Dictionary: Python's built-in key-value store, written as `{key: value, ...}`. Looking up a value by key is O(1) on average, making dicts the go-to structure for fast lookups, counting frequencies, and grouping data. Keys must be hashable, which in practice means immutable values such as strings, numbers, or tuples of hashable items (lists and dicts cannot be keys); values can be anything. See also: Dictionaries · Hash Tables & Dicts
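A short sketch of the frequency-counting pattern and of the restriction on keys; the word list is made up.

```python
words = ["spark", "sql", "spark", "pandas", "sql", "spark"]

# Count frequencies: .get() supplies 0 the first time a word is seen.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
print(counts)  # {'spark': 3, 'sql': 2, 'pandas': 1}

# Tuples work as keys because they are hashable.
grid = {(0, 0): "origin"}

# Lists are mutable and unhashable, so using one as a key raises TypeError.
try:
    bad = {[0, 0]: "origin"}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
```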
Diffusion Model: A generative model trained to reverse a gradual noise-adding process: it learns to predict and remove small amounts of Gaussian noise step by step until a clean sample emerges from pure noise. This iterative denoising produces strikingly high-quality images and audio and underpins tools like Stable Diffusion and DALL·E 3. Learn more → See also: Diffusion models (DDPM) · Latent diffusion & Stable Diffusion
Dimension Table: A table in a dimensional model that describes the who, what, where, and when of events stored in a fact table. Dimension tables are relatively small, change slowly, and hold descriptive attributes like customer name, product category, or store region. Learn more → See also: Slowly Changing Dimensions · The Relational Model
Dimensionality: The number of features (columns) in a dataset. High dimensionality creates the 'curse of dimensionality': data points become sparse in high-dimensional space, distances lose meaning, and models overfit. Techniques like PCA, feature selection, and embeddings reduce dimensionality to keep models accurate and computationally tractable. See also: Curse of Dimensionality · PCA & dimensionality reduction
Dropout: A regularization technique that randomly deactivates a fraction of neurons on each training step, forcing the network to learn redundant representations rather than relying on any single pathway. At inference time, all neurons are kept active and activations are rescaled so their expected magnitude matches training (modern frameworks apply this scaling during training instead, so inference needs no change), giving an ensemble-like effect without the cost of training multiple models. Learn more → See also: Multi-Layer Perceptron & Activations · Ridge Regression & Regularization
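A plain-Python sketch of the mechanism. It assumes the "inverted" variant used by most modern frameworks, where surviving activations are scaled up during training so that inference is a no-op; the `dropout` helper and its arguments are hypothetical.

```python
import random


def dropout(activations, p_drop, training, rng):
    """Inverted dropout: zero each unit with probability p_drop, rescale survivors."""
    if not training or p_drop == 0.0:
        return list(activations)  # inference: every unit is kept, unchanged
    keep = 1.0 - p_drop
    # Dividing by `keep` preserves the expected value of each activation.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = dropout([1.0] * 10_000, p_drop=0.5, training=True, rng=rng)
mean = sum(sample) / len(sample)
print(round(mean, 2))
```

Each output is either 0.0 or 2.0, and the mean stays close to the original 1.0, which is why no extra scaling is needed at inference under this variant.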

E

EDA (Exploratory Data Analysis): An open-ended investigation of a new dataset using summary statistics, distributions, and visualisations to understand its structure, spot anomalies, and form hypotheses before building any model. EDA surfaces problems — wrong data types, unexpected outliers, class imbalances — that would silently ruin a model trained without it. It is the non-negotiable first step in any data project. See also: Heatmaps & pairplot · Why visualization matters
ELT (Extract, Load, Transform): A variant of ETL where raw data is loaded into the destination warehouse first, and transformations are performed inside that warehouse using its own compute. This approach exploits the massive parallel processing power of modern cloud warehouses. Learn more → See also: Reverse ETL · Warehouse, Lake & Lakehouse
Embedding: A dense, low-dimensional vector that represents a discrete object—a word, token, or entity—in a continuous space where geometric distance reflects semantic similarity. Embeddings are learned during training and compress sparse one-hot identifiers into compact representations that the rest of the network can reason over. See also: Embeddings · Embeddings
Encoder: The portion of a model that reads the input and compresses it into a meaningful internal representation—a sequence of vectors in transformers, or a latent code in autoencoders. BERT-style models are encoder-only, excelling at tasks that require understanding the full input before producing an output. Learn more → See also: Sentence embeddings: SBERT · Embeddings
Entropy: The average surprise of a distribution, −Σ p log p — the floor on how many bits are needed to encode its outcomes. Maximal for a uniform distribution, zero when one outcome is certain. See also: Entropy & information theory · Expectation, Variance & SD
Epoch: One complete pass through the entire training dataset, after which the model has seen every example once. Training typically runs for many epochs, with the model improving each time until validation performance stops rising or starts degrading. See also: The training loop · Cross-Validation: k-fold, LOO, Stratified
ETL (Extract, Transform, Load): A data movement pattern where data is pulled from source systems, cleaned and reshaped in a separate processing engine, and then loaded into the destination store in its final form. Transformation happens before the data lands in the target. Learn more → See also: Reverse ETL · Normalization, Discretization, Sampling, Compression
Exploding Gradient: When gradients grow exponentially as they propagate backward through many layers, causing huge weight updates that destabilize training (often producing NaN loss). The mirror image of the vanishing gradient problem; the standard remedy is gradient clipping plus careful initialization and a lower learning rate. See also: Vanishing & exploding gradients · The training loop

F

F1 Score: The harmonic mean of precision and recall, ranging from 0 to 1. It gives a single number that balances both concerns, and is especially useful when classes are imbalanced and plain accuracy would be misleading. Learn more → See also: Confusion Matrix, Precision, Recall, ROC · Evaluating generative models
Fact Table: The central table in a dimensional model that stores measurable, quantitative events (such as sales transactions or web page views) along with foreign keys linking to surrounding dimension tables. Fact tables are typically narrow (a handful of keys and numeric measures) but very tall, growing by one row per event. Learn more → See also: Star vs Snowflake Schemas · Slowly Changing Dimensions
Faithfulness: A grounding metric: the fraction of an answer's atomic claims that are supported by the source. Measured by decomposing the answer into claims and labeling each grounded, inferred (plausible but unsupported), or contradicted. The 'inferred' bucket is where subtle hallucinations hide. See also: Hallucination & grounding · Natural language inference
Feature: An input variable fed into a machine-learning model — a column in your training table that the model can use to make predictions. Raw data is rarely ready to use directly; feature engineering transforms it into meaningful signals, for example converting a raw timestamp into 'day of week' or 'hours since last purchase'. See also: Feature engineering & encoding · Lag & rolling features
Feature Engineering: The craft of transforming raw data into inputs that expose useful signal for a model — creating ratio features, extracting day-of-week from a timestamp, binning ages into groups, or combining columns into meaningful interactions. Good feature engineering often matters more than model choice. See also: Feature engineering & encoding · Feature selection
Feature Scaling: Rescaling input features so they share a comparable numeric range, preventing features with large units (e.g., income in dollars) from dominating those with small ones (e.g., age in years). Algorithms that rely on distances or gradients — like KNN, SVM, or neural networks — require this; tree-based models do not. See also: Feature engineering & encoding · Lag & rolling features
Few-Shot Learning: Getting a model to perform a task correctly by including a small number of worked examples directly in the prompt, with no gradient updates to the model's weights. Large language models can generalize from as few as one to five examples because pre-training exposed them to the same patterns in diverse contexts. See also: Few-shot & chain-of-thought · Multi-token prediction
Fine-Tuning: Continuing to train a pre-trained model on a smaller, task-specific dataset so its weights shift toward the new domain without forgetting everything learned during pre-training. Fine-tuning is orders of magnitude cheaper than training from scratch and typically outperforms both random initialization and prompt-only approaches on specialized tasks. See also: Fine-tuning: LoRA & QLoRA · Distillation
Forecasting: Predicting future values of a quantity based on its past behaviour and, optionally, related external variables. Unlike classification or regression on static data, forecasting must respect temporal order — you can never train on future data and test on the past. Common approaches range from ARIMA and exponential smoothing to gradient-boosted trees and neural networks. See also: Evaluating forecasts (walk-forward) · Smoothing & Forecast Error
Foreign Key: A column in one table whose values must match an existing primary key value in another table. It enforces referential integrity: you cannot store an order for a customer who does not exist. See also: Keys & Integrity Constraints · Functional Dependencies & Closure
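The "order for a customer who does not exist" case can be reproduced with Python's built-in `sqlite3` module. The table names are illustrative; note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id))"
)

conn.execute("INSERT INTO customers VALUES (1, 'Ana')")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # fine: customer 1 exists

try:
    conn.execute("INSERT INTO orders VALUES (101, 999)")  # no customer 999
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rejected, order_count)  # True 1
```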
Foundation Model: A large model pre-trained on broad, diverse data at enormous scale, intended as a general-purpose base that can be adapted—via fine-tuning, prompting, or retrieval—to a wide range of downstream tasks. The term captures the architectural shift from training task-specific models from scratch to building everything on a shared, reusable backbone. See also: Establishing baselines · Pretraining data: curation & dedup
FSDP (Fully Sharded Data Parallel): A multi-GPU training strategy that shards a model's parameters, gradients, and optimizer states across GPUs so each holds only a fraction, gathering full weights for a layer only momentarily. Unlike DDP (which replicates the full model on every GPU), FSDP's per-GPU memory drops roughly linearly with the number of GPUs, enabling training of models too large to fit on one device. See also: Multi-GPU: DDP & FSDP · Pipeline parallelism

G

GAN: A Generative Adversarial Network pits two networks against each other: a generator that produces fake samples and a discriminator that tries to distinguish fakes from real data. Training drives the generator to produce increasingly realistic outputs until the discriminator can no longer tell them apart, yielding a model capable of synthesizing images, audio, or video. Learn more → See also: GANs from scratch · Evaluating generative models
Gaussian Elimination: The algorithm that solves a linear system by applying elementary row operations (swap, scale, add a multiple) until the matrix reaches echelon form. `np.linalg.solve` relies on the same idea: it calls a LAPACK routine that performs LU factorization with partial pivoting, which is Gaussian elimination organized as a matrix factorization. See also: Linear systems & RREF · Systems of Equations & Gaussian Elimination
Generator: A function that yields values one at a time instead of computing and returning them all at once. Because it produces each item only when asked, a generator uses a fraction of the memory that a list would — critical when processing millions of rows or reading large files. You recognise one by the `yield` keyword inside the function body. Learn more → See also: Comprehensions · Generative agents: simulations
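A minimal sketch of the lazy behaviour described above. The helper name `read_in_chunks` and the sample data are hypothetical, chosen only for illustration.

```python
def read_in_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk          # hand one chunk to the caller, then pause here
            chunk = []
    if chunk:                    # emit the final, possibly shorter, chunk
        yield chunk


chunks = read_in_chunks(range(7), chunk_size=3)  # nothing has run yet
first = next(chunks)    # the function body runs until its first yield
rest = list(chunks)     # drain the remaining chunks
print(first, rest)
```

Only one chunk is held in memory at a time, which is the point of using `yield` instead of building the full list up front.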
GPT: A transformer decoder pre-trained to predict the next token in a sequence, trained on massive web text with no task-specific labels. Scaling GPT—more parameters, more data, more compute—produces models that can write, reason, code, and follow instructions without ever being explicitly taught those tasks. Learn more → See also: Language models before transformers · Multi-token prediction
Gradient Boosting: An ensemble method that builds trees sequentially, where each new tree is trained to correct the residual errors left by all previous trees. Because each tree targets what the model currently gets wrong, the ensemble improves steadily — making gradient boosting (e.g., XGBoost, LightGBM) the dominant algorithm in tabular-data competitions. Learn more → See also: Bagging, boosting & stacking · Decision trees
Gradient Clipping: A technique that caps the size of gradients before the optimizer step: if the global gradient norm exceeds a threshold, all gradients are rescaled down by the same factor, preserving direction. It is the standard fix for exploding gradients, common in RNNs and large-learning-rate training. See also: Vanishing & exploding gradients · Gradient descent
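The rescaling rule can be sketched in plain Python. This is an illustrative version that treats gradients as nested lists; the function name `clip_by_global_norm` and the threshold are assumptions, and real frameworks ship their own utilities for this.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradient vectors together if their joint L2 norm exceeds max_norm."""
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= max_norm:
        return grads, global_norm            # already small enough: leave untouched
    scale = max_norm / global_norm           # one shared factor preserves direction
    return [[g * scale for g in vec] for vec in grads], global_norm


grads = [[3.0, 0.0], [0.0, 4.0]]             # joint norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```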
Gradient Descent: An iterative optimization algorithm that repeatedly nudges a model's parameters in the direction that most steeply reduces the loss function, like descending a hill by always stepping downhill. The size of each step is controlled by the learning rate. Learn more → See also: Gradient Descent (One Step) · The training loop
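As a small illustration, the loop below minimizes the toy loss (w - 3)^2, whose gradient is 2(w - 3). The loss, starting point, and learning rate are assumptions picked to keep the example short.

```python
def gradient(w):
    """Gradient of the toy loss (w - 3)**2 with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate, steps):
    """Repeatedly step against the gradient; the learning rate sets the step size."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


w_final = gradient_descent(w=0.0, learning_rate=0.1, steps=100)
print(w_final)  # approaches the minimizer w = 3
```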
GROUP BY: A SQL clause that partitions rows into buckets sharing the same value in one or more columns, then applies an aggregate function to each bucket. Every column in the SELECT that is not inside an aggregate must appear in the GROUP BY list. Learn more → See also: ORDER BY, LIMIT, DISTINCT · GroupBy
GroupBy: A pandas operation that splits a DataFrame into groups by the unique values of one or more columns, applies an aggregation or transformation to each group, and combines the results — the split-apply-combine pattern. `df.groupby('country')['revenue'].sum()` computes total revenue per country in one readable line. Learn more → See also: Aggregates & GROUP BY · Method chaining
Guardrails: Layered safety controls around an LLM: input guardrails (filter/classify the request), instruction hierarchy (separate trusted instructions from untrusted data), output guardrails (scan the response/action before it takes effect), and least privilege (limit tool scope). No single layer is sufficient — they are stacked as defense in depth. See also: Guardrails & output validation · Prompt injection & guardrails

H

Hallucination: When a language model generates factually wrong or entirely fabricated information with the same confident tone it uses for accurate facts. It happens because the model is trained to produce plausible-sounding token sequences, not to verify claims against a ground truth, making retrieval augmentation and careful prompting necessary safeguards. See also: Hallucination & grounding · Generative agents: simulations
Handoff: In agent frameworks (e.g. the OpenAI Agents SDK), one agent delegating the conversation to another, more specialized agent. Implemented as a transfer tool call, so it appears in the run trace like any other action — the basis of triage-to-specialist routing. See also: Agents, handoffs & guardrails · AGENTS.md, Skills & Tools
Hessian: The matrix of second-order partial derivatives of a function, describing its curvature in every direction. Its eigenvalues classify a critical point as a minimum (all positive), maximum (all negative), or saddle (mixed signs); if any eigenvalue is zero, the test is inconclusive. See also: Jacobian, Hessian & Taylor · Maxima, Minima & the 2nd-Derivative Test
Hyperparameter: A configuration setting chosen before training that governs how the learning algorithm works — such as the number of trees in a forest, the depth of a neural network, or the regularization strength. Unlike model parameters (which are learned from data), hyperparameters are set by the practitioner, often via cross-validated search. See also: Hyperparameter tuning · Overfitting & bias–variance

I

Idempotency: The property of an operation that produces the same result no matter how many times it is executed. A pipeline step is idempotent if re-running it after a failure does not duplicate or corrupt data. See also: Durable execution for agents · Independent vs Mutually Exclusive
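A hedged sketch contrasting a non-idempotent load with an idempotent one. The two loader functions and the in-memory "tables" are hypothetical stand-ins for real pipeline steps.

```python
def load_append(table, rows):
    """Not idempotent: every re-run adds the same rows again."""
    table.extend(rows)


def load_upsert(table, rows):
    """Idempotent: rows are written by key, so a re-run overwrites instead of duplicating."""
    for row in rows:
        table[row["id"]] = row


batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
appended, upserted = [], {}
for _ in range(2):                 # simulate a retry after a failure
    load_append(appended, batch)
    load_upsert(upserted, batch)

print(len(appended), len(upserted))
```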
Index: A separate data structure the database maintains alongside a table so it can find rows matching a condition without scanning every row. Think of it as a book's back-index: instead of reading every page, you jump straight to the right one. See also: File Organization & Indexing · Indexes, query engines & retrievers
Inference: Running a trained model on new inputs to produce predictions, as opposed to the training phase where weights are updated. Inference speed and cost dominate production AI systems because a model may be trained once but queried billions of times, making efficiency techniques like batching, quantization, and caching economically critical. See also: Batch vs real-time inference · Approximate Inference: Sampling
Inner Join: A JOIN that returns only the rows for which a matching row exists in both tables. Rows that have no counterpart in the other table are silently excluded from the result. Learn more → See also: LEFT, RIGHT, FULL · Anti-joins
Iterator: Any object that delivers one item at a time via `next()` and raises `StopIteration` when exhausted. For-loops, comprehensions, and `zip()` all work by calling `next()` behind the scenes. Iterators let Python process sequences of any length without loading everything into memory at once. See also: Iterators · Generators
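The sketch below spells out roughly what a for-loop does with an iterator. `Countdown` and `manual_for` are illustrative names, not part of any library.

```python
class Countdown:
    """A hand-written iterator that counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration      # signals that the iterator is exhausted
        value = self.current
        self.current -= 1
        return value


def manual_for(iterable):
    """Collect items the way a for-loop would: call next() until StopIteration."""
    collected = []
    iterator = iter(iterable)
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            break
        collected.append(item)
    return collected


result = manual_for(Countdown(3))
print(result)
```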

J

Jacobian: The matrix of all first-order partial derivatives of a vector-valued function, with each output's gradient as a row. Backpropagation is a product of Jacobians applied via the chain rule. See also: Backpropagation foundations · Backpropagation (One Step)
JOIN: An operation that combines rows from two or more tables based on a matching condition between their columns. The result is a single virtual table that can draw columns from all participating sources. Learn more → See also: Joins & Division · LEFT, RIGHT, FULL
JSON: JavaScript Object Notation — a lightweight, human-readable text format for structured data built from nested objects (key-value pairs) and arrays. It has become the default language for web APIs. Python's `json` module converts a JSON string into a Python dict in one call: `data = json.loads(response.text)`. See also: Structured outputs · Constrained decoding

K

K-Means Clustering: An unsupervised algorithm that partitions data into k groups by repeatedly assigning each point to its nearest cluster center, then recalculating the centers as the mean of their assigned points, until assignments stop changing. The user must specify k in advance, which is often determined by examining the inertia curve (the 'elbow method'). See also: k-means & k-medoid Clustering · K-means clustering
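The assign-then-recompute loop can be sketched for one-dimensional data. This is a simplified illustration with hand-picked starting centers; the function name `kmeans_1d` and the data are assumptions, and production code would normally use a library implementation.

```python
def kmeans_1d(points, centers, max_iters=100):
    """Alternate nearest-center assignment and mean updates until assignments stop changing."""
    centers = list(centers)          # work on a copy of the starting centers
    assignments = None
    for _ in range(max_iters):
        new_assignments = [
            min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break                    # converged: no point changed cluster
        assignments = new_assignments
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:              # keep the old center if a cluster is empty
                centers[j] = sum(members) / len(members)
    return centers, assignments


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, assignments = kmeans_1d(points, centers=[1.0, 2.0])
print(centers, assignments)
```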
K-Nearest Neighbors (KNN): A simple algorithm that classifies or regresses a new point by taking a vote (or average) among its k closest training examples in feature space. There is no explicit training phase — the entire dataset is the model — so prediction is slow on large datasets, and the method is sensitive to the scale of features. See also: K-nearest neighbors · k-Nearest Neighbours
Kafka: A distributed, fault-tolerant event-streaming platform that stores ordered streams of messages in topics and delivers them to consumers at high throughput. Producers append messages to a topic's log; each consumer group reads from its own offset, so messages can be replayed without coordination. See also: Sessionization · Bloom Filters, HyperLogLog & MinHash-LSH
KV Cache: The stored attention keys and values of previously generated tokens, kept in GPU memory so each autoregressive step only processes the new token instead of recomputing attention over the whole sequence. It grows linearly with sequence length and batch size, so it often dominates inference memory and is a major reason LLM serving tends to be memory-bound. See also: KV cache & continuous batching · KV cache offloading & memory tiers

L

L1 Regularization (Lasso): A regularization method that adds the sum of absolute values of the model's weights to the loss function. Because it can push individual weights all the way to zero, it acts as an automatic feature selector — useful when you suspect only a handful of features truly matter. Learn more → See also: Ridge Regression & Regularization · Lagrange multipliers
L2 Regularization (Ridge): A regularization method that adds the sum of squared weights to the loss function, pushing all weights toward zero but rarely to exactly zero. It handles correlated features gracefully and is the default choice when you want smooth, stable coefficients rather than feature elimination. Learn more → See also: Ridge Regression & Regularization · Maximum likelihood & MAP
Label: The known answer for a training example in a supervised learning problem — the column your model is trying to predict. In an email spam classifier the label is 'spam' or 'not spam'; in a house-price model it is the actual sale price. Models learn by comparing their guesses against labels and adjusting until errors are small. See also: Sequence labeling: NER & POS · Logistic regression
Lagrange Multiplier: The scalar λ linking the gradients of the objective and a constraint at a constrained optimum (∇f = λ∇g). It turns 'optimize subject to a rule' into one equation and equals the constraint's shadow price. See also: Lagrange multipliers · Convexity & Single-Variable Optimization
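A small worked example may make the condition concrete. The problem (maximize xy subject to x + y = 10) is chosen purely for illustration.

```latex
\begin{aligned}
&f(x, y) = xy, \qquad g(x, y) = x + y = 10 \\
&\nabla f = \lambda \nabla g
  \;\Longrightarrow\; (y,\; x) = \lambda\,(1,\; 1)
  \;\Longrightarrow\; x = y = \lambda \\
&x + y = 10 \;\Longrightarrow\; x = y = 5, \qquad \lambda = 5, \qquad f = 25 \\
&\text{Shadow price: with } x + y = c,\; f^{*}(c) = \tfrac{c^{2}}{4}
  \text{ and } \tfrac{d f^{*}}{d c} = \tfrac{c}{2} = 5 = \lambda \text{ at } c = 10 .
\end{aligned}
```

Loosening the constraint by one small unit raises the optimum by about λ, which is what "shadow price" means here.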
Lakehouse: An architecture that layers transactional table formats (such as Delta Lake or Apache Iceberg) on top of cheap object storage, giving a data lake ACID guarantees and SQL-queryable structure previously only available in warehouses. Learn more → See also: Delta Lake & MERGE · OLTP vs OLAP
Lambda: A throwaway, anonymous function defined in a single expression: `lambda x: x * 2`. Use it when you need a simple function as an argument (for example, a custom sort key) and it is too small to justify a named `def`. A lambda body is a single expression, so it cannot contain statements such as assignments, loops, or `return`. See also: Function/tool calling · Chains & LCEL
Large Language Model: A neural network with billions of parameters trained on internet-scale text to predict tokens, which emerges with broad abilities—summarization, translation, question answering, code generation—that were never explicitly programmed. 'Large' refers both to parameter count and to the data and compute required, which together enable capabilities absent in smaller models. See also: What an LLM is · Multilingual NLP
Latent Space: The lower-dimensional internal representation space that a model uses to encode its inputs, where similar concepts cluster together and relationships can be explored by arithmetic on vectors. Generative models use the latent space as the 'creative canvas': sampling or editing a point in latent space and decoding it produces a corresponding output. See also: VAEs from scratch · Latent diffusion & Stable Diffusion
Layer Normalization: A normalization that rescales each individual sample across its feature dimension to roughly mean 0, variance 1, independent of the batch. Unlike batch normalization it behaves identically in training and inference and works with batch size 1, making it the standard choice in transformers. RMSNorm is a simpler, faster variant that drops the mean-subtraction step and is now the default in most open LLMs. See also: Inside the transformer block · Norms & distances
Learning Rate: A number that controls how large each gradient descent step is. Too high and the model overshoots the minimum, oscillating or diverging; too low and training is painfully slow. Finding the right learning rate — or scheduling it to decrease over time — is one of the most impactful hyperparameter decisions in deep learning. Learn more → See also: Batch size ↔ learning rate · Learning-rate schedules
Left Join: A JOIN that returns every row from the left table plus any matching rows from the right table. When no match exists in the right table, the right-side columns appear as NULL rather than being dropped. Learn more → See also: INNER JOIN · Joins & Division
Linear Independence: A set of vectors is linearly independent when none can be written as a combination of the others, so each contributes a genuinely new direction. Dependent vectors are redundant. See also: Independence, Span, Basis & Dimension · Rank, independence & basis
List Comprehension: A compact Python syntax for building a new list by applying an expression to each item in an iterable, optionally filtering items with a condition, all in one expression. It replaces a multi-line for-loop with something like `[x**2 for x in range(10) if x % 2 == 0]`. List comprehensions are usually somewhat faster than an equivalent loop that calls `.append()`, because the interpreter appends with a dedicated bytecode instruction instead of a method lookup and call per item. Learn more → See also: Lists · Control Flow
LlamaIndex: A data framework for LLM apps that turns documents into a queryable index through swappable components — loaders, node parsers, indexes, retrievers, and response synthesizers. In 2026 it builds query engines and agents on event-driven Workflows and leads on document agents. See also: Indexes, query engines & retrievers · Event-driven Workflows
LlamaParse: A document parser that treats each PDF page as an image and uses a vision-language model to read tables, columns, and layout, emitting clean markdown. It fixes the silent RAG ceiling: naive text extraction that scrambles tables and reading order. See also: LlamaParse — document parsing · Multimodal RAG
LLM-as-Judge: Using a capable language model to grade other models' outputs against a rubric, for open-ended quality where no exact metric exists. Scales cheaply and agrees with humans surprisingly often, but suffers biases (position, verbosity, self-preference) that must be controlled with rubrics, order-averaging, and human-label validation. See also: LLM evals & LLM-as-judge · Reflection
LoRA: Low-Rank Adaptation freezes a pre-trained model's weights and injects small trainable rank-decomposition matrices beside each attention layer, adapting the model to a new task with a fraction of the parameters and memory a full fine-tune would require. The original weights are untouched, so multiple LoRA adapters can be swapped in and out of the same base model cheaply. Learn more → See also: Fine-tuning: LoRA & QLoRA · Fine-tune vs RAG: the decision
Loss Function: A mathematical formula that quantifies how wrong a single prediction is by comparing it to the true answer. During training, the optimizer minimizes this quantity across examples; the choice of loss function encodes what kinds of errors matter — mean squared error penalizes large mistakes heavily, while cross-entropy is suited to probability outputs. See also: Loss Functions · Maximum likelihood & MAP
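To illustrate how the choice of loss encodes which errors matter, here is a minimal sketch of the two losses named above for a single prediction. The function names and sample numbers are illustrative.

```python
import math


def squared_error(y_true, y_pred):
    """Squared error: the penalty grows with the square of the miss."""
    return (y_true - y_pred) ** 2


def binary_cross_entropy(y_true, p_pred):
    """Cross-entropy for one binary label, given the predicted probability of class 1."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))


small_miss = squared_error(10.0, 9.0)            # off by 1
large_miss = squared_error(10.0, 6.0)            # off by 4, penalized 16 times as much
confident_right = binary_cross_entropy(1, 0.9)   # low loss
confident_wrong = binary_cross_entropy(1, 0.1)   # much higher loss
print(small_miss, large_miss, confident_right, confident_wrong)
```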
LSTM: Long Short-Term Memory is an RNN variant with an explicit memory cell and three learned gates (input, forget, and output) that control what information is written, erased, and read at each step. The gating mechanism lets LSTMs carry relevant context across hundreds of time steps while greatly reducing the vanishing-gradient problem that limits plain RNNs. See also: RNNs & LSTMs · Multi-Layer Perceptron & Activations

M

Mahalanobis Distance: Distance from a point to a distribution's mean measured in its own stretched, correlated units via Σ⁻¹ — 'how many standard deviations away,' accounting for the shape of the data. Used in anomaly detection. See also: Norms & distances · The multivariate Gaussian
Materialized View: A database object that stores the pre-computed results of a query on disk rather than recomputing them on every request. Queries against it are instant, but the view must be refreshed periodically to reflect changes in the underlying data. See also: KV cache & continuous batching · Window functions
Matryoshka Embeddings: Embeddings trained so the most important information is packed into the first dimensions, allowing a vector (e.g. 3072-d) to be truncated to 512 or 256 dims while keeping ~93–95% of retrieval quality. The standard cost/latency lever for vector search. See also: Embeddings · Embeddings
Maximum A Posteriori (MAP): Maximum likelihood plus a prior belief about the parameters. A Gaussian prior yields L2 (ridge) regularization and a Laplace prior yields L1 (lasso), so regularization is literally MAP estimation. See also: Maximum likelihood & MAP · Ridge Regression & Regularization
Maximum Likelihood Estimation (MLE): Choosing the parameters that make the observed data most probable. Gaussian noise yields mean squared error; categorical labels yield cross-entropy — most loss functions are MLE in disguise. See also: Maximum likelihood & MAP · The multivariate Gaussian
MCP: The Model Context Protocol is an open standard that defines how AI agents discover and call external tools and data sources through a uniform interface, much like HTTP standardized web communication. By separating the tool definitions from the model, MCP lets the same agent code connect to any compliant server without bespoke integration work. Learn more → See also: Advanced MCP primitives · FastMCP — build MCP servers fast
Merge: Combining two DataFrames horizontally by matching rows on a shared key column, exactly like a SQL JOIN. `pd.merge(orders, customers, on='customer_id', how='left')` attaches customer details to every order row, keeping all orders even when no customer record exists. The `how` parameter controls which rows survive: inner, left, right, or outer. Learn more → See also: INNER JOIN · Joins in Spark
Missing Data: Values that are absent from a dataset, represented in pandas as `NaN` (for numeric columns) or `None` / `pd.NA`. Missing data can bias statistics and crash models if ignored. Analysts either drop affected rows, fill them with a substitute value (mean, median, a placeholder), or flag them with an indicator column. Learn more → See also: DataFrame basics · Anomaly detection
Mixture of Experts (MoE): A model design that replaces one large feed-forward layer with many smaller 'expert' networks plus a router that sends each token to only its top-k experts. The model stores all experts (large capacity) but computes only a few per token (low compute), so total parameters can be far larger than the active parameters used on any one token. See also: Mixture of Experts · Mixture of Experts
Model Routing: Matching each query to the cheapest model that can handle it — easy queries to a small cheap model, hard ones to a frontier model — using a complexity classifier or a cheap-first cascade. Because most traffic is easy, routing only the hard fraction up cuts cost 45–85% at near-equal quality. See also: Model routing & cascades · Cost & latency control
Multi-Head Attention: An extension of self-attention that runs several independent attention computations in parallel, each with its own learned projections, then concatenates their outputs. Different heads can learn to track different types of relationships simultaneously—syntax in one head, coreference in another—giving the model richer contextual representations. Learn more → See also: Sparse & sub-quadratic attention · Self-attention
Multivariate Gaussian: The bell curve generalized to many dimensions, described entirely by a mean vector and a covariance matrix. Its contours are ellipses, and it underlies GMMs, Gaussian processes, VAEs, and Kalman filters. See also: The multivariate Gaussian · Gaussian mixture models
Mutability: Whether an object's contents can be changed after it is created. Lists, dicts, and sets are mutable — you can add, remove, or replace items in place. Strings, tuples, and integers are immutable — any 'change' produces a new object. Mutability matters because passing a mutable object to a function lets that function silently alter the original. See also: Lists, Tuples, Dicts, Sets & Gotchas · Tuples
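A short sketch of the "silently alter the original" pitfall and one way around it. Both helper functions are hypothetical.

```python
def add_total(row):
    """Mutates the caller's list in place by appending its sum."""
    row.append(sum(row))
    return row


def with_total(row):
    """Builds and returns a new list, leaving the caller's list untouched."""
    return row + [sum(row)]


original = [1, 2, 3]
add_total(original)          # original itself has changed

safe = [1, 2, 3]
copy = with_total(safe)      # safe is unchanged, copy holds the result
print(original, safe, copy)
```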
Mutual Information: How much knowing one variable reduces uncertainty about another — zero only when they are independent. It drives information-gain feature selection and many self-supervised objectives. See also: Entropy & information theory · The Bias-Variance Trade-off

N

NaN: Short for 'Not a Number', the floating-point sentinel value used to represent a missing or undefined numeric result. Arithmetic with `NaN` propagates silently: adding any number to `NaN` gives `NaN`, so a single missing value can turn a plain NumPy aggregation such as `np.sum` into `NaN`. pandas aggregations skip NaNs by default (`skipna=True`), which avoids the `NaN` result but can hide the fact that values are missing. Use `df.isna()` to locate NaNs before they cause silent errors. Learn more → See also: Numbers · Numerical stability
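The float-level behaviour can be seen with the standard library alone; the sample values below are illustrative.

```python
import math

nan = float("nan")
values = [1.0, 2.0, nan]

total = sum(values)              # NaN propagates through ordinary addition
equal_to_itself = (nan == nan)   # NaN compares unequal even to itself
clean_total = sum(v for v in values if not math.isnan(v))  # filter explicitly

print(total, equal_to_itself, clean_total)
```

Because `nan == nan` is false, checks such as `math.isnan` (or `isna()` in pandas) are the reliable way to detect it.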
Neural Network: A computational system built from layers of interconnected nodes that transform raw input data into predictions by learning weighted connections during training. Each layer extracts progressively more abstract features, enabling the network to recognize patterns too complex for hand-coded rules. See also: Perceptron & the Update Rule · Multi-Layer Perceptron & Activations
Normalization: The process of organizing table columns and relationships to remove redundant data and update anomalies. A normalized design stores each fact in exactly one place so changing it once keeps the whole database consistent. See also: Normal Forms: 1NF to BCNF · Normalization, Discretization, Sampling, Compression
NumPy Array: A fixed-type, multi-dimensional grid of numbers stored in a single contiguous block of memory. Unlike a Python list, every element has the same data type, so NumPy can apply C-speed loops over the entire array without the overhead of Python object handling. It is the universal currency of numerical Python: pandas, scikit-learn, and TensorFlow all accept or return NumPy arrays. Learn more → See also: The ndarray · Why NumPy

O

OLAP (Online Analytical Processing): A database usage pattern designed for queries that scan millions of rows to compute aggregates, trends, and comparisons across large historical datasets. OLAP systems trade write speed for fast analytical read throughput. Learn more → See also: Concept Hierarchies & Measures · Warehouse, Lake & Lakehouse
OLTP (Online Transaction Processing): A database usage pattern optimized for high-volume, short-lived read/write operations that affect a small number of rows at a time — like inserting an order or updating an account balance. OLTP systems prioritize low latency and ACID compliance. Learn more → See also: ETL vs ELT · Warehouse, Lake & Lakehouse
One-Hot Encoding: A way to represent a categorical variable with no natural order (such as country or color) by replacing it with a set of binary columns, one per category, where exactly one column is 1 and the rest are 0. This lets algorithms treat categories as distinct without implying a false numeric ranking between them. See also: Feature engineering & encoding · Rank, Nullity & Solution Sets
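A minimal pure-Python sketch of the encoding. The function name `one_hot` and the alphabetical column order are choices made for this example; libraries may order or name columns differently.

```python
def one_hot(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1        # exactly one column is set per row
        rows.append(row)
    return categories, rows


categories, encoded = one_hot(["red", "green", "red", "blue"])
print(categories)
print(encoded)
```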
Optimizer: The algorithm that uses computed gradients to update a model's weights, determining how quickly and in which direction each parameter moves. The choice of optimizer—and its hyperparameters such as learning rate—has a major influence on whether training converges, how fast it does, and the quality of the final model. Learn more → See also: Gradient Descent (One Step) · Partial derivatives
Orchestration: The coordination of when, in what order, and on what schedule individual pipeline tasks run, including handling retries on failure and alerting on errors. An orchestrator manages the dependencies between tasks so each one starts only when its inputs are ready. Learn more → See also: Pipeline orchestration · Multi-agent: supervisor & swarm
Orthogonality: Two vectors are orthogonal when their dot product is zero — geometrically, perpendicular. Orthogonal directions share no linear information, making projections and coordinates trivial to compute. See also: Orthogonality & Orthogonal Matrices · Orthogonality & least squares
Outlier: A data point that lies far outside the typical range of the other values. Outliers can be legitimate extreme events (a billionaire in an income survey), measurement errors, or data-entry mistakes. Left untreated they skew means, distort regression lines, and confuse distance-based models; whether to remove, cap, or keep them depends on why they exist. See also: Anomaly detection · Skew & salting
Overfitting: A model that has memorized the training data's noise and quirks rather than the underlying pattern — so it scores impressively on training examples but fails on new ones. The fix is more data, regularization, early stopping, or a simpler model architecture. Learn more → See also: Overfitting & bias–variance · Bias–variance & learning curves

P

P-Value: The probability of observing a result at least as extreme as the one you got, assuming the null hypothesis is true. A small p-value (e.g., below 0.05) means the data would be very surprising if nothing were going on — but it does not measure how large or practically important the effect is. Learn more → See also: A/B testing · Expected Value
PagedAttention: A KV-cache memory technique (introduced by vLLM) that allocates the cache in small fixed blocks on demand, like operating-system paging, instead of reserving the maximum sequence length per request. It nearly eliminates memory waste, letting far more requests run concurrently and raising throughput. See also: KV cache & continuous batching · KV cache offloading & memory tiers
Parquet: An open-source, columnar file format designed for efficient storage and fast analytical reads of large datasets. It embeds the schema inside the file, supports predicate pushdown, and achieves high compression ratios, making it the de-facto standard for data lake storage. Learn more → See also: Warehouse, Lake & Lakehouse · Delta Lake & MERGE
Partitioning: The division of a large table into smaller, physically separate segments based on the values of one or more columns, such as date or region. Queries that filter on the partition key skip irrelevant segments entirely, dramatically reducing I/O. See also: Partitioning · Columnar Storage & Parquet
Perceptron: The simplest neural unit: it multiplies each input by a learned weight, sums the results, adds a bias, and passes the total through an activation function to produce one output. A single perceptron can separate two linearly separable classes; stacking many of them creates a deep network capable of far richer boundaries. See also: Perceptron & the Update Rule · Multi-Layer Perceptron & Activations
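The weighted sum, bias, and activation can be sketched as follows, together with the classic perceptron update rule trained on the linearly separable AND function. The step activation, learning rate, and epoch count are assumptions chosen for illustration.

```python
def predict(weights, bias, inputs):
    """Weighted sum plus bias, passed through a step activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train(samples, epochs=10, learning_rate=1.0):
    """Perceptron rule: nudge weights by (target - prediction) * input."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(and_gate)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_gate]
print(predictions)
```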
Pickle: Python's built-in serialisation format that converts most Python objects (a trained model, a complex dict, an instance of a custom class) into bytes that can be saved to disk and loaded back later. Some objects, such as lambdas and open file handles, cannot be pickled. It is convenient for caching expensive computation between sessions, but pickle files from untrusted sources can execute arbitrary code on load, so never unpickle data you did not create yourself. See also: DataFrame basics · Lists
Pivot Table: A reshaped summary table where unique values of one column become the new column headers, allowing you to see aggregated metrics across two dimensions at a glance. In pandas, `pd.pivot_table(df, values='sales', index='region', columns='product', aggfunc='sum')` turns long-format transaction data into a region-by-product sales matrix. See also: pivot, melt, stack · GroupBy
Pooling: A downsampling operation that summarizes small regions of a feature map into single values—typically the maximum or average—reducing spatial dimensions and making representations more robust to small translations. It shrinks the computation required for subsequent layers while retaining the most salient detected features. See also: PCA & dimensionality reduction · Text summarization
Population: The complete set of individuals or observations you want to draw conclusions about. Because measuring every member of a population is often impractical, analysts work with samples and use statistics to infer population properties. Knowing your intended population prevents scope creep — a model trained on US customers does not automatically generalise to global ones. See also: Sampling methods · Approximate Inference: Sampling
Positional Encoding: A signal added to each token's embedding before it enters the transformer to tell the model the token's position in the sequence, since self-attention itself is order-agnostic. Fixed sinusoidal patterns or learned vectors both work; without positional encoding, the model would treat 'dog bites man' and 'man bites dog' identically. Learn more → See also: Attention (the RNN era) · Embeddings
Positive Semi-Definite (PSD): A symmetric matrix whose eigenvalues are all ≥ 0, equivalently vᵀMv ≥ 0 for every vector v. Covariance matrices and convex Hessians are PSD — the matrix version of 'non-negative.' See also: Projections & Idempotent Matrices · Singular Value Decomposition
Precision: Of all the examples a model labeled positive, the fraction that truly are positive. High precision means few false alarms — critical in spam filters where legitimate mail must not be mislabeled. Learn more → See also: Model Calibration · Bayes' Theorem
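As a worked illustration, precision can be computed directly from true and predicted labels. The labels below are made up for the example.

```python
def precision(y_true, y_pred):
    """True positives divided by all predicted positives (0.0 if none were predicted)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    return true_pos / pred_pos if pred_pos else 0.0


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]    # three predicted positives, two of them correct
score = precision(y_true, y_pred)
print(score)
```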
Primary Key: A column or set of columns whose values uniquely identify every row in a table. No two rows can share the same primary key value, and the column cannot be NULL, making it the definitive address for any record. See also: Keys & Integrity Constraints · Deduplication
Principal Component Analysis (PCA): A technique that reduces the number of features by finding new axes — principal components — that capture the directions of greatest variance in the data, then projecting the data onto the top few components. It compresses information, removes correlated features, and can make downstream models faster and less prone to the curse of dimensionality. Learn more → See also: PCA & dimensionality reduction · PCA & Dimensionality Reduction
Prompt Engineering: The craft of designing the text input to a language model—instructions, examples, context, formatting—to reliably elicit the desired output without changing the model's weights. Because LLMs are sensitive to phrasing, a well-designed prompt can dramatically improve accuracy, safety, or output structure on the same underlying model. See also: Prompt patterns that work · Pretraining LLMs
Prompt Injection: An attack where text in the input or in content the model ingests (a web page, document, or RAG chunk) is interpreted as instructions, making the model disobey its system prompt. The #1 OWASP LLM risk; defended in depth with input/output guardrails, instruction hierarchy, and least-privilege tools. See also: Prompt injection & guardrails · Agent Security

Q

Quantization: Reducing the numerical precision of a model's weights—from 32-bit floats to 8-bit or 4-bit integers—to shrink its memory footprint and speed up inference with minimal accuracy loss. Quantized models run on consumer GPUs and even CPUs that could not fit the full-precision version, democratizing access to large models. See also: Quantization · Distillation
Query Optimizer: The component inside a database engine that analyzes a SQL statement and selects the most efficient execution plan — deciding join order, which indexes to use, and how to parallelize work — before any data is touched. See also: Adaptive Query Execution · SELECT basics

R

RAG: Retrieval-Augmented Generation combines a language model with a retrieval system: at query time, relevant documents are fetched from an external store and injected into the prompt so the model can ground its answer in up-to-date, verifiable text rather than relying solely on memorized training knowledge. This reduces hallucinations and avoids the need to retrain the model whenever facts change. Learn more → See also: Advanced RAG · Multimodal RAG
Random Forest: An ensemble of decision trees, each trained on a random bootstrap sample of the data with a random subset of features considered at each split. Averaging many diverse, slightly wrong trees cancels out their individual errors, producing a robust model that rarely overfits as badly as a single tree. Learn more → See also: Bagging, boosting & stacking · Decision Trees
Rank: The number of linearly independent columns of a matrix — the count of pivots after elimination, or of non-zero singular values. It measures how many genuine dimensions the data spans. See also: Rank, independence & basis · Rank, Nullity & Solution Sets
RDD (Resilient Distributed Dataset): Spark's foundational distributed data abstraction: an immutable, partitioned collection of records spread across a cluster that can be recomputed from its lineage if a partition is lost. Modern Spark code favors DataFrames, but RDDs remain the underlying execution primitive. See also: DataFrame intro · Databricks platform overview
ReAct: An agent reasoning loop that interleaves Thought → Action → Observation: the agent reasons, takes one tool action, observes the result, and reasons again. It adapts to each observation (good for dynamic tasks) at the cost of one LLM call per step. See also: ReAct, Plan-Execute, Reflexion · ReWOO: plan-execute without observation
Reasoning Model: A language model trained (often with reinforcement learning) to spend test-time compute on a long internal chain of thought before answering. o-series and DeepSeek-R1-style models trade extra inference cost for higher accuracy on hard math, code, and reasoning tasks; the lever you tune is a thinking budget, not temperature. See also: Reasoning models & test-time compute · Tree of Thoughts
Recall: Of all examples that actually are positive, the fraction the model correctly identified. High recall means few real positives are missed — critical in cancer screening where missing a true case is far costlier than a false alarm. Learn more → See also: Confusion Matrix, Precision, Recall, ROC · Model Calibration
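To make the contrast between recall and precision concrete, here is a minimal sketch that computes both from binary labels. The helper name `precision_recall` and the toy labels are made up for illustration.

```python
def precision_recall(y_true, y_pred):
    """Compute (precision, recall) for binary labels where 1 means positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: 4 real positives, the model flags 3 examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Here the model finds 2 of the 4 real positives (recall 0.5), while 2 of its 3 positive calls are correct (precision about 0.67).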
Recommender System: A model that predicts which items a user is most likely to find relevant and surfaces them proactively — products, movies, articles, songs. Recommenders power a huge share of internet engagement. The two dominant families are collaborative filtering (learn from collective user behaviour) and content-based filtering (match item attributes to user preferences). Learn more → See also: Content-based filtering · The utility matrix
Reduced Row Echelon Form (RREF): Row echelon form taken further: every pivot is 1 and is the only non-zero entry in its column. Once a system is in RREF, the solution can be read directly off the matrix. See also: Linear systems & RREF · LU Decomposition
Reflexion: An agent loop that attempts a task, self-critiques the result (reflection), and retries with that feedback. It trades extra passes for quality and is most valuable on hard tasks whose results can be verified, like code that must pass tests. See also: ReAct, Plan-Execute, Reflexion · Reflection
Regularization: A family of techniques that add a penalty to a model's loss function to discourage it from growing overly complex, reducing overfitting. The two most common forms are L1 (which zeroes out unimportant features) and L2 (which shrinks all weights toward zero without eliminating any). Learn more → See also: Ridge Regression & Regularization · Maximum likelihood & MAP
ReLU: Short for Rectified Linear Unit, it outputs the input unchanged if positive and zero otherwise: max(0, x). Because its gradient is exactly 1 for positive inputs, it trains faster than older saturating activations such as sigmoid and tanh and mitigates the vanishing-gradient problem, although a neuron whose input stays negative receives zero gradient and can stop learning. Its sparsity (many neurons output zero) also acts as a mild regularizer. Learn more → See also: Multi-Layer Perceptron & Activations · Ridge Regression & Regularization
Residual Connection: A shortcut that adds a layer's input to its output (x + f(x)) instead of replacing it. This gives gradients a direct path backward that skips the layer's multiplications, which is what allows very deep networks like ResNets and transformers to train without vanishing gradients. See also: Inside the transformer block · Vanishing & exploding gradients
REST: Representational State Transfer — an architectural style for web APIs that maps operations to HTTP verbs (GET to read, POST to create, PUT/PATCH to update, DELETE to remove) and addresses resources with URLs. A REST API for a user database might use `GET /users/42` to fetch user 42 and `POST /users` to create a new one. Most public data APIs follow REST conventions. See also: Serving with FastAPI · The Relational Model
RLHF: Reinforcement Learning from Human Feedback trains a reward model on human preference judgments—which of two outputs is better—then uses that reward signal to fine-tune the language model via reinforcement learning. This is how instruction-following and safety behaviors are instilled in models like ChatGPT, aligning outputs with what humans actually want rather than raw token prediction. See also: Direct Preference Optimization (DPO) · Alignment: SFT, RLHF & DPO
RNN: A Recurrent Neural Network processes sequences one element at a time, passing a hidden state forward so each step can depend on everything seen so far. This makes RNNs naturally suited for text and time-series, but long sequences cause vanishing gradients that prevent the network from retaining early context. See also: RNNs & LSTMs · Sequence-to-sequence models
ROC AUC: The area under the Receiver Operating Characteristic curve, which plots true positive rate against false positive rate at every possible classification threshold. A score of 0.5 means the model is no better than random guessing; 1.0 means perfect discrimination — making AUC a threshold-independent measure of ranking quality. Learn more → See also: Confusion Matrix, Precision, Recall, ROC · Model Calibration
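The ranking interpretation can be sketched directly: the area under the ROC curve equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. The function below is an illustrative O(n²) version under that assumption, not an efficient implementation.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy data: one positive (0.35) is ranked below one negative (0.4).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Three of the four positive-negative pairs are ordered correctly, so the score is 0.75, and no classification threshold was needed to compute it.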
Row Echelon Form (REF): A matrix reshaped so its leading non-zero entries (pivots) step down and to the right with zeros below them — the staircase Gaussian elimination produces on the way to solving a linear system. See also: Linear systems & RREF · Systems of Equations & Gaussian Elimination

S

Sample: A subset of individuals drawn from a larger group and used to draw conclusions about that group. The quality of a sample matters enormously: a biased sample produces misleading results no matter how sophisticated the analysis. Randomly selecting samples and ensuring they are representative of the population is a foundational discipline in statistics. See also: Sampling methods · Sampling & Reservoir Sampling
Sampling Bias: A systematic skew that occurs when the sample used for analysis is not representative of the population you want to draw conclusions about. Unlike random sampling error (which averages out with more data), sampling bias cannot be corrected by collecting more biased data. See also: Skew & salting · Fairness & bias in ML
Schema: A formal description of how data is structured — the tables, their columns, each column's data type, and the relationships among tables. In some databases the term also refers to a named namespace that groups related tables together. See also: Star vs Snowflake Schemas · The Relational Model
Self-Attention: A mechanism that computes, for each token in a sequence, a weighted mixture of the representations of every token in the sequence (its own included), where the weights reflect how relevant each token is as context. This lets the model capture long-range dependencies, such as a pronoun referring to a noun several sentences back, in a single operation. Learn more → See also: Differential attention · Attention (the RNN era)
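A stripped-down sketch of the weighted-mixture idea follows. It assumes scaled dot-product scoring and, to stay short, uses the raw token vectors as queries, keys, and values; a real transformer layer would first apply learned projection matrices, which are omitted here.

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        # Relevance of every token k to the current token q.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # New representation: weighted mixture of all token vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, X)) for c in range(d)])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-d token vectors
output, attn = self_attention(tokens)
```

Each row of `attn` sums to 1, and the first token puts more weight on the tokens it overlaps with than on the orthogonal second token.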
Semantic Search: A search method that matches queries to documents by meaning rather than by keyword overlap, using embedding vectors so that 'heart attack' and 'myocardial infarction' retrieve the same results. It replaces brittle exact-string matching with geometry: the closer two embeddings in vector space, the more semantically related the texts. See also: Embeddings · Embeddings
Series: A single labelled column of data in pandas — essentially a one-dimensional array with an index. Slicing one column from a DataFrame gives you a Series; combining multiple Series produces a DataFrame. Series operations are vectorised and align automatically on the index, so adding two Series with different row orders still gives correct results. Learn more → See also: Series · Indexing & slicing
Set: An unordered collection of unique elements that supports fast membership tests and set-algebra operations — union, intersection, difference. Adding a duplicate to a set simply does nothing. Use a set when you need to eliminate duplicates or ask 'is this value in the collection?' on large data efficiently. See also: Sets · Lists, Tuples, Dicts, Sets & Gotchas
SettingWithCopyWarning: A pandas warning raised when you assign to the result of a chained indexing operation, so pandas cannot tell whether you are modifying the original DataFrame (through a view, a window into its memory) or a temporary copy. Writing `df[df.age > 18]['score'] = 0` can silently leave `df` unchanged, because `df[df.age > 18]` returns a copy and the assignment lands on that copy. The fix is to assign in one step with `.loc`, as in `df.loc[df.age > 18, 'score'] = 0`, or to call `.copy()` explicitly when you do want an independent DataFrame. Learn more → See also: When O(n²) Kills Your DataFrame · Memory optimization
SGD: Stochastic Gradient Descent updates weights using the gradient computed on a small random subset of the data rather than the full dataset. The randomness introduces noise that can help the model escape sharp local minima, and with momentum and a good learning-rate schedule it remains competitive with adaptive optimizers on many vision tasks. Learn more → See also: Gradient Descent (One Step) · Gradient descent
Sharding: A horizontal scaling technique that splits a database into independent pieces called shards, each stored on a different server, with a routing layer directing each query to the correct shard. It distributes both data volume and request load across machines. See also: Partitioning · Multi-GPU: DDP & FSDP
Shuffle: Typically the most expensive operation in distributed processing: redistributing rows across worker nodes so that all records sharing the same key end up on the same machine before a join or aggregation. It is costly because rows must be serialized, sent over the network, and often spilled to disk, which makes excessive shuffling one of the most common causes of slow Spark jobs. See also: Shuffles · Skew & salting
Sigmoid: An S-shaped function that squashes any real number into the range (0, 1), making it natural for predicting probabilities from a single neuron. Its very flat tails cause vanishing gradients in deep networks, which is why it has largely been replaced by ReLU in hidden layers while remaining popular in output layers for binary classification. Learn more → See also: Logistic Regression · Logistic regression
Simpson's Paradox: A statistical phenomenon where a trend that appears in several separate groups reverses or disappears when the groups are combined. It typically arises when a lurking variable (a confounder) is distributed unevenly across groups, misleading aggregated comparisons. Learn more → See also: Independent vs Mutually Exclusive · Trend, seasonality & decomposition
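The reversal is easiest to see with numbers. The counts below are hypothetical, chosen only to illustrate the effect: treatment A has the higher success rate within each severity group, yet B looks better once the groups are pooled, because A was mostly given to severe cases.

```python
# Hypothetical (successes, patients) counts per treatment and severity group.
data = {
    "A": {"mild": (18, 20), "severe": (48, 80)},
    "B": {"mild": (64, 80), "severe": (10, 20)},
}

def group_rate(treatment, group):
    successes, patients = data[treatment][group]
    return successes / patients

def overall_rate(treatment):
    successes = sum(s for s, _ in data[treatment].values())
    patients = sum(n for _, n in data[treatment].values())
    return successes / patients

a_wins_mild = group_rate("A", "mild") > group_rate("B", "mild")        # 0.90 vs 0.80
a_wins_severe = group_rate("A", "severe") > group_rate("B", "severe")  # 0.60 vs 0.50
b_wins_overall = overall_rate("B") > overall_rate("A")                 # 0.74 vs 0.66
```

Severity is the lurking variable here: it affects the outcome and is distributed unevenly across the two treatments.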
Singular Value Decomposition (SVD): The factorization A = UΣVᵀ that writes any matrix as a rotation, an axis-aligned stretch (the singular values), and another rotation. It is the basis of PCA, compression, the pseudo-inverse, and low-rank methods like LoRA. See also: Singular Value Decomposition · Singular Value Decomposition
Slowly Changing Dimension (SCD): A dimension whose attribute values change infrequently over time — for example, a customer's address or job title. SCD techniques define whether the old value is overwritten (Type 1), preserved in a new row with validity dates (Type 2), or stored alongside the current value (Type 3). Learn more → See also: Dimensional Modeling · Star vs Snowflake Schemas
Snowflake Schema: An extension of the star schema where dimension tables are normalized into sub-dimensions, producing a branching structure that resembles a snowflake. It reduces data redundancy but requires more joins per query compared to a star schema. Learn more → See also: Star vs Snowflake Schemas · Schemas
Softmax: A function applied to a vector of raw scores that converts them into a valid probability distribution: all outputs are positive and sum to exactly 1. The highest score gets amplified disproportionately, making the predicted class stand out, and the result can be directly compared to a one-hot label using cross-entropy loss. Learn more → See also: Differential attention · Maximum likelihood & MAP
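A minimal sketch of softmax in plain Python. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result; the example scores are arbitrary.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that are positive and sum to 1."""
    shift = max(scores)                           # avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # roughly [0.659, 0.242, 0.099]
```

The top score is only twice the second one, but it ends up with well over twice the probability, which is the amplification the definition describes.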
Span: Everything you can reach by scaling and adding a set of vectors. Two non-parallel vectors span a plane; two parallel ones span only a line. See also: Independence, Span, Basis & Dimension · Vectors
Spark: An open-source distributed processing engine that executes large-scale data transformations across a cluster of machines by holding intermediate results in memory rather than writing them to disk between steps. It supports SQL, streaming, machine learning, and graph processing through a unified API. See also: The Spark ecosystem · Shuffles
Standard Error: The standard deviation of an estimate across hypothetical repeated samples, σ/√n for a mean. Because it shrinks like 1/√n, halving your uncertainty costs four times the data. See also: Estimation & confidence intervals · Normal & Standard Normal
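The 1/√n behaviour can be checked with a few lines. The function names and the sample values are illustrative; the second helper swaps the unknown σ for the sample standard deviation, which is the usual estimate in practice.

```python
import math
import statistics

def standard_error_of_mean(sigma, n):
    """Standard error of a sample mean when the population sigma is known."""
    return sigma / math.sqrt(n)

se_100 = standard_error_of_mean(10.0, 100)  # 1.0
se_400 = standard_error_of_mean(10.0, 400)  # 0.5: four times the data, half the error

def estimated_standard_error(sample):
    """Plug-in estimate using the sample standard deviation."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

est = estimated_standard_error([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```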
Star Schema: A dimensional model layout with one central fact table joined to multiple flat dimension tables. The diagram looks like a star — the fact table at the hub, dimension tables as points — and the design enables simple, fast analytical queries. Learn more → See also: Star vs Snowflake Schemas · Slowly Changing Dimensions
Stationarity: A time series is stationary when its statistical properties (mean, variance, autocorrelation) do not change over time. Most classical forecasting models assume stationarity because a shifting mean makes past patterns unreliable guides to the future. Non-stationary series are often made stationary by differencing (subtracting consecutive values) to remove a trend, or by taking logarithms to stabilize a variance that grows with the level. Learn more → See also: Trend, seasonality & decomposition · Why time series is different
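Differencing is simple enough to sketch in a few lines. The helper name `difference` and the toy series are made up for illustration: a series with a steady upward trend has a shifting mean, while its first differences are constant.

```python
def difference(series, lag=1):
    """Subtract from each value the value `lag` steps earlier."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [10 + 3 * t for t in range(8)]  # 10, 13, 16, ... (mean keeps rising)
diffed = difference(trend)              # every entry is 3: the trend is gone
```

Note that the differenced series is one element shorter than the original, because the first value has no predecessor.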
Streaming: A data processing model where records are processed continuously and incrementally as they arrive, with results updated in near-real time rather than waiting for an entire dataset to accumulate. Streaming systems trade some throughput for low end-to-end latency. See also: Change Data Capture (CDC) · Streaming responses
Subquery: A SELECT statement embedded inside another SQL statement, used to compute an intermediate result that the outer query then filters or joins against. When the inner query references columns from the outer query, it is called a correlated subquery. Learn more → See also: INNER JOIN · CASE expressions
Supervisor (orchestrator-workers): A multi-agent topology where a central agent decomposes a task, delegates sub-tasks to worker agents, and synthesizes their results. The most common production multi-agent shape, justified when sub-tasks need isolated context windows. See also: Multi-agent: supervisor & swarm · Workflows
Surrogate Key: A system-generated identifier — often an auto-incrementing integer or UUID — assigned to each row with no business meaning. It remains stable even if the natural business identifier changes, making it safer to use as a join key across time. Learn more → See also: Keys & Integrity Constraints · Slowly Changing Dimensions

T

Target Variable: The outcome a model is built to predict — used interchangeably with 'label' in supervised learning. Choosing the right target is a modelling decision, not a data decision: predicting 'will this customer churn in 30 days?' and 'how many days until this customer churns?' require different targets and different model types even from the same dataset. See also: Multi-token prediction · Logistic regression
Taylor Expansion: Approximating a function near a point by its value, slope, and curvature: f(x+Δ) ≈ f(x) + f′(x)Δ + ½f″(x)Δ². The foundation of linearization and second-order optimization. See also: Jacobian, Hessian & Taylor · Critical Points & Monotonicity
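A quick numerical check of the second-order formula, using exp(x) around x = 0 because its derivatives are all exp(x). The helper name `taylor2` is an assumption for this sketch.

```python
import math

def taylor2(f, df, d2f, x, delta):
    """Second-order Taylor approximation of f(x + delta)."""
    return f(x) + df(x) * delta + 0.5 * d2f(x) * delta ** 2

# exp is its own first and second derivative.
approx = taylor2(math.exp, math.exp, math.exp, 0.0, 0.1)  # 1 + 0.1 + 0.005
exact = math.exp(0.1)
error = abs(exact - approx)  # dominated by the omitted cubic term, delta**3 / 6
```

The error is under 0.0002 for Δ = 0.1, and it grows quickly as Δ moves away from zero, which is why the approximation is only trusted near the expansion point.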
Temperature: A scalar that controls how sharply a language model's probability distribution is peaked over possible next tokens: low temperature makes the model nearly deterministic and conservative, while high temperature flattens the distribution and introduces more variety and surprise. Setting temperature to 0 always picks the most likely token; setting it above 1 can produce creative but incoherent text. See also: Sampling: temperature, top-k, top-p · Softmax
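The effect can be sketched by dividing the raw scores (logits) by the temperature before applying softmax, which is the usual formulation. The function name and the logits below are illustrative; temperature 0 cannot be divided by, so implementations typically treat it as a special case that picks the top token directly.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits scaled by 1 / temperature (temperature must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    shift = max(scaled)                           # numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                          # hypothetical next-token scores
cold = softmax_with_temperature(logits, 0.5)      # sharper distribution
neutral = softmax_with_temperature(logits, 1.0)   # plain softmax
hot = softmax_with_temperature(logits, 2.0)       # flatter distribution
```

The most likely token holds the largest share of probability under `cold` and the smallest under `hot`, while the ranking of tokens never changes.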
Test-Time Compute: Compute spent per query at inference rather than during training. Reasoning models use it to think before answering; accuracy rises with the thinking budget but plateaus and can even decline (overthinking), so there is a cost/quality sweet spot. See also: Reasoning models & test-time compute · Cost & FinOps for ML/GPUs
The GIL (Global Interpreter Lock): A mutex inside CPython that allows only one thread to execute Python bytecode at a time, even on multi-core machines. This makes single-threaded programs safe and simple but means pure-Python threads cannot speed up CPU-bound work. I/O-bound tasks (network calls, disk reads) still benefit from threading because the GIL is released while waiting. Learn more → See also: Multiprocessing · Threading
Time Series: A sequence of measurements recorded at successive, usually equally spaced points in time — stock prices, hourly temperatures, monthly sales. The defining feature is that observations are not independent: yesterday's value predicts today's. Analysing time series requires techniques that account for trend, seasonality, and autocorrelation. See also: Why time series is different · Trend, seasonality & decomposition
Tokenization: The process of splitting raw text into the discrete units—tokens—that a language model consumes. Tokens are neither words nor characters but something in between: common words become single tokens while rare words are split into subword pieces, balancing vocabulary size against the ability to handle unseen terms. See also: Tokenization · Tokenization & BPE
Top-p Sampling: A decoding strategy that at each step considers only the smallest set of tokens whose cumulative probability exceeds a threshold p, discarding the long tail of unlikely options before sampling. Unlike top-k, the number of candidates changes dynamically with the distribution, avoiding situations where the model samples from near-random low-probability tokens. See also: Sampling: temperature, top-k, top-p · Sampling & Reservoir Sampling
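A minimal sketch of the filtering step, assuming the next-token distribution is already available as a dictionary. The function name, the toy vocabulary, and the choice to stop as soon as the cumulative mass reaches `p` are assumptions for illustration.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the surviving tokens form a valid distribution.
    return {token: prob / total for token, prob in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}  # hypothetical
nucleus = top_p_filter(probs, 0.9)  # drops the unlikely tail token "zebra"
next_token = random.choices(list(nucleus), weights=list(nucleus.values()))[0]
```

With a more peaked distribution the same `p` would keep fewer tokens, which is the dynamic behaviour that distinguishes top-p from a fixed top-k cutoff.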
Train-Test Split: The practice of partitioning a dataset into a training portion (used to fit the model) and a test portion (used only to measure final performance). The test set must stay locked away during development; peeking at it — even to choose hyperparameters — inflates apparent performance. Learn more → See also: Supervised vs Unsupervised; Train/Test · Cross-Validation: k-fold, LOO, Stratified
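A small sketch of a random split using only the standard library. The function name, the 20% default, and the fixed seed are assumptions for illustration; libraries such as scikit-learn ship their own helper for this.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rng = random.Random(seed)       # fixed seed keeps the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))
train_rows, test_rows = train_test_split(data)
```

Every row lands in exactly one of the two sets, and reusing the same seed reproduces the same held-out rows, which helps keep the test set fixed across experiments.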
Transaction: A sequence of database operations bracketed by BEGIN and COMMIT that the engine treats as a single indivisible unit. If any step fails, a ROLLBACK undoes every preceding step in that sequence as if none of it happened. See also: Joins & Division · Durable execution for agents
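The all-or-nothing behaviour can be tried with Python's built-in `sqlite3` module. The table, the account names, and the `transfer` helper are hypothetical; the connection is opened with `isolation_level=None` so that the code issues BEGIN, COMMIT, and ROLLBACK itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # we manage transactions
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")

def transfer(conn, src, dst, amount):
    """Move money between accounts; undo both updates if either one fails."""
    try:
        conn.execute("BEGIN")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")  # the credit to dst is undone as well
        return False

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # debit violates CHECK, rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, bob's balance shows no trace of the 500 credit that was applied before the debit failed.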
Transfer Learning: The practice of reusing a model trained on one large task as a starting point for a different, usually smaller task, on the premise that general representations learned early transfer to the new domain. It is why pre-trained language and vision models dominate: a few thousand labeled examples plus a strong prior beats millions of examples and random initialization. See also: Multilingual NLP · Multi-token prediction
Transformer: An architecture that replaces sequential recurrence with self-attention, letting every token directly attend to every other token in a sequence in parallel. Transformers train much faster on modern hardware than RNNs and have become the dominant architecture for language, vision, and multimodal models. Learn more → See also: Multi-head attention · Vision Transformers (ViT)
Tuple: An immutable, ordered sequence of values, written with parentheses: `(1, 'a', 3.0)`. Because tuples cannot be changed, Python can store them more compactly than lists, and a tuple whose elements are all hashable can be used as a dictionary key or set member. They are the natural type for a fixed record such as a (latitude, longitude) coordinate pair. See also: Tuples · Lists, Tuples, Dicts, Sets & Gotchas
Type Hint: Optional annotations that tell readers (and tools) what type a variable, parameter, or return value should be: `def add(x: int, y: int) -> int`. Python does not enforce them at runtime, but type-checkers like mypy and IDE autocomplete use them to catch mistakes early. They make large codebases dramatically easier to maintain. Learn more → See also: Functions · Pydantic v2
Type I Error: Rejecting a null hypothesis that is actually true — a false positive. In an A/B test, this means declaring a winner when the two variants perform identically. The significance level alpha (commonly 0.05) is the maximum Type I error rate you are willing to tolerate. Learn more → See also: A/B testing · z-test, t-test & chi-squared test
Type II Error: Failing to reject a null hypothesis that is actually false (a false negative). In an A/B test, this means missing a real improvement. For a given effect size, reducing Type II error (that is, raising statistical power) means collecting a larger sample, reducing measurement noise, or accepting a higher false positive rate. Learn more → See also: A/B testing · z-test, t-test & chi-squared test

U

Underfitting: A model too simple to capture the real structure in the data — it performs poorly even on training examples because it has not learned enough. The remedy is adding more features, increasing model complexity, or training for longer. Learn more → See also: Overfitting & bias–variance · Feature selection

V

VAE: A Variational Autoencoder learns to encode inputs into a probability distribution over a compact latent space rather than a single point, then decode samples from that distribution back to data. The probabilistic bottleneck forces the latent space to be smooth and continuous, enabling controlled generation by sampling or interpolating in latent space. Learn more → See also: VAEs from scratch · Latent diffusion & Stable Diffusion
Vanishing Gradient: A training failure where gradients shrink exponentially as they travel backward through many layers, leaving early layers learning almost nothing while later layers update normally. It was the main obstacle to training deep networks before ReLU activations, residual connections, and careful initialization became standard. See also: Vanishing & exploding gradients · The training loop
Vector Database: A storage system designed to index and query dense embedding vectors efficiently, enabling nearest-neighbor search across millions of items in milliseconds. It is the retrieval backbone of RAG pipelines: documents are embedded at ingest time, and at query time the closest embeddings to the question embedding are fetched as context. Learn more → See also: Vector databases · Embeddings
Vectorization: The practice of replacing Python loops with array-level operations that NumPy or pandas execute in compiled C code. `array * 2` multiplies every element at once instead of iterating in Python. Vectorised code is typically 10–100× faster than the equivalent loop because it avoids Python's per-iteration overhead. Learn more → See also: Vectorization vs Loops · Why NumPy
Virtual Environment: An isolated folder containing a private Python interpreter and its own package installations, created with `python -m venv`. It prevents version conflicts between projects: project A can use pandas 1.5 while project B uses pandas 2.0 on the same machine. Activating an environment puts its `python` and `pip` commands first on your PATH. See also: Environments & packaging: venv, uv, pex · Getting Started
Vision-Language Model (VLM): A multimodal model that reads images alongside text: a vision encoder splits an image into patches, a projection layer maps them into the LLM's token embedding space, and the transformer processes image tokens next to text tokens. Image cost scales with resolution because more pixels mean more tokens. See also: Multimodal (vision & audio) LLMs · Vision-language model architectures

W

Weight Initialization: The choice of starting values for a network's weights before training. The scale matters enormously: too large and the signal explodes through depth, too small and it vanishes. Xavier/Glorot init (for tanh) and Kaiming/He init (for ReLU) set the scale so each layer roughly preserves variance, which is what lets deep networks train. See also: Weight initialization · Activation functions
Window Function: A SQL function that computes a value for each row by looking at a sliding "window" of related rows, without collapsing them into a single group. It lets you rank, compute running totals, or compare a row to its neighbors while keeping every row in the output. Learn more → See also: Window functions · Ranking functions
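A running total is a typical case. The sketch below runs one through Python's built-in `sqlite3` module, assuming the bundled SQLite is version 3.25 or newer (the release that added window functions); the `sales` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 10), ("east", 2, 20), ("east", 3, 5),
     ("west", 1, 7), ("west", 2, 3)],
)

# SUM(...) OVER (...) adds a running total per region without collapsing rows.
rows = conn.execute(
    """
    SELECT region, day, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
    FROM sales
    ORDER BY region, day
    """
).fetchall()
```

All five input rows appear in the output, each with its own running total, whereas a `GROUP BY region` would have returned only two rows.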
Word2Vec: A family of shallow neural networks trained to predict a word from its neighbors (or vice versa), with the side effect that the learned weight vectors capture meaningful semantic relationships. The famous result is that the vector arithmetic king − man + woman ≈ queen emerges purely from predicting context in a large corpus. See also: word2vec: CBOW & skip-gram · GloVe & fastText