Data & AI glossary
190 terms across data science, statistics, machine learning, deep learning, and AI — defined in plain English, no circular jargon.
A
- A/B Testing
- A controlled experiment that randomly splits users into two groups — one receiving the current version (control) and one receiving a change (treatment) — then measures whether the difference in outcome is larger than sampling noise could explain. Randomization is what lets you attribute the difference to the change rather than to pre-existing differences between users. Learn more → See also: A/B Testing for Decisions · Central limit theorem
- ACID
- Four guarantees a database transaction must satisfy: Atomicity (the whole operation succeeds or none of it does), Consistency (data always moves from one valid state to another), Isolation (concurrent transactions do not interfere), and Durability (committed data survives crashes). Learn more → See also: Delta Lake & MERGE · Circuit breakers & resilience
- Activation Function
- A mathematical gate applied to a neuron's summed inputs that decides how strongly the neuron fires. Without it, stacking layers would collapse to a single linear transformation; non-linear activations like ReLU or sigmoid let networks learn curves, boundaries, and hierarchical abstractions. Learn more → See also: Multi-Layer Perceptron & Activations · Perceptron & the Update Rule
- Adam
- An optimizer that maintains a separate adaptive learning rate for each weight by tracking exponentially decaying averages of both the gradient and the squared gradient. This makes it largely self-tuning across parameters of very different scales, and it (or its variant AdamW) is a common default optimizer for deep learning work. Learn more → See also: Gradient descent · Gradient Descent (One Step)
- Agent
- An AI system that uses a language model as its reasoning engine to plan and execute multi-step tasks, calling tools—web search, code execution, APIs—and deciding what to do next based on intermediate results. Unlike a single prompt-response exchange, an agent loops: observe, think, act, observe again, until the goal is reached. See also: What agentic AI means · AGENTS.md, Skills & Tools
- Aggregate Function
- A function that takes a set of rows as input and returns a single summary value — such as SUM, COUNT, AVG, MIN, or MAX. Aggregates collapse many rows into one scalar per group. Learn more → See also: Aggregations & axis · Concept Hierarchies & Measures
- Airflow
- An open-source workflow orchestration platform where pipelines are defined as Directed Acyclic Graphs (DAGs) in Python code. Each node in the DAG is a task; Airflow schedules, monitors, and retries tasks, providing a web UI to visualize pipeline runs. Learn more → See also: Kubeflow Pipelines · Workflows
- API
- Application Programming Interface — a defined contract that lets two programs exchange data or trigger actions without either needing to know the other's internal code. In data work, 'calling an API' almost always means sending an HTTP request to a web service and receiving structured data back. The API hides complexity: you ask for weather data; the service figures out how to retrieve and format it. See also: A2A — Agent2Agent Protocol · FastAPI
- apply vs vectorize
- `DataFrame.apply()` runs a Python function row-by-row or column-by-column — flexible but slow because Python overhead accumulates over thousands of calls. NumPy's `np.vectorize()` is similar in spirit but still loops under the hood. For real speed, replace both with a built-in pandas or NumPy operation that is already implemented in C. Reserve `apply` for logic too complex to express as a vectorised expression. See also: Pandas UDFs · Why NumPy
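A minimal sketch of the contrast described in the apply vs vectorize entry, assuming a hypothetical table of order lines (the column names `price` and `qty` are made up for illustration). Both versions compute the same line totals; only the second hands the whole column to compiled code.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 order lines with a price and a quantity.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(1.0, 100.0, size=10_000),
    "qty": rng.integers(1, 10, size=10_000),
})

# Row-by-row: pandas calls the Python lambda once per row.
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorised: a single multiplication over whole columns.
fast = df["price"] * df["qty"]

# Both approaches produce the same numbers.
same = np.allclose(slow.to_numpy(), fast.to_numpy())
print(same)
```

Timing the two lines (for example with `timeit`) is the easiest way to see the gap on your own data; the exact ratio depends on the machine and the table size.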
- ARIMA
- AutoRegressive Integrated Moving Average — a classical statistical model for forecasting a univariate time series using its own past values (AR), past forecast errors (MA), and a differencing step (I) to remove trends and make the series stationary. ARIMA(p, d, q) notation specifies the number of autoregressive lags p, differencing operations d, and moving-average terms q. Learn more → See also: SARIMA (seasonal) · Autoregression (AR)
- Autoencoder
- A neural network trained to compress an input through a narrow bottleneck layer and then reconstruct it, forcing the network to distill the most essential information. The bottleneck layer's activations form a compact latent representation used for tasks like denoising, anomaly detection, and as the backbone of more advanced generative models. See also: The Transformer Architecture · Multi-Layer Perceptron & Activations
B
- Backpropagation
- The algorithm that trains neural networks by computing how much each weight contributed to the final error, then nudging every weight in the direction that reduces that error. It works by applying the chain rule of calculus backward through every layer, from the loss all the way to the first layer's weights. See also: Backpropagation foundations · Backpropagation (One Step)
- Batch Normalization
- A technique that standardizes each layer's activations to zero mean and unit variance across the current mini-batch, then applies a learned scale and shift. This stabilizes training (the original motivation was reducing internal covariate shift, although later work disputes that explanation), often allowing higher learning rates and making the network less sensitive to weight initialization. Learn more → See also: Distillation · Normalization, Discretization, Sampling, Compression
- Batch Processing
- A data processing model where a bounded set of records is accumulated over a period and then processed all at once as a single job. Batch jobs maximize throughput and are ideal for scheduled reporting but introduce latency equal to the collection interval. See also: Queues & batch pipelines · Normalization, Discretization, Sampling, Compression
- Batch Size
- The number of training examples processed together before the model's weights are updated once. Larger batches give more accurate gradient estimates but require more memory; smaller batches introduce noise that can help escape local minima and often generalize better. See also: Distillation · Quantization
- Bayes' Theorem
- A formula for updating a prior belief in light of new evidence: the posterior probability of a hypothesis is proportional to its prior probability times the likelihood of the evidence given that hypothesis. It formalizes rational belief revision and underpins Bayesian inference, spam filters, and probabilistic classifiers like Naive Bayes. Learn more → See also: Bayes' Theorem · Conditional & Total Probability
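A worked example of the Bayes' Theorem entry, using made-up numbers for a spam filter (the 20% prior and the two word frequencies are assumptions chosen only to keep the arithmetic easy to follow).

```python
# Hypothetical inputs for a one-word spam filter.
p_spam = 0.20              # prior: P(spam)
p_word_given_spam = 0.60   # likelihood: P("free" appears | spam)
p_word_given_ham = 0.05    # P("free" appears | not spam)

# Law of total probability: how often the word appears overall.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: posterior = likelihood * prior / evidence.
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(round(p_spam_given_word, 3))  # prints 0.75
```

Seeing the word moves the belief that the message is spam from 20% to 75%, because the word is twelve times more likely in spam than in legitimate mail.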
- BERT
- A transformer encoder pre-trained on masked language modeling—predicting randomly hidden words using both left and right context—plus next-sentence prediction. Its bidirectional representations set a new standard on nearly every NLP benchmark when released, and fine-tuning BERT on a task-specific dataset requires far less labeled data than training from scratch. Learn more → See also: The Transformer Architecture · Positional encodings & RoPE
- Bias-Variance Tradeoff
- The fundamental tension in model design: a model with high bias is too rigid and misses real patterns (underfits), while one with high variance is too sensitive to training noise (overfits). The goal is finding the sweet spot where total expected error (bias squared plus variance plus irreducible noise) is minimized. Learn more → See also: The Bias-Variance Trade-off · VAR (multivariate)
- BPE
- Byte-Pair Encoding is a tokenization algorithm that starts with individual characters and iteratively merges the most frequent adjacent pair into a new token, repeating until the vocabulary reaches a target size. The result represents common words as single tokens and rare words as sequences of subword pieces, making it efficient across many languages. See also: Positional encodings & RoPE · Embeddings
- Broadcast Join
- A join strategy in which a small table is copied in full to every worker node so that the large table's rows can be matched locally without any shuffle. It eliminates expensive network data movement when one side of the join fits comfortably in memory. See also: Joins in Spark · Joins & Division
- Broadcasting
- NumPy's rule for performing arithmetic between arrays of different shapes by 'stretching' the smaller array along dimensions of size 1. Subtracting a column mean from every row of a matrix — `matrix - means` — works automatically without writing a loop or manually repeating the means. Broadcasting makes data normalisation concise and efficient. Learn more → See also: Broadcasting · dot, matmul, @
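The `matrix - means` case from the Broadcasting entry can be sketched as follows. The array values are arbitrary; the point is the shapes, where a `(3, 2)` matrix minus a `(2,)` vector works because NumPy treats the vector as shape `(1, 2)` and repeats it down the rows.

```python
import numpy as np

matrix = np.array([[1.0, 10.0],
                   [3.0, 20.0],
                   [5.0, 30.0]])   # shape (3, 2)

means = matrix.mean(axis=0)        # shape (2,): one mean per column
centered = matrix - means          # (3, 2) - (2,) broadcasts across the rows

print(means)     # column means: 3.0 and 20.0
print(centered)  # each column now has mean zero
```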
C
- Cardinality
- The number of distinct values in a column relative to the total number of rows. High-cardinality columns like user IDs have nearly as many unique values as rows; low-cardinality columns like country codes have very few. Cardinality guides index design, join strategy selection, and partition pruning. See also: Rank, Nullity & Solution Sets · Keys & Integrity Constraints
- CDC (Change Data Capture)
- A technique for detecting and capturing row-level inserts, updates, and deletes from a source database as they happen, typically by reading the database's write-ahead log. Downstream systems receive a continuous, ordered stream of changes rather than periodic full-table snapshots. Learn more → See also: Slowly Changing Dimensions · Normalization, Discretization, Sampling, Compression
- Central Limit Theorem
- A foundational result stating that the mean of a large enough random sample is approximately normally distributed, regardless of the shape of the original population distribution, provided that distribution has finite variance. This is why so many statistical tests can rely on normal approximations for sample means even when the underlying data is skewed or discrete. Learn more → See also: Central Limit Theorem & Confidence Intervals · Normal & Standard Normal
- Chain-of-Thought
- A prompting strategy that asks a language model to write out its intermediate reasoning steps before giving a final answer, mimicking how a person 'shows their work.' The explicit reasoning trace improves accuracy on multi-step arithmetic, logic, and common-sense problems where jumping straight to an answer fails. See also: Few-shot & chain-of-thought · Prompt patterns that work
- Closure
- A nested function that remembers the variables from its enclosing scope even after that outer function has returned. It is how Python implements stateful callbacks without a class. Decorators rely on closures to 'wrap' a target function and keep a reference to it. See also: Decorators · Functions, Scope & the Mutable-Default Trap
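A small illustration of the Closure entry: a counter that keeps state between calls without a class. The names `make_counter` and `increment` are illustrative only.

```python
def make_counter():
    count = 0  # lives in the enclosing scope of increment()

    def increment():
        nonlocal count  # rebind the enclosing variable instead of creating a local
        count += 1
        return count

    return increment


counter = make_counter()  # make_counter() has already returned here...
first = counter()         # ...but increment() still remembers `count`
second = counter()

print(first, second)  # prints 1 2
```

Each call to `make_counter()` creates a fresh, independent `count`, so two counters never interfere with each other.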
- CNN
- A Convolutional Neural Network applies small learned filters that slide across an input—typically an image—to detect local patterns such as edges, textures, and shapes regardless of where they appear. Stacking convolution layers followed by pooling lets the network build up from low-level features to high-level semantic concepts. See also: Multi-Layer Perceptron & Activations · Hybrid & neural recommenders
- Collaborative Filtering
- A recommendation technique that predicts a user's preferences by finding other users with similar taste and recommending items those users liked. It requires no knowledge of item content — just the history of who liked what. The key weakness is the cold-start problem: new users or new items with no interaction history receive poor recommendations. Learn more → See also: User-based collaborative filtering · Item-based collaborative filtering
- Columnar Storage
- A physical file layout that stores all values of a single column together on disk rather than grouping columns by row. Analytical queries that touch only a few columns read far less data, and values within a column compress much more efficiently than mixed-type rows. Learn more → See also: Star vs Snowflake Schemas · Warehouse, Lake & Lakehouse
- Confidence Interval
- A range constructed from sample data such that, if you repeated the study many times, the interval would contain the true population parameter in a specified percentage of runs (e.g., 95%). A wide interval signals high uncertainty; a narrow one signals a precise estimate. Learn more → See also: Central Limit Theorem & Confidence Intervals · Central limit theorem
- Confusion Matrix
- A table that breaks predictions into four cells — true positives, true negatives, false positives, and false negatives — giving a complete picture of where a classifier succeeds and fails. Every classification metric (precision, recall, F1, accuracy) can be derived from it. Learn more → See also: Confusion Matrix, Precision, Recall, ROC · ML-specific plots
- Context Window
- The maximum number of tokens—prompt plus generated output combined—that a transformer model can process in one forward pass, determined at training time. Text beyond this limit is simply invisible to the model; expanding context windows is an active research area because longer contexts enable document-level reasoning and multi-turn memory. See also: Tokenization · Hugging Face transformers
- Convolution
- A mathematical operation where a small filter matrix is slid across an input grid, computing a dot product at every position to produce a feature map that highlights where the filter's pattern appears. In neural networks, the filter values are learned during training rather than designed by hand. See also: dot, matmul, @ · Matrix multiplication
- Cosine Similarity
- A measure of how similar two vectors are, computed as the cosine of the angle between them: it ranges from –1 (opposite direction) through 0 (orthogonal, no shared direction) to 1 (identical direction). In recommenders and NLP, items or documents are represented as vectors and cosine similarity finds the closest matches regardless of their magnitude, so a short and a long document about the same topic still score highly. Learn more → See also: Embeddings · Embeddings
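To make the magnitude point in the Cosine Similarity entry concrete, here is a sketch in plain Python. The three "document" vectors are invented; the function assumes neither vector is all zeros, since the norm would then be zero and the division would fail.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction, ten times the magnitude
other_doc = [0.0, 0.0, 5.0]    # shares no dimension with the other two

same_topic = cosine_similarity(short_doc, long_doc)    # approximately 1.0
unrelated = cosine_similarity(short_doc, other_doc)    # exactly 0.0
print(same_topic, unrelated)
```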
- Cross-Validation
- A technique for estimating how well a model generalizes by splitting the data into multiple folds, training on some folds and evaluating on the held-out fold, then rotating which fold is held out. The resulting average score is far more reliable than a single train-test split because every example gets a turn as test data. Learn more → See also: Cross-Validation: k-fold, LOO, Stratified · Hyperparameter tuning
- CTE (Common Table Expression)
- A named, temporary result set defined at the top of a query with the WITH keyword, scoped to that single statement. CTEs let you break a complex query into readable, named building blocks rather than nesting subqueries. Learn more → See also: Recursive CTEs · Subqueries
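The CTE entry can be tried without a database server by using Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical; the query names an intermediate result `customer_totals` and then filters it, instead of nesting a subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 40.0), ("ana", 70.0), ("bo", 15.0), ("cy", 120.0)],
)

query = """
WITH customer_totals AS (          -- the CTE: visible only inside this statement
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer, total
FROM customer_totals
WHERE total > 100
ORDER BY customer
"""

big_spenders = conn.execute(query).fetchall()
conn.close()

print(big_spenders)  # prints [('ana', 110.0), ('cy', 120.0)]
```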
D
- Data Cleaning
- The process of detecting and correcting errors, inconsistencies, and missing values in a raw dataset so it is fit for analysis. Typical tasks include standardising date formats, fixing misspelled category labels, removing duplicate rows, and handling nulls. A commonly cited rule of thumb is that analysts spend 60–80% of project time here, and the effort is justified because model quality cannot exceed data quality. See also: Normalization, Discretization, Sampling, Compression · Data-centric AI
- Data Lake
- A storage repository that holds raw data in its native format — structured, semi-structured, or unstructured — at virtually unlimited scale and low cost. Schema is applied only when the data is read, giving flexibility at the expense of governance if left unmanaged. Learn more → See also: Delta Lake & MERGE · Star vs Snowflake Schemas
- Data Leakage
- The accidental inclusion of information from the future or from the test set into the training process, causing a model to look better than it really is. Classic examples include scaling features using test-set statistics or including a column that is only available after the prediction target is already known. Learn more → See also: Training-Serving Skew · Drift & monitoring
- Data Modeling
- The discipline of deciding how to represent real-world entities and their relationships as database tables, columns, and constraints. A good model makes queries intuitive, prevents data anomalies, and reflects the way the business actually uses the data. Learn more → See also: ER Model & Mapping to Relations · The Relational Model
- Data Pipeline
- A sequence of automated steps that moves data from one or more sources through transformations to a destination. Each step's output is the next step's input, creating a reliable, repeatable flow from raw source to consumable dataset. See also: Normalization, Discretization, Sampling, Compression · Orchestration: Airflow & DAGs
- Data Skew
- An uneven distribution of data across partitions in a distributed system, where one or a few partitions hold far more rows than the rest. Skew causes some workers to finish in seconds while others process for hours, making them the bottleneck for the entire job. See also: Skew & salting · Training-Serving Skew
- Data Warehouse
- A central repository that integrates structured, cleansed data from multiple operational systems, organized for analytical queries rather than transactional writes. Warehouses use column-oriented storage and schemas like star or snowflake to accelerate business intelligence workloads. Learn more → See also: Star vs Snowflake Schemas · Reverse ETL
- Data Wrangling
- Reshaping and transforming raw data into the structure a model or analysis requires — broader than cleaning alone. Wrangling includes merging tables from multiple sources, pivoting from long to wide format, computing derived columns, and encoding categorical variables. The term emphasises the messy, iterative nature of real-world data preparation. See also: Normalization, Discretization, Sampling, Compression · pivot, melt, stack
- Dataclass
- A class decorated with `@dataclass` that auto-generates boilerplate methods — `__init__`, `__repr__`, `__eq__` — from the field annotations you declare. It keeps data-container code concise and readable. Dataclasses are ideal for typed record objects such as a `Transaction(amount: float, currency: str)` without writing repetitive constructor code. Learn more → See also: DataFrame basics · Classes & Instances
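The `Transaction` record mentioned in the Dataclass entry could look like this. The field names come from the entry; the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float
    currency: str


# __init__ is generated from the annotated fields.
t1 = Transaction(amount=9.99, currency="USD")
t2 = Transaction(9.99, "USD")

print(t1)        # prints Transaction(amount=9.99, currency='USD')  (generated __repr__)
print(t1 == t2)  # prints True  (generated __eq__ compares field by field)
```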
- DataFrame
- The core pandas data structure: a two-dimensional table with labelled rows and columns, like a spreadsheet you can manipulate with code. Each column is a Series sharing a common row index. DataFrames support SQL-style queries, joins, aggregations, and direct import from CSV, Excel, JSON, and databases. Learn more → See also: DataFrame intro · Series
- Decision Tree
- A model shaped like a flowchart that splits data by asking a series of yes/no questions about features, assigning a prediction at each leaf. Trees are highly interpretable but prone to overfitting; in practice they are almost always combined into ensembles like random forests or gradient-boosted trees. Learn more → See also: Decision Trees · Decision Trees: Entropy, Gini & Info Gain
- Decoder
- The portion of a model that generates output one step at a time, attending to both previously generated tokens and (in encoder-decoder architectures) the encoder's representation of the input. GPT-style language models are decoder-only, auto-regressively predicting the next token until a stopping condition is met. Learn more → See also: Speculative Decoding · BERT, GPT, T5
- Decorator
- A function that wraps another function to add behaviour — logging, timing, authentication — without touching the original code. In Python it is written as `@my_decorator` on the line above a function definition. Decorators are a clean way to separate cross-cutting concerns from core logic. Learn more → See also: Dunder Methods · FastAPI
- Denormalization
- The deliberate introduction of redundancy into a schema by merging or pre-joining tables so read queries need fewer joins. It trades write complexity and storage for faster analytical query performance. See also: Normalization, Discretization, Sampling, Compression · Deduplication
- Dictionary
- Python's built-in key-value store, written as `{key: value, ...}`. Looking up a value by key is O(1) on average, making dicts the go-to structure for fast lookups, counting frequencies, and grouping data. Keys must be hashable, which in practice means immutable built-ins such as strings, numbers, and tuples of hashable items; values can be anything. See also: Dictionaries · Hash Tables & Dicts
- Diffusion Model
- A generative model trained to reverse a gradual noise-adding process: it learns to predict and remove small amounts of Gaussian noise step by step until a clean sample emerges from pure noise. This iterative denoising produces strikingly high-quality images and audio and underpins tools like Stable Diffusion and DALL·E 3. Learn more → See also: Distillation · Gradient Descent (One Step)
- Dimension Table
- A table in a dimensional model that describes the who, what, where, and when of events stored in a fact table. Dimension tables are relatively small, change slowly, and hold descriptive attributes like customer name, product category, or store region. Learn more → See also: Slowly Changing Dimensions · The Relational Model
- Dimensionality
- The number of features (columns) in a dataset. High dimensionality creates the 'curse of dimensionality': data points become sparse in high-dimensional space, distances lose meaning, and models overfit. Techniques like PCA, feature selection, and embeddings reduce dimensionality to keep models accurate and computationally tractable. See also: Curse of Dimensionality · PCA & Dimensionality Reduction
- Dropout
- A regularization technique that randomly deactivates a fraction of neurons on each training step, forcing the network to learn redundant representations rather than relying on any single pathway. At inference time all neurons are kept active; to keep the expected activations consistent, outputs are scaled either at inference (classic dropout) or, more commonly in current frameworks, during training (inverted dropout). The result is an ensemble-like effect without the cost of training multiple models. Learn more → See also: Multi-Layer Perceptron & Activations · Ridge Regression & Regularization
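A NumPy sketch of the Dropout entry, assuming the "inverted" variant in which the compensating scale factor is applied during training so that inference needs no adjustment. The function name `dropout` and its parameters are illustrative, not any framework's API.

```python
import numpy as np


def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units while training."""
    if not training or rate == 0.0:
        return activations  # inference: every unit stays active, no scaling
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where a unit survives
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
x = np.ones((1000, 100))

train_out = dropout(x, rate=0.5, training=True, rng=rng)   # values are 0.0 or 2.0
eval_out = dropout(x, rate=0.5, training=False, rng=rng)   # identical to x

print(train_out.mean(), eval_out.mean())  # both close to 1.0
```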
E
- EDA (Exploratory Data Analysis)
- An open-ended investigation of a new dataset using summary statistics, distributions, and visualisations to understand its structure, spot anomalies, and form hypotheses before building any model. EDA surfaces problems — wrong data types, unexpected outliers, class imbalances — that would silently ruin a model trained without it. It is the non-negotiable first step in any data project. See also: Heatmaps & pairplot · Why visualization matters
- ELT (Extract, Load, Transform)
- A variant of ETL where raw data is loaded into the destination warehouse first, and transformations are performed inside that warehouse using its own compute. This approach exploits the massive parallel processing power of modern cloud warehouses. Learn more → See also: Reverse ETL · Normalization, Discretization, Sampling, Compression
- Embedding
- A dense, low-dimensional vector that represents a discrete object—a word, token, or entity—in a continuous space where geometric distance reflects semantic similarity. Embeddings are learned during training and compress sparse one-hot identifiers into compact representations that the rest of the network can reason over. See also: Embeddings · Embeddings
- Encoder
- The portion of a model that reads the input and compresses it into a meaningful internal representation—a sequence of vectors in transformers, or a latent code in autoencoders. BERT-style models are encoder-only, excelling at tasks that require understanding the full input before producing an output. Learn more → See also: Embeddings · BERT, GPT, T5
- Epoch
- One complete pass through the entire training dataset, after which the model has seen every example once. Training typically runs for many epochs, with the model improving each time until validation performance stops rising or starts degrading. See also: Cross-Validation: k-fold, LOO, Stratified · Train/val/test & CV
- ETL (Extract, Transform, Load)
- A data movement pattern where data is pulled from source systems, cleaned and reshaped in a separate processing engine, and then loaded into the destination store in its final form. Transformation happens before the data lands in the target. Learn more → See also: Reverse ETL · Normalization, Discretization, Sampling, Compression
F
- F1 Score
- The harmonic mean of precision and recall, ranging from 0 to 1. It gives a single number that balances both concerns, and is especially useful when classes are imbalanced and plain accuracy would be misleading. Learn more → See also: Confusion Matrix, Precision, Recall, ROC · Evaluating recommenders (precision@k, NDCG)
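A short calculation for the F1 Score entry, using hypothetical counts from a confusion matrix. It also shows why the harmonic mean is used: it sits below the plain average whenever precision and recall differ.

```python
# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
tp, fp, fn = 8, 2, 8

precision = tp / (tp + fp)   # 0.8: how many flagged items were right
recall = tp / (tp + fn)      # 0.5: how many real positives were found

f1 = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2

print(round(f1, 3), round(arithmetic_mean, 3))  # prints 0.615 0.65
```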
- Fact Table
- The central table in a dimensional model that stores measurable, quantitative events (such as sales transactions or web page views) along with foreign keys linking to surrounding dimension tables. Fact tables are typically narrow (a handful of keys and numeric measures) but very tall, with one row per event. Learn more → See also: Star vs Snowflake Schemas · The Relational Model
- Feature
- An input variable fed into a machine-learning model — a column in your training table that the model can use to make predictions. Raw data is rarely ready to use directly; feature engineering transforms it into meaningful signals, for example converting a raw timestamp into 'day of week' or 'hours since last purchase'. See also: Lag & rolling features · Feature stores (Feast, Tecton)
- Feature Engineering
- The craft of transforming raw data into inputs that expose useful signal for a model — creating ratio features, extracting day-of-week from a timestamp, binning ages into groups, or combining columns into meaningful interactions. Good feature engineering often matters more than model choice. See also: Feature stores (Feast, Tecton) · Lag & rolling features
- Feature Scaling
- Rescaling input features so they share a comparable numeric range, preventing features with large units (e.g., income in dollars) from dominating those with small ones (e.g., age in years). Algorithms that rely on distances or gradients — like KNN, SVM, or neural networks — require this; tree-based models do not. See also: Lag & rolling features · L1, L2, Elastic Net
- Few-Shot Learning
- Getting a model to perform a task correctly by including a small number of worked examples directly in the prompt, with no gradient updates to the model's weights. Large language models can generalize from as few as one to five examples because pre-training exposed them to the same patterns in diverse contexts. See also: Few-shot & chain-of-thought · Distillation
- Fine-Tuning
- Continuing to train a pre-trained model on a smaller, task-specific dataset so its weights shift toward the new domain without forgetting everything learned during pre-training. Fine-tuning is orders of magnitude cheaper than training from scratch and typically outperforms both random initialization and prompt-only approaches on specialized tasks. See also: Distillation · LoRA & QLoRA fine-tuning
- Forecasting
- Predicting future values of a quantity based on its past behaviour and, optionally, related external variables. Unlike classification or regression on static data, forecasting must respect temporal order — you can never train on future data and test on the past. Common approaches range from ARIMA and exponential smoothing to gradient-boosted trees and neural networks. See also: Evaluating forecasts (walk-forward) · Smoothing & Forecast Error
- Foreign Key
- A column in one table whose values must match an existing primary key value in another table. It enforces referential integrity: you cannot store an order for a customer who does not exist. See also: Keys & Integrity Constraints · Functional Dependencies & Closure
- Foundation Model
- A large model pre-trained on broad, diverse data at enormous scale, intended as a general-purpose base that can be adapted—via fine-tuning, prompting, or retrieval—to a wide range of downstream tasks. The term captures the architectural shift from training task-specific models from scratch to building everything on a shared, reusable backbone. See also: Establishing baselines · Mixture of Experts
G
- GAN
- A Generative Adversarial Network pits two networks against each other: a generator that produces fake samples and a discriminator that tries to distinguish fakes from real data. Training drives the generator to produce increasingly realistic outputs until the discriminator can no longer tell them apart, yielding a model capable of synthesizing images, audio, or video. Learn more → See also: Hybrid & neural recommenders · Multi-Layer Perceptron & Activations
- Generator
- A function that yields values one at a time instead of computing and returning them all at once. Because it produces each item only when asked, a generator uses a fraction of the memory that a list would — critical when processing millions of rows or reading large files. You recognise one by the `yield` keyword inside the function body. Learn more → See also: Comprehensions · Speculative Decoding
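The memory claim in the Generator entry is easy to check. This sketch compares a generator with the equivalent list; `squares` is an illustrative name, and `sys.getsizeof` measures only the container object itself, which is enough to show the difference.

```python
import sys


def squares(n):
    """Yield the squares of 0..n-1 one at a time."""
    for i in range(n):
        yield i * i


lazy = squares(100_000)                     # nothing computed yet
eager = [i * i for i in range(100_000)]     # all 100,000 values held in memory

first_three = [next(lazy) for _ in range(3)]
print(first_three)                                    # prints [0, 1, 4]
print(sys.getsizeof(lazy) < sys.getsizeof(eager))     # prints True
```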
- GPT
- A transformer decoder pre-trained to predict the next token in a sequence, trained on massive web text with no task-specific labels. Scaling GPT—more parameters, more data, more compute—produces models that can write, reason, code, and follow instructions without ever being explicitly taught those tasks. Learn more → See also: The Transformer Architecture · Speculative Decoding
- Gradient Boosting
- An ensemble method that builds trees sequentially, where each new tree is trained to correct the residual errors left by all previous trees. Because each tree targets what the model currently gets wrong, the ensemble improves steadily — making gradient boosting (e.g., XGBoost, LightGBM) the dominant algorithm in tabular-data competitions. Learn more → See also: Decision trees · Random forests
- Gradient Descent
- An iterative optimization algorithm that repeatedly nudges a model's parameters in the direction that most steeply reduces the loss function, like descending a hill by always stepping downhill. The size of each step is controlled by the learning rate. Learn more → See also: Gradient Descent (One Step) · Partial derivatives
- GROUP BY
- A SQL clause that partitions rows into buckets sharing the same value in one or more columns, then applies an aggregate function to each bucket. Every column in the SELECT that is not inside an aggregate must appear in the GROUP BY list. Learn more → See also: ORDER BY, LIMIT, DISTINCT · GroupBy
- GroupBy
- A pandas operation that splits a DataFrame into groups by the unique values of one or more columns, applies an aggregation or transformation to each group, and combines the results — the split-apply-combine pattern. `df.groupby('country')['revenue'].sum()` computes total revenue per country in one readable line. Learn more → See also: Aggregates & GROUP BY · Method chaining
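The one-liner quoted in the GroupBy entry, run on a tiny hypothetical table so the split-apply-combine result is visible.

```python
import pandas as pd

# Hypothetical sales rows.
df = pd.DataFrame({
    "country": ["DE", "FR", "DE", "FR", "ES"],
    "revenue": [100, 50, 25, 75, 10],
})

# Split by country, sum revenue within each group, combine into one Series.
totals = df.groupby("country")["revenue"].sum()

print(totals.to_dict())  # prints {'DE': 125, 'ES': 10, 'FR': 125}
```

The result is indexed by the group key, sorted alphabetically by default.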
H
- Hallucination
- When a language model generates factually wrong or entirely fabricated information with the same confident tone it uses for accurate facts. It happens because the model is trained to produce plausible-sounding token sequences, not to verify claims against a ground truth, making retrieval augmentation and careful prompting necessary safeguards. See also: Speculative Decoding · What an LLM is
- Hyperparameter
- A configuration setting chosen before training that governs how the learning algorithm works — such as the number of trees in a forest, the depth of a neural network, or the regularization strength. Unlike model parameters (which are learned from data), hyperparameters are set by the practitioner, often via cross-validated search. See also: Hyperparameter tuning · L1, L2, Elastic Net
I
- Idempotency
- The property of an operation that produces the same result no matter how many times it is executed. A pipeline step is idempotent if re-running it after a failure does not duplicate or corrupt data. See also: Projections & Idempotent Matrices · Inverse & Invertibility
- Index
- A separate data structure the database maintains alongside a table so it can find rows matching a condition without scanning every row. Think of it as a book's back-index: instead of reading every page, you jump straight to the right one. See also: File Organization & Indexing · Hash Tables & Dicts
- Inference
- Running a trained model on new inputs to produce predictions, as opposed to the training phase where weights are updated. Inference speed and cost dominate production AI systems because a model may be trained once but queried billions of times, making efficiency techniques like batching, quantization, and caching economically critical. See also: Approximate Inference: Sampling · Speculative Decoding
- Inner Join
- A JOIN that returns only the rows for which a matching row exists in both tables. Rows that have no counterpart in the other table are silently excluded from the result. Learn more → See also: LEFT, RIGHT, FULL · Anti-joins
- Iterator
- Any object that delivers one item at a time via `next()` and raises `StopIteration` when exhausted. For-loops, comprehensions, and `zip()` all work by calling `next()` behind the scenes. Iterators let Python process sequences of any length without loading everything into memory at once. See also: Iterators · Generators
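The protocol can be sketched with a small hand-written iterator. `Countdown` is a made-up class for illustration; the `__iter__` and `__next__` methods are the standard Python hooks.

```python
class Countdown:
    """Iterator that yields start, start - 1, ..., 1 and then stops."""

    def __init__(self, start: int) -> None:
        self.current = start

    def __iter__(self):
        return self  # an iterator returns itself from __iter__

    def __next__(self) -> int:
        if self.current <= 0:
            raise StopIteration  # signals that the iterator is exhausted
        value = self.current
        self.current -= 1
        return value


it = Countdown(3)
first = next(it)  # 3
rest = list(it)   # [2, 1]; list() keeps calling next() until StopIteration

try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

print(first, rest, exhausted)  # 3 [2, 1] True
```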
J
- JOIN
- An operation that combines rows from two or more tables based on a matching condition between their columns. The result is a single virtual table that can draw columns from all participating sources. Learn more → See also: Joins & Division · LEFT, RIGHT, FULL
- JSON
- JavaScript Object Notation: a lightweight, human-readable text format for structured data built from nested objects (key-value pairs) and arrays. It has become the default data format for web APIs. Python's `json` module parses a JSON string into native Python objects (a dict for a JSON object, a list for an array) in one call: `data = json.loads(response.text)`. See also: Structured outputs · Pydantic for LLM outputs
K
- K-Means Clustering
- An unsupervised algorithm that partitions data into k groups by repeatedly assigning each point to its nearest cluster center, then recalculating the centers as the mean of their assigned points, until assignments stop changing. The user must specify k in advance, which is often determined by examining the inertia curve (the 'elbow method'). See also: k-means & k-medoid Clustering · k-Nearest Neighbours
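The assign-then-recompute loop can be sketched in pure Python on one-dimensional data. This is a simplified illustration with hand-picked starting centers, not a production implementation; the function name `kmeans_1d` is hypothetical.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Basic k-means on 1-D data; `centers` holds the initial guesses."""
    centers = list(centers)  # work on a copy
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each point.
        new_assignments = [
            min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = sum(members) / len(members)
    return centers, assignments


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, labels = kmeans_1d(points, centers=[0.0, 5.0])

print(centers)  # [2.0, 11.0]
print(labels)   # [0, 0, 0, 1, 1, 1]
```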
- K-Nearest Neighbors (KNN)
- A simple algorithm that classifies or regresses a new point by taking a vote (or average) among its k closest training examples in feature space. There is no explicit training phase — the entire dataset is the model — so prediction is slow on large datasets, and the method is sensitive to the scale of features. See also: k-Nearest Neighbours · k-means & k-medoid Clustering
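A minimal classification sketch using only the standard library. The training points and the helper name `knn_predict` are invented for illustration, and the features are already on comparable scales.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (point, label) pairs; returns the majority label of the k nearest."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))  # a
print(knn_predict(train, (5.0, 4.8)))  # b
```

Note that all the work happens at prediction time: every query sorts the full training set by distance, which is why the method slows down on large datasets.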
- Kafka
- A distributed, fault-tolerant event-streaming platform that stores ordered streams of messages in topics and delivers them to consumers at high throughput. Producers append messages to a topic's log; each consumer group reads from its own offset, so messages can be replayed without coordination. See also: Sessionization · Bloom Filters, HyperLogLog & MinHash-LSH
L
- L1 Regularization (Lasso)
- A regularization method that adds the sum of absolute values of the model's weights to the loss function. Because it can push individual weights all the way to zero, it acts as an automatic feature selector — useful when you suspect only a handful of features truly matter. Learn more → See also: Ridge Regression & Regularization · Logistic Regression
- L2 Regularization (Ridge)
- A regularization method that adds the sum of squared weights to the loss function, pushing all weights toward zero but rarely to exactly zero. It handles correlated features gracefully and is the default choice when you want smooth, stable coefficients rather than feature elimination. Learn more → See also: Ridge Regression & Regularization · Logistic Regression
- Label
- The known answer for a training example in a supervised learning problem — the column your model is trying to predict. In an email spam classifier the label is 'spam' or 'not spam'; in a house-price model it is the actual sale price. Models learn by comparing their guesses against labels and adjusting until errors are small. See also: Logistic regression · Supervised vs Unsupervised; Train/Test
- Lakehouse
- An architecture that layers transactional table formats (such as Delta Lake or Apache Iceberg) on top of cheap object storage, giving a data lake ACID guarantees and SQL-queryable structure previously only available in warehouses. Learn more → See also: Delta Lake & MERGE · OLTP vs OLAP
- Lambda
- A throwaway, anonymous function defined in a single expression: `lambda x: x * 2`. Use it when you need a simple function as an argument (for example, a custom sort key) and it is too small to justify a named `def`. A lambda body must be one expression, so it cannot contain statements such as assignments, `return`, or `for`/`while` loops. See also: Function/tool calling · Chains & LCEL
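For instance, a lambda used as a sort key might look like the sketch below; the `people` list is made-up sample data.

```python
people = [("Ada", 36), ("Linus", 28), ("Grace", 45)]

# Sort by the second element of each tuple (the age).
by_age = sorted(people, key=lambda person: person[1])

print(by_age)  # [('Linus', 28), ('Ada', 36), ('Grace', 45)]
```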
- Large Language Model
- A neural network with billions of parameters trained on internet-scale text to predict tokens. From that training it acquires broad abilities (summarization, translation, question answering, code generation) that were never explicitly programmed. 'Large' refers both to parameter count and to the data and compute required, which together enable capabilities absent in smaller models. See also: What an LLM is · Speculative Decoding
- Latent Space
- The lower-dimensional internal representation space that a model uses to encode its inputs, where similar concepts cluster together and relationships can be explored by arithmetic on vectors. Generative models use the latent space as the 'creative canvas': sampling or editing a point in latent space and decoding it produces a corresponding output. See also: Embeddings · Vector Spaces & Subspaces
- Learning Rate
- A number that controls how large each gradient descent step is. Too high and the model overshoots the minimum, oscillating or diverging; too low and training is painfully slow. Finding the right learning rate — or scheduling it to decrease over time — is one of the most impactful hyperparameter decisions in deep learning. Learn more → See also: Learning-rate schedules · Gradient Descent (One Step)
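A toy sketch of the three regimes, assuming plain gradient descent on f(w) = w², whose gradient is 2w and whose minimum is at w = 0. The function name `descend` and the specific rates are illustrative choices.

```python
def descend(lr: float, steps: int = 20, w: float = 1.0) -> float:
    """Run `steps` gradient-descent updates on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2 * w    # derivative of w**2
        w -= lr * grad  # the learning rate scales the step
    return w


good = descend(lr=0.1)        # shrinks steadily toward the minimum at 0
too_low = descend(lr=0.001)   # barely moves from the starting point
too_high = descend(lr=1.1)    # overshoots further on every step and diverges

print(abs(good) < 0.02, too_low > 0.9, abs(too_high) > 1.0)  # True True True
```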
- Left Join
- A JOIN that returns every row from the left table plus any matching rows from the right table. When no match exists in the right table, the right-side columns appear as NULL rather than being dropped. Learn more → See also: INNER JOIN · Joins & Division
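To make the contrast with an inner join concrete, here is a small sketch using an in-memory SQLite database. The `customers` and `orders` tables are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (10, 1, 9.5), (11, 1, 20.0), (12, 2, 5.0);
""")

query = (
    "SELECT c.name, o.amount FROM customers c "
    "{kind} JOIN orders o ON o.customer_id = c.id "
    "ORDER BY c.id, o.id"
)
inner = conn.execute(query.format(kind="INNER")).fetchall()
left = conn.execute(query.format(kind="LEFT")).fetchall()
conn.close()

print(inner)  # [('Ada', 9.5), ('Ada', 20.0), ('Grace', 5.0)]
print(left)   # [('Ada', 9.5), ('Ada', 20.0), ('Grace', 5.0), ('Linus', None)]
```

Linus has no orders, so the inner join drops him while the left join keeps him with a NULL amount (shown as `None` in Python).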
- List Comprehension
- A compact Python syntax for building a new list by applying an expression to each item in an iterable, optionally filtering items with a condition, all in one line. It replaces a multi-line for-loop with something like `[x**2 for x in range(10) if x % 2 == 0]`. A list comprehension is usually somewhat faster than the equivalent `for` loop with `.append()`, because CPython avoids a method lookup and call on every iteration. Learn more → See also: Lists · Control Flow
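The two forms below build the same list, which may help show what the one-liner replaces.

```python
# Loop version: four lines and an explicit append.
squares_loop = []
for x in range(10):
    if x % 2 == 0:
        squares_loop.append(x**2)

# Comprehension version: expression, iteration, and filter in one line.
squares_comp = [x**2 for x in range(10) if x % 2 == 0]

print(squares_comp)                  # [0, 4, 16, 36, 64]
print(squares_loop == squares_comp)  # True
```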
- LoRA
- Low-Rank Adaptation freezes a pre-trained model's weights and injects small trainable rank-decomposition matrices beside each attention layer, adapting the model to a new task with a fraction of the parameters and memory a full fine-tune would require. The original weights are untouched, so multiple LoRA adapters can be swapped in and out of the same base model cheaply. Learn more → See also: LLMOps — operating LLMs · Drift & monitoring
- Loss Function
- A mathematical formula that quantifies how wrong a single prediction is by comparing it to the true answer. During training, the optimizer minimizes this quantity across examples; the choice of loss function encodes what kinds of errors matter — mean squared error penalizes large mistakes heavily, while cross-entropy is suited to probability outputs. See also: Loss Functions · Logistic Regression
- LSTM
- Long Short-Term Memory is an RNN variant with an explicit memory cell and three learned gates (input, forget, and output) that control what information is written, erased, and read at each step. The gating mechanism lets LSTMs carry relevant context across hundreds of time steps by greatly reducing, though not eliminating, the vanishing-gradient problem that limits plain RNNs. See also: Multi-Layer Perceptron & Activations · Agent Memory
M
- Materialized View
- A database object that stores the pre-computed results of a query on disk rather than recomputing them on every request. Queries against it are fast because they read the stored result instead of re-running the underlying query, but the view must be refreshed periodically to reflect changes in the underlying data. See also: Window functions · Slowly Changing Dimensions
- MCP
- The Model Context Protocol is an open standard that defines how AI agents discover and call external tools and data sources through a uniform interface, much like HTTP standardized web communication. By separating the tool definitions from the model, MCP lets the same agent code connect to any compliant server without bespoke integration work. Learn more → See also: FastMCP — build MCP servers fast · MCP vs A2A vs ACP vs ANP
- Merge
- Combining two DataFrames horizontally by matching rows on a shared key column, exactly like a SQL JOIN. `pd.merge(orders, customers, on='customer_id', how='left')` attaches customer details to every order row, keeping all orders even when no customer record exists. The `how` parameter controls which rows survive: inner, left, right, or outer. Learn more → See also: Joins in Spark · INNER JOIN
- Missing Data
- Values that are absent from a dataset, represented in pandas as `NaN` (for numeric columns) or `None` / `pd.NA`. Missing data can bias statistics and crash models if ignored. Analysts either drop affected rows, fill them with a substitute value (mean, median, a placeholder), or flag them with an indicator column. Learn more → See also: DataFrame basics · Mean, Median, Mode & z-scores
- Multi-Head Attention
- An extension of self-attention that runs several independent attention computations in parallel, each with its own learned projections, then concatenates their outputs. Different heads can learn to track different types of relationships simultaneously—syntax in one head, coreference in another—giving the model richer contextual representations. Learn more → See also: Self-attention · The Transformer Architecture
- Mutability
- Whether an object's contents can be changed after it is created. Lists, dicts, and sets are mutable — you can add, remove, or replace items in place. Strings, tuples, and integers are immutable — any 'change' produces a new object. Mutability matters because passing a mutable object to a function lets that function silently alter the original. See also: Lists, Tuples, Dicts, Sets & Gotchas · Tuples
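A short sketch of the "silently alter the original" hazard; the function names are illustrative.

```python
def add_total(row: list) -> None:
    """Appends in place, so the caller's list is changed."""
    row.append(sum(row))


def shout(text: str) -> str:
    """Strings are immutable, so this can only return a new object."""
    return text.upper()


scores = [1, 2, 3]
add_total(scores)
print(scores)  # [1, 2, 3, 6]  (the original list was modified)

name = "ada"
loud = shout(name)
print(name, loud)  # ada ADA  (the original string is untouched)
```

Passing `scores.copy()` instead of `scores` is one simple way to protect the caller's data when a function mutates its argument.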
N
- NaN
- Short for 'Not a Number', the floating-point sentinel value used to represent a missing or undefined numeric result. In element-wise arithmetic `NaN` propagates silently: adding any number to `NaN` gives `NaN`. NumPy aggregations such as `np.sum` propagate it too, whereas pandas aggregations skip `NaN` by default (`skipna=True`), which can hide missing data instead of flagging it. Use `df.isna()` to locate NaNs before they cause silent errors. Learn more → See also: Numbers · Random & reproducibility
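The behaviour of the underlying float value can be seen without pandas at all. The sketch below uses only plain Python floats and the standard `math` module.

```python
import math

nan = float("nan")

print(math.isnan(nan + 1))  # True: arithmetic with NaN yields NaN
print(nan == nan)           # False: NaN is not equal to anything, itself included

values = [1.0, nan, 3.0]
total = sum(values)  # the single NaN poisons the built-in sum
clean_total = sum(v for v in values if not math.isnan(v))

print(math.isnan(total))  # True
print(clean_total)        # 4.0
```

Because `nan == nan` is false, equality checks cannot find NaNs; use `math.isnan` (or `isna()` in pandas) instead.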
- Neural Network
- A computational system built from layers of interconnected nodes that transform raw input data into predictions by learning weighted connections during training. Each layer extracts progressively more abstract features, enabling the network to recognize patterns too complex for hand-coded rules. See also: Perceptron & the Update Rule · Multi-Layer Perceptron & Activations
- Normalization
- The process of organizing table columns and relationships to eliminate redundant data and prevent update anomalies. A normalized design stores each fact in exactly one place so changing it once keeps the whole database consistent. See also: Normal Forms: 1NF to BCNF · Normalization, Discretization, Sampling, Compression
- NumPy Array
- A fixed-type, multi-dimensional grid of numbers stored in a single contiguous block of memory. Unlike a Python list, every element has the same data type, so NumPy can apply C-speed loops over the entire array without the overhead of Python object handling. It is the universal currency of numerical Python: pandas, scikit-learn, and TensorFlow all accept or return NumPy arrays. Learn more → See also: The ndarray · Why NumPy
O
- OLAP (Online Analytical Processing)
- A database usage pattern designed for queries that scan millions of rows to compute aggregates, trends, and comparisons across large historical datasets. OLAP systems trade write speed for fast analytical read throughput. Learn more → See also: Concept Hierarchies & Measures · Warehouse, Lake & Lakehouse
- OLTP (Online Transaction Processing)
- A database usage pattern optimized for high-volume, short-lived read/write operations that affect a small number of rows at a time — like inserting an order or updating an account balance. OLTP systems prioritize low latency and ACID compliance. Learn more → See also: ETL vs ELT · Warehouse, Lake & Lakehouse
- One-Hot Encoding
- A way to represent a categorical variable with no natural order (such as country or color) by replacing it with a set of binary columns, one per category, where exactly one column is 1 and the rest are 0. This lets algorithms treat categories as distinct without implying a false numeric ranking between them. See also: Positional encodings & RoPE · Rank, Nullity & Solution Sets
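A hand-rolled sketch of the encoding, assuming the column order is simply the sorted category names. In practice a library routine would normally be used; `one_hot` here is a hypothetical helper.

```python
def one_hot(values):
    """Return (categories, rows) where each row has a single 1 in its category's column."""
    categories = sorted(set(values))
    column = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        rows.append(row)
    return categories, rows


categories, encoded = one_hot(["red", "green", "blue", "green"])

print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```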
- Optimizer
- The algorithm that uses computed gradients to update a model's weights, determining how quickly and in which direction each parameter moves. The choice of optimizer—and its hyperparameters such as learning rate—has a major influence on whether training converges, how fast it does, and the quality of the final model. Learn more → See also: Gradient Descent (One Step) · Partial derivatives
- Orchestration
- The coordination of when, in what order, and on what schedule individual pipeline tasks run, including handling retries on failure and alerting on errors. An orchestrator manages the dependencies between tasks so each one starts only when its inputs are ready. Learn more → See also: Workflows · Agentic design patterns
- Outlier
- A data point that lies far outside the typical range of the other values. Outliers can be legitimate extreme events (a billionaire in an income survey), measurement errors, or data-entry mistakes. Left untreated they skew means, distort regression lines, and confuse distance-based models; whether to remove, cap, or keep them depends on why they exist. See also: Skew & salting · Training-Serving Skew
- Overfitting
- What happens when a model memorizes the training data's noise and quirks rather than the underlying pattern, so it scores impressively on training examples but fails on new ones. Common remedies are more data, regularization, early stopping, or a simpler model architecture. Learn more → See also: Data-centric AI · Training-Serving Skew
P
- P-Value
- The probability of observing a result at least as extreme as the one you got, assuming the null hypothesis is true. A small p-value (e.g., below 0.05) means the data would be very surprising if nothing were going on — but it does not measure how large or practically important the effect is. Learn more → See also: A/B testing · Expected Value
- Parquet
- An open-source, columnar file format designed for efficient storage and fast analytical reads of large datasets. It embeds the schema inside the file, supports predicate pushdown, and achieves high compression ratios, making it the de-facto standard for data lake storage. Learn more → See also: Warehouse, Lake & Lakehouse · Delta Lake & MERGE
- Partitioning
- The division of a large table into smaller, physically separate segments based on the values of one or more columns, such as date or region. Queries that filter on the partition key skip irrelevant segments entirely, dramatically reducing I/O. See also: Partitioning · Columnar Storage & Parquet
- Perceptron
- The simplest neural unit: it multiplies each input by a learned weight, sums the results, adds a bias, and passes the total through an activation function to produce one output. A single perceptron can separate two linearly separable classes; stacking many of them creates a deep network capable of far richer boundaries. See also: Perceptron & the Update Rule · Multi-Layer Perceptron & Activations
- Pickle
- Python's built-in serialisation format that converts most Python objects (a trained model, a complex dict, an instance of a custom class) into bytes that can be saved to disk and loaded back later. A few things, such as lambdas and open file handles, cannot be pickled. It is convenient for caching expensive computation between sessions, but pickle files from untrusted sources can execute arbitrary code on load, so never unpickle data you did not create yourself. See also: DataFrame basics · Lists
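A minimal round trip through bytes, using a made-up dict as a stand-in for a trained model.

```python
import pickle

# Stand-in for an expensive-to-compute object.
model = {"weights": [0.1, 0.2], "bias": -1.0, "classes": ("spam", "ham")}

blob = pickle.dumps(model)     # serialise to bytes
restored = pickle.loads(blob)  # safe here only because we created `blob` ourselves

print(isinstance(blob, bytes))  # True
print(restored == model)        # True: same contents
print(restored is model)        # False: a new, independent object
```

`pickle.dump` and `pickle.load` do the same thing against an open binary file instead of an in-memory bytes object.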
- Pivot Table
- A reshaped summary table where unique values of one column become the new column headers, allowing you to see aggregated metrics across two dimensions at a glance. In pandas, `pd.pivot_table(df, values='sales', index='region', columns='product', aggfunc='sum')` turns long-format transaction data into a region-by-product sales matrix. See also: pivot, melt, stack · GroupBy
- Pooling
- A downsampling operation that summarizes small regions of a feature map into single values—typically the maximum or average—reducing spatial dimensions and making representations more robust to small translations. It shrinks the computation required for subsequent layers while retaining the most salient detected features. See also: Distillation · Quantization
- Population
- The complete set of individuals or observations you want to draw conclusions about. Because measuring every member of a population is often impractical, analysts work with samples and use statistics to infer population properties. Knowing your intended population prevents scope creep — a model trained on US customers does not automatically generalise to global ones. See also: Sampling & Reservoir Sampling · Approximate Inference: Sampling
- Positional Encoding
- A signal added to each token's embedding before it enters the transformer to tell the model the token's position in the sequence, since self-attention itself is order-agnostic. Fixed sinusoidal patterns or learned vectors both work; without positional encoding, the model would treat 'dog bites man' and 'man bites dog' identically. Learn more → See also: Embeddings · The Transformer Architecture
- Precision
- Of all the examples a model labeled positive, the fraction that truly are positive. High precision means few false alarms — critical in spam filters where legitimate mail must not be mislabeled. Learn more → See also: Model Calibration · Confusion Matrix, Precision, Recall, ROC
- Primary Key
- A column or set of columns whose values uniquely identify every row in a table. No two rows can share the same primary key value, and the column cannot be NULL, making it the definitive address for any record. See also: Keys & Integrity Constraints · Rank, Nullity & Solution Sets
- Principal Component Analysis (PCA)
- A technique that reduces the number of features by finding new axes (principal components) that capture the directions of greatest variance in the data, then projecting the data onto the top few components. It compresses information, replaces correlated features with uncorrelated components, and can make downstream models faster and less prone to the curse of dimensionality. Learn more → See also: PCA & Dimensionality Reduction · Projections & Idempotent Matrices
- Prompt Engineering
- The craft of designing the text input to a language model—instructions, examples, context, formatting—to reliably elicit the desired output without changing the model's weights. Because LLMs are sensitive to phrasing, a well-designed prompt can dramatically improve accuracy, safety, or output structure on the same underlying model. See also: Prompt patterns that work · LLMOps — operating LLMs
Q
- Quantization
- Reducing the numerical precision of a model's weights—from 32-bit floats to 8-bit or 4-bit integers—to shrink its memory footprint and speed up inference with minimal accuracy loss. Quantized models run on consumer GPUs and even CPUs that could not fit the full-precision version, democratizing access to large models. See also: Quantization · Distillation
- Query Optimizer
- The component inside a database engine that analyzes a SQL statement and selects the most efficient execution plan — deciding join order, which indexes to use, and how to parallelize work — before any data is touched. See also: Adaptive Query Execution · SELECT basics
R
- RAG
- Retrieval-Augmented Generation combines a language model with a retrieval system: at query time, relevant documents are fetched from an external store and injected into the prompt so the model can ground its answer in up-to-date, verifiable text rather than relying solely on memorized training knowledge. This reduces hallucinations and avoids the need to retrain the model whenever facts change. Learn more → See also: Advanced RAG · RAG evaluations
- Random Forest
- An ensemble of decision trees, each trained on a random bootstrap sample of the data with a random subset of features considered at each split. Averaging many diverse, slightly wrong trees cancels out their individual errors, producing a robust model that rarely overfits as badly as a single tree. Learn more → See also: Decision Trees · Decision trees
- RDD (Resilient Distributed Dataset)
- Spark's foundational distributed data abstraction: an immutable, partitioned collection of records spread across a cluster that can be recomputed from its lineage if a partition is lost. Modern Spark code favors DataFrames, but RDDs remain the underlying execution primitive. See also: DataFrame intro · Databricks platform overview
- Recall
- Of all examples that actually are positive, the fraction the model correctly identified. High recall means few real positives are missed — critical in cancer screening where missing a true case is far costlier than a false alarm. Learn more → See also: Confusion Matrix, Precision, Recall, ROC · Model Calibration
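Recall and its companion metric, precision, can both be computed from counts of true positives, false positives, and false negatives. The sketch below assumes binary labels encoded as 1 (positive) and 0 (negative); the data is invented.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(round(precision, 3), recall)  # 0.667 0.5
```

Here the model flags three examples and two are right (precision 2/3), but it finds only two of the four real positives (recall 1/2).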
- Recommender System
- A model that predicts which items a user is most likely to find relevant and surfaces them proactively — products, movies, articles, songs. Recommenders power a huge share of internet engagement. The two dominant families are collaborative filtering (learn from collective user behaviour) and content-based filtering (match item attributes to user preferences). Learn more → See also: Content-based filtering · The utility matrix
- Regularization
- A family of techniques that add a penalty to a model's loss function to discourage it from growing overly complex, reducing overfitting. The two most common forms are L1 (which zeroes out unimportant features) and L2 (which shrinks all weights toward zero without eliminating any). Learn more → See also: Ridge Regression & Regularization · Dropout, BN, LN
- ReLU
- Short for Rectified Linear Unit, it outputs the input unchanged if positive and zero otherwise. This simple rule is cheap to compute, usually trains faster than older activations such as sigmoid, and mitigates the vanishing-gradient problem because its gradient is exactly 1 for every positive input. Its sparsity (many neurons output zero) also acts as a mild regularizer. Learn more → See also: Multi-Layer Perceptron & Activations · Ridge Regression & Regularization
- REST
- Representational State Transfer — an architectural style for web APIs that maps operations to HTTP verbs (GET to read, POST to create, PUT/PATCH to update, DELETE to remove) and addresses resources with URLs. A REST API for a user database might use `GET /users/42` to fetch user 42 and `POST /users` to create a new one. Most public data APIs follow REST conventions. See also: Serving with FastAPI · The Relational Model
- RLHF
- Reinforcement Learning from Human Feedback trains a reward model on human preference judgments—which of two outputs is better—then uses that reward signal to fine-tune the language model via reinforcement learning. This is how instruction-following and safety behaviors are instilled in models like ChatGPT, aligning outputs with what humans actually want rather than raw token prediction. See also: Implicit vs explicit feedback · Human-in-the-loop
- RNN
- A Recurrent Neural Network processes sequences one element at a time, passing a hidden state forward so each step can depend on everything seen so far. This makes RNNs naturally suited for text and time-series, but long sequences cause vanishing gradients that prevent the network from retaining early context. See also: The autoregressive loop · Backpropagation foundations
- ROC AUC
- The area under the Receiver Operating Characteristic curve, which plots true positive rate against false positive rate at every possible classification threshold. A score of 0.5 means the model is no better than random guessing; 1.0 means perfect discrimination — making AUC a threshold-independent measure of ranking quality. Learn more → See also: Confusion Matrix, Precision, Recall, ROC · Model Calibration
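One way to see why AUC measures ranking quality: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it by that pairwise definition rather than by tracing the curve; the function name and data are illustrative.

```python
def roc_auc(y_true, scores):
    """AUC via pairwise comparison of every positive score with every negative score."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_true, scores))  # 0.75
```

This pairwise form is quadratic in the number of examples, so it is meant for understanding rather than for large datasets.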
S
- Sample
- A subset of individuals drawn from a larger group and used to draw conclusions about that group. The quality of a sample matters enormously: a biased sample produces misleading results no matter how sophisticated the analysis. Randomly selecting samples and ensuring they are representative of the population is a foundational discipline in statistics. See also: Sampling & Reservoir Sampling · Approximate Inference: Sampling
- Sampling Bias
- A systematic skew that occurs when the sample used for analysis is not representative of the population you want to draw conclusions about. Unlike random sampling error (which averages out with more data), sampling bias cannot be corrected by collecting more biased data. See also: Skew & salting · The Bias-Variance Trade-off
- Schema
- A formal description of how data is structured — the tables, their columns, each column's data type, and the relationships among tables. In some databases the term also refers to a named namespace that groups related tables together. See also: Star vs Snowflake Schemas · The Relational Model
- Self-Attention
- A mechanism that computes, for each token in a sequence, a weighted mixture of the representations of all tokens in the sequence (its own included), where the weights reflect how relevant each token is as context. This lets the model capture long-range dependencies, such as a pronoun referring to a noun several sentences back, in a single operation. Learn more → See also: Multi-head attention · Softmax
- Semantic Search
- A search method that matches queries to documents by meaning rather than by keyword overlap, using embedding vectors so that 'heart attack' and 'myocardial infarction' retrieve the same results. It replaces brittle exact-string matching with geometry: the closer two embeddings in vector space, the more semantically related the texts. See also: Embeddings · Embeddings
- Series
- A single labelled column of data in pandas — essentially a one-dimensional array with an index. Slicing one column from a DataFrame gives you a Series; combining multiple Series produces a DataFrame. Series operations are vectorised and align automatically on the index, so adding two Series with different row orders still gives correct results. Learn more → See also: Series · Indexing & slicing
- Set
- An unordered collection of unique elements that supports fast membership tests and set-algebra operations — union, intersection, difference. Adding a duplicate to a set simply does nothing. Use a set when you need to eliminate duplicates or ask 'is this value in the collection?' on large data efficiently. See also: Sets · Lists, Tuples, Dicts, Sets & Gotchas
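A few of the operations mentioned above, using Python's built-in `set`. Results are passed through `sorted()` only to make the printed order predictable.

```python
emails = ["a@x.com", "b@x.com", "a@x.com"]
unique = set(emails)  # the duplicate is dropped
print(len(unique))          # 2
print("b@x.com" in unique)  # True: a fast membership test

left = {1, 2, 3}
right = {3, 4}
print(sorted(left | right))  # [1, 2, 3, 4]  union
print(sorted(left & right))  # [3]           intersection
print(sorted(left - right))  # [1, 2]        difference
```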
- SettingWithCopyWarning
- A pandas warning raised when you assign to an object that may be a temporary copy of part of a DataFrame rather than the original, so it is unclear whether the original will change. Writing `df[df.age > 18]['score'] = 0` is chained indexing: the first step returns a copy, so the assignment can silently leave `df` unchanged. The fix is to assign in one step with `.loc`, as in `df.loc[df.age > 18, 'score'] = 0`, or to call `.copy()` explicitly when you really want an independent DataFrame. Learn more → See also: Memory optimization · When O(n²) Kills Your DataFrame
- SGD
- Stochastic Gradient Descent updates weights using the gradient computed on a small random subset of the data rather than the full dataset. The randomness introduces noise that can help the model escape sharp local minima, and with momentum and a good learning-rate schedule it remains competitive with adaptive optimizers on many vision tasks. Learn more → See also: Gradient Descent (One Step) · Gradient descent
- Shuffle
- Often the most expensive operation in distributed processing: redistributing rows across worker nodes so that all records sharing the same key end up on the same machine before a join or aggregation. Excessive shuffling is one of the most common causes of slow Spark jobs. See also: Shuffles · Skew & salting
- Sigmoid
- An S-shaped function that squashes any real number into the range (0, 1), making it natural for predicting probabilities from a single neuron. Its very flat tails cause vanishing gradients in deep networks, which is why it has largely been replaced by ReLU in hidden layers while remaining popular in output layers for binary classification. Learn more → See also: Logistic Regression · Logistic regression
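The "flat tails" point can be checked numerically. The sketch below uses the standard identity that the sigmoid's derivative is s(x)(1 - s(x)); the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


print(sigmoid(0.0))               # 0.5
print(sigmoid_grad(0.0))          # 0.25, the largest the gradient ever gets
print(sigmoid_grad(10.0) < 1e-4)  # True: almost no gradient in the tail
```

Since the gradient never exceeds 0.25, multiplying many such factors through a deep stack of sigmoid layers shrinks the signal quickly.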
- Simpson's Paradox
- A statistical phenomenon where a trend that appears in several separate groups reverses or disappears when the groups are combined. It typically arises when a lurking variable (a confounder) is distributed unevenly across groups, misleading aggregated comparisons. Learn more → See also: Trend, seasonality & decomposition · VAR (multivariate)
- Slowly Changing Dimension (SCD)
- A dimension whose attribute values change infrequently over time — for example, a customer's address or job title. SCD techniques define whether the old value is overwritten (Type 1), preserved in a new row with validity dates (Type 2), or stored alongside the current value (Type 3). Learn more → See also: Dimensional Modeling · Change Data Capture (CDC)
- Snowflake Schema
- An extension of the star schema where dimension tables are normalized into sub-dimensions, producing a branching structure that resembles a snowflake. It reduces data redundancy but requires more joins per query compared to a star schema. Learn more → See also: Star vs Snowflake Schemas · Schemas
- Softmax
- A function applied to a vector of raw scores that converts them into a valid probability distribution: all outputs are positive and sum to exactly 1. The highest score gets amplified disproportionately, making the predicted class stand out, and the result can be directly compared to a one-hot label using cross-entropy loss. Learn more → See also: Logistic regression · Logistic Regression
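A minimal pure-Python sketch. Subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that are positive and sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])

print([round(p, 3) for p in probs])  # approximately [0.659, 0.242, 0.099]
print(abs(sum(probs) - 1.0) < 1e-9)  # True
```

The top score is only twice the second one, yet it receives well over twice the probability, which is the disproportionate amplification described above.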
- Spark
- An open-source distributed processing engine that executes large-scale data transformations across a cluster of machines by holding intermediate results in memory where possible rather than writing them to disk between steps. It supports SQL, streaming, machine learning, and graph processing through a unified API. See also: The Spark ecosystem · Shuffles
- Star Schema
- A dimensional model layout with one central fact table joined to multiple flat dimension tables. The diagram looks like a star — the fact table at the hub, dimension tables as points — and the design enables simple, fast analytical queries. Learn more → See also: Star vs Snowflake Schemas · Slowly Changing Dimensions
- Stationarity
- A time series is stationary when its statistical properties (mean, variance, autocorrelation) do not change over time. Most classical forecasting models assume stationarity because a shifting mean makes past patterns unreliable guides to the future. Non-stationary series are commonly made stationary by differencing (subtracting consecutive values) to remove a trend, often after a log transform to stabilize a variance that grows with the level of the series. Learn more → See also: Trend, seasonality & decomposition · Why time series is different
- Streaming
- A data processing model where records are processed continuously and incrementally as they arrive, with results updated in near-real time rather than waiting for an entire dataset to accumulate. Streaming systems trade some throughput for low end-to-end latency. See also: Streaming responses · Change Data Capture (CDC)
- Subquery
- A SELECT statement embedded inside another SQL statement, used to compute an intermediate result that the outer query then filters or joins against. When the inner query references columns from the outer query, it is called a correlated subquery. Learn more → See also: INNER JOIN · CASE expressions
- Surrogate Key
- A system-generated identifier — often an auto-incrementing integer or UUID — assigned to each row with no business meaning. It remains stable even if the natural business identifier changes, making it safer to use as a join key across time. Learn more → See also: Keys & Integrity Constraints · Slowly Changing Dimensions
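To make the Subquery and Surrogate Key entries above concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical, invented only for illustration; the `id` column plays the role of a surrogate key, and the inner query is correlated because it references the outer row's `customer`.

```python
import sqlite3

# Hypothetical table and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    # id is a system-generated surrogate key with no business meaning.
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ana", 10.0), ("ana", 30.0), ("ben", 5.0), ("ben", 7.0), ("ben", 50.0)],
)

# Correlated subquery: the inner SELECT references o.customer from the outer
# row, so it computes that customer's own average order amount.
rows = conn.execute(
    """
    SELECT o.id, o.customer, o.amount
    FROM orders AS o
    WHERE o.amount > (
        SELECT AVG(i.amount) FROM orders AS i WHERE i.customer = o.customer
    )
    ORDER BY o.id
    """
).fetchall()
conn.close()

print(rows)  # orders that exceed their own customer's average
```

Only the orders larger than their customer's average survive the filter, which is the kind of per-row comparison an uncorrelated subquery cannot express on its own.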
T
- Target Variable
- The outcome a model is built to predict — used interchangeably with 'label' in supervised learning. Choosing the right target is a modelling decision, not a data decision: predicting 'will this customer churn in 30 days?' and 'how many days until this customer churns?' require different targets and different model types even from the same dataset. See also: Logistic regression · Drift & monitoring
- Temperature
- A scalar that divides a language model's raw scores before softmax, controlling how sharply the probability distribution is peaked over possible next tokens: low temperature makes the model nearly deterministic and conservative, while high temperature flattens the distribution and introduces more variety and surprise. A temperature of 0 is conventionally treated as greedy decoding, which always picks the most likely token; values above 1 can produce creative but increasingly incoherent text. See also: Sampling: temperature, top-k, top-p · Softmax
- The GIL (Global Interpreter Lock)
- A mutex inside CPython that allows only one thread to execute Python bytecode at a time, even on multi-core machines. This keeps the interpreter's internals, such as reference counting, simple and thread-safe, but it means pure-Python threads cannot speed up CPU-bound work. I/O-bound tasks (network calls, disk reads) still benefit from threading because the GIL is released while waiting. Learn more → See also: Multiprocessing · Threading
- Time Series
- A sequence of measurements recorded at successive, usually equally spaced points in time — stock prices, hourly temperatures, monthly sales. The defining feature is that observations are not independent: yesterday's value predicts today's. Analysing time series requires techniques that account for trend, seasonality, and autocorrelation. See also: Why time series is different · Trend, seasonality & decomposition
- Tokenization
- The process of splitting raw text into the discrete units—tokens—that a language model consumes. Tokens are neither words nor characters but something in between: common words become single tokens while rare words are split into subword pieces, balancing vocabulary size against the ability to handle unseen terms. See also: Tokenization · Hugging Face transformers
- Top-p Sampling
- A decoding strategy that at each step considers only the smallest set of tokens whose cumulative probability exceeds a threshold p, discarding the long tail of unlikely options before sampling. Unlike top-k, the number of candidates changes dynamically with the distribution, avoiding situations where the model samples from near-random low-probability tokens. See also: Sampling: temperature, top-k, top-p · Sampling & Reservoir Sampling
- Train-Test Split
- The practice of partitioning a dataset into a training portion (used to fit the model) and a test portion (used only to measure final performance). The test set must stay locked away during development; peeking at it — even to choose hyperparameters — inflates apparent performance. Learn more → See also: Supervised vs Unsupervised; Train/Test · Cross-Validation: k-fold, LOO, Stratified
- Transaction
- A sequence of database operations bracketed by BEGIN and COMMIT that the engine treats as a single indivisible unit. If any step fails, a ROLLBACK undoes every preceding step in that sequence as if none of it happened. See also: Joins & Division · Sessionization
- Transfer Learning
- The practice of reusing a model trained on one large task as a starting point for a different, usually smaller task, on the premise that general representations learned early transfer to the new domain. It is why pre-trained language and vision models dominate: a few thousand labeled examples plus a strong prior often match or beat a far larger dataset trained from random initialization. See also: Distillation · Supervised vs Unsupervised; Train/Test
- Transformer
- An architecture that replaces sequential recurrence with self-attention, letting every token directly attend to every other token in a sequence in parallel. Transformers train much faster on modern hardware than RNNs and have become the dominant architecture for language, vision, and multimodal models. Learn more → See also: Multi-head attention · BERT, GPT, T5
- Tuple
- An immutable, ordered sequence of values, usually written with parentheses: `(1, 'a', 3.0)`. Because tuples cannot be changed, Python can store them more compactly than lists, and a tuple whose elements are all hashable can be used as a dictionary key or set member. They are the natural type for a fixed record such as a (latitude, longitude) coordinate pair. See also: Tuples · Lists, Tuples, Dicts, Sets & Gotchas
- Type Hint
- Optional annotations that tell readers (and tools) what type a variable, parameter, or return value should be: `def add(x: int, y: int) -> int`. Python does not enforce them at runtime, but type-checkers like mypy and IDE autocomplete use them to catch mistakes early. They make large codebases dramatically easier to maintain. Learn more → See also: Functions · Pydantic v2
- Type I Error
- Rejecting a null hypothesis that is actually true — a false positive. In an A/B test, this means declaring a winner when the two variants perform identically. The significance level alpha (commonly 0.05) is the maximum Type I error rate you are willing to tolerate. Learn more → See also: A/B testing · z-test, t-test & chi-squared test
- Type II Error
- Failing to reject a null hypothesis that is actually false: a false negative. In an A/B test, this means missing a real improvement. The Type II error rate is written beta, and statistical power is 1 − beta. For a given effect size, reducing Type II error requires a larger sample size, less noisy measurements, or accepting a higher false positive rate. Learn more → See also: A/B testing · z-test, t-test & chi-squared test
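The Temperature and Top-p Sampling entries in this section can be sketched together in a few lines of plain Python. This is an illustrative sketch, not any particular library's implementation: the function names `softmax` and `top_p_filter` and the example `logits` are assumptions made up for the demonstration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; temperature rescales the scores first."""
    if temperature <= 0:
        # Temperature 0 is usually handled separately as greedy (argmax) decoding.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving candidates form a valid distribution.
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical next-token scores
sharp = softmax(logits, temperature=0.5)  # more peaked
flat = softmax(logits, temperature=2.0)   # flatter
nucleus = top_p_filter(softmax(logits), p=0.9)

print(sharp[0] > flat[0])  # the top token gains probability at low temperature
print(sorted(nucleus))     # indices of tokens that survive the top-p cut
```

A sampler would then draw the next token from `nucleus`; with a more peaked distribution the same `p` keeps fewer candidates, which is the dynamic behaviour the Top-p Sampling entry describes.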
U
- Underfitting
- A model too simple to capture the real structure in the data — it performs poorly even on training examples because it has not learned enough. The remedy is adding more features, increasing model complexity, or training for longer. Learn more → See also: Training-Serving Skew · Data-centric AI
V
- VAE
- A Variational Autoencoder learns to encode inputs into a probability distribution over a compact latent space rather than a single point, then decode samples from that distribution back to data. The probabilistic bottleneck forces the latent space to be smooth and continuous, enabling controlled generation by sampling or interpolating in latent space. Learn more → See also: Embeddings · VAR (multivariate)
- Vanishing Gradient
- A training failure where gradients shrink exponentially as they travel backward through many layers, leaving early layers learning almost nothing while later layers update normally. It was the main obstacle to training deep networks before ReLU activations, residual connections, and careful initialization became standard. See also: Gradient descent · Gradient Descent (One Step)
- Vector Database
- A storage system designed to index and query dense embedding vectors efficiently, typically using approximate nearest-neighbor search to query millions of items in milliseconds. It is the retrieval backbone of RAG pipelines: documents are embedded at ingest time, and at query time the closest embeddings to the question embedding are fetched as context. Learn more → See also: Vector databases · Embeddings
- Vectorization
- The practice of replacing Python loops with array-level operations that NumPy or pandas execute in compiled C code. `array * 2` multiplies every element in a single call instead of iterating in Python. Vectorized code is often 10–100× faster than the equivalent loop because it avoids Python's per-iteration overhead. Learn more → See also: Vectorization vs Loops · Why NumPy
- Virtual Environment
- An isolated folder containing a private Python interpreter and its own package installations, created with `python -m venv`. It prevents version conflicts between projects: project A can use pandas 1.5 while project B uses pandas 2.0 on the same machine. Activating an environment puts its `python` and `pip` commands first on your PATH. See also: Environments & packaging: venv, uv, pex · Getting Started
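The retrieval step described in the Vector Database entry of this section can be illustrated with a brute-force search in plain Python. This is a sketch under stated assumptions: the three-dimensional "embeddings", the document ids, and the helper names `cosine_similarity` and `nearest` are all hypothetical, and a real vector database would replace the exhaustive scan with an index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, index, k=2):
    """Exhaustively score every stored vector and return the k closest ids."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy embeddings; real ones have hundreds of dimensions.
index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

top = nearest(query, index, k=2)
print(top)  # ids of the two stored vectors most similar to the query
```

The scan costs one similarity computation per stored vector, which is why dedicated systems build indexes once the collection grows large.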
W
- Window Function
- A SQL function that computes a value for each row by looking at a "window" of related rows, defined by an OVER clause, without collapsing them into a single group. It lets you rank, compute running totals, or compare a row to its neighbors while keeping every row in the output. Learn more → See also: Window functions · Ranking functions
- Word2Vec
- A family of shallow neural networks trained to predict a word from its neighbors (or vice versa), with the side effect that the learned weight vectors capture meaningful semantic relationships. The famous result is that the vector arithmetic king − man + woman ≈ queen emerges purely from predicting context in a large corpus. See also: Embeddings
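The Window Function entry in this section can be tried out with Python's built-in `sqlite3` module, assuming the bundled SQLite is version 3.25 or newer (the release that added window functions). The `sales` table and its rows are hypothetical, invented only for illustration.

```python
import sqlite3

# Hypothetical table and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", 1, 10), ("north", 2, 20), ("north", 3, 5),
     ("south", 1, 7), ("south", 2, 3)],
)

# Each OVER clause defines the window: rows in the same region, in a given order.
# Every input row is kept; the window functions add columns instead of grouping.
rows = conn.execute(
    """
    SELECT region, day, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank
    FROM sales
    ORDER BY region, day
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

All five input rows appear in the output, each with its region's running total and rank attached, whereas a GROUP BY on `region` would have returned only two rows.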