Blog Production AI · GATE DA · how concepts actually work

From theory to the systems that demand real understanding.

Long-form pieces on how production-AI teams orchestrate agents, serve models, and build the stack, plus GATE DA essays that turn probability, linear algebra, DBMS, ML, and AI into durable concepts.

369 posts · 13 categories

Latest LLMs · Jun 18, 2026

Activation checkpointing makes GPU memory a scheduling decision

Large-model training is constrained by more than weights. Activation checkpointing changes which forward tensors survive, trading recomputation for a smaller peak-memory footprint.

9 min read

RAG Jun 18, 2026

RAG poisoning is an evidence-integrity problem

A retrieval system can reason perfectly from corrupted evidence. Defending RAG means governing what enters the corpus, preserving provenance, isolating tenants, and treating retrieved text as untrusted data.

10 min read

Infrastructure Jun 18, 2026

Six system-design boundaries that prevent category mistakes

Stateless vs stateful, Lambda vs ECS, database vs cache, queue vs stream, retrieval vs reranking, and monitoring vs tracing—explained as operational contracts.

10 min read

Agents Jun 18, 2026

Token theft in AI agents is an architecture failure

Prompt injection gets the attention, but credentials turn a confused model into an authenticated attacker. The fix is to keep tokens out of model context and authorize every action at runtime.

11 min read

Infrastructure Jun 12, 2026

How vLLM actually serves a 7B model

Follow one request through vLLM — the scheduler, the KV-cache blocks, prefill vs decode, and what happens when 90,000 tokens of cache no longer fit.

11 min read

Infrastructure Jun 12, 2026

The vector-search memory wall: why HNSW eats RAM, and how quantization cuts the bill 32×

HNSW is fast because the whole graph lives in RAM — which is exactly why it gets expensive. At 100M vectors you're paying for ~600GB of memory before you serve a single query. Here's the math, and how binary quantization plus reranking is rewriting the cost model in 2026.

11 min read

LLMs Jun 11, 2026

MHA → MQA → GQA → MLA: the attention efficiency ladder

MHA, MQA, GQA, MLA — every rung shrinks the KV cache that bottlenecks LLM inference. What each step gives up to make models cheaper to run.

9 min read

Patterns Jun 10, 2026

AutoML raised the floor, not the ceiling

AutoGluon now tops the AutoML benchmark and beats rivals given a fraction of the time. That makes a strong baseline cheap — but the features and framing that actually win are exactly what AutoML can't automate.

6 min read

Agents Jun 10, 2026

Context engineering: why your agent gets dumber as its context grows

For multi-step agents the job shifted from writing the perfect prompt to curating the smallest high-signal set of tokens at every step. Because of context rot, more context literally makes agents worse.

6 min read

Infrastructure Jun 10, 2026

Don't auto-ship retrained models: collapse, feedback, and the challenger gate

Retraining can produce a worse model — from bad data, a pipeline bug, or a model quietly learning from its own outputs. Champion-challenger is how you automate retraining without ever shipping a regression.

6 min read

Patterns Jun 10, 2026

Feature engineering still beats the algorithm (even after TabPFN)

Tree-based models remain state-of-the-art on tabular data, and the biggest gains come from the features, not the model. Even as a Nature-published foundation model finally challenges gradient boosting, the data-centric lesson holds.

6 min read

Infrastructure Jun 10, 2026

The GPU isn't the bottleneck: why LLM serving is a memory problem

During generation the GPU spends most of its time moving the KV cache, not doing math. Whoever wastes the least memory serves the most users — which is why PagedAttention and continuous batching changed everything.

7 min read

Infrastructure Jun 10, 2026

Loading a model file can run code: MLSecOps in 2026

Downloading a model and calling load() is as dangerous as running a random shell script. Real malicious models have shipped on public hubs — here's the ML attack surface and the defenses that belong in your pipeline.

7 min read

Agents Jun 10, 2026

MCP won the tool-integration war — now comes the hard part

The Model Context Protocol became the USB-C of AI tools, and once OpenAI, Google, and Microsoft all adopted it the integration wars ended. 2026 is about making it enterprise-grade: stateless HTTP, a registry, apps, and long-running tasks.

7 min read

Infrastructure Jun 10, 2026

Most of your traffic is easy: cut LLM bills 40–85% with routing

Real query traffic is mostly simple, yet teams send everything to the most expensive model. Routing easy queries to cheap models, escalating only the hard ones, and caching repeats captures most of the quality at a fraction of the cost.

6 min read

Infrastructure Jun 10, 2026

Most predictions don't need real-time (and batch is 100x cheaper)

The default mental model of 'serving a model' is a live API answering in milliseconds. For most use cases that's the expensive wrong choice — and the hybrid precompute-to-Redis pattern gives you batch economics with real-time latency.

6 min read

Patterns Jun 10, 2026

Peeking is why your A/B test lies (and CUPED is the fix)

Watching a live experiment and stopping the moment it looks significant can push your false-positive rate from 5% to over 26%. The discipline — and the variance-reduction trick — that make online tests trustworthy.

6 min read

Agents Jun 10, 2026

Prompt injection is the SQL injection of the AI era

Both attacks come from the same root cause — the system can't separate trusted instructions from untrusted data. In agents it becomes a confused-deputy problem, and a single filter won't save you.

7 min read

LLMs Jun 10, 2026

Running an LLM on your laptop: GGUF, llama.cpp, and the Q-soup

A 7B model now fits on a laptop. GGUF, llama.cpp, and quantization tiers like Q4_K_M — decoded, so running LLMs locally stops being intimidating.

9 min read

Infrastructure Jun 10, 2026

Most ML failures are silent: the case for data contracts

An ML pipeline rarely crashes when the data goes wrong — it keeps serving confidently wrong predictions. Data contracts make a schema or semantics breach fail loudly at the source, before it poisons a retraining job.

6 min read

Patterns Jun 10, 2026

You can't be fair three ways — and the EU AI Act clock is ticking

A landmark result proves demographic parity, equalized odds, and calibration can't all hold when base rates differ. With high-risk obligations applying from August 2026, fairness is now an engineering pipeline, not a footnote.

7 min read

Infrastructure Jun 10, 2026

Your GPUs are mostly idle: FinOps for the AI era

AI spending is heading past $2 trillion while production GPU fleets often run under 50% utilization. The biggest lever on an ML bill isn't a cheaper price — it's the idle silicon you're already paying for.

6 min read

Patterns Jun 10, 2026

Your t-SNE plot is lying to you (three ways)

t-SNE and UMAP reveal clusters PCA hides — but cluster sizes, the gaps between clusters, and even the shapes are often artifacts. How to read these plots without fooling yourself, and when to reach for UMAP instead.

6 min read

Infrastructure Jun 8, 2026

AI's real bottleneck isn't intelligence — it's electricity

AI's hardest 2026 limit isn't chips or money — it's electricity. Why data-center power is the bottleneck, and how inference makes every AI query an energy cost.

7 min read

LLMs Jun 8, 2026

Attention is O(n²) — and Mamba's linear escape

Attention costs O(n²), so long context gets expensive fast. State-space models like Mamba scale linearly — and 2026's winning architectures are hybrids of both.

9 min read

LLMs Jun 8, 2026

Beyond next-token: world models and the next paradigm

World models predict the next state of the world, not the next token — making them simulators agents can plan inside. The two camps racing past LLMs in 2026.

8 min read

LLMs Jun 8, 2026

Diffusion language models: when AI writes text all at once

Diffusion language models generate text all at once, refining a masked sequence over a few parallel steps and hitting 1,000+ tokens/sec, versus left-to-right LLMs that emit one token per step.

8 min read

LLMs Jun 8, 2026

Reading a model's mind: sparse autoencoders explained

Sparse autoencoders pull human-readable features out of an LLM's tangled activations — the breakthrough tool of mechanistic interpretability.

8 min read

LLMs Jun 8, 2026

o3-level reasoning on your laptop: how distillation works

Reasoning distillation trains a small model on a big model's chains of thought — putting o3-level reasoning on a laptop, and beating bigger models.

8 min read

LLMs Jun 8, 2026

RLHF is being replaced: how DPO teaches models what good means

RLHF aligned chat models with a reward model and a fragile RL loop. DPO drops both, learning the same preferences directly from chosen-vs-rejected pairs.

8 min read

LLMs Jun 8, 2026

The big-model era is ending: the rise of small models

Small language models fine-tuned for a task now beat giants on it — on your laptop or phone, cheaper and private. Why bigger isn't always better in 2026.

7 min read

LLMs Jun 8, 2026

Why your AI can't learn after training: catastrophic forgetting

A trained model's weights are frozen, and fine-tuning erases old skills — catastrophic forgetting. Why continual learning is AI's open problem in 2026.

8 min read

LLMs Jun 8, 2026

Test-time compute: why thinking longer beats thinking bigger

Test-time compute lets a model think before answering, so a small reasoning model can beat a far larger one. How inference-time scaling works.

10 min read

Business Analytics Jun 7, 2026

A/B testing in practice: sample size, p-values, and the traps

A/B testing done wrong wastes months of effort. Master p-values, sample size, and the six traps (peeking, sample ratio mismatch, and more) before your next experiment.

13 min read

Business Analytics Jun 7, 2026

Cohort analysis: how to actually read a retention curve

Cohort analysis reveals what aggregate retention metrics hide — learn to build a cohort table, read a retention curve, and spot product-market fit signals.

12 min read

Time Series Jun 7, 2026

Evaluating forecasts: MAE, RMSE, MAPE, and honest backtesting

A practical guide to forecast accuracy using MAE, RMSE, MAPE, and MASE, plus rolling-origin backtesting to avoid self-deception in time series.

12 min read

CLI Jun 7, 2026

find and xargs: bulk file operations without fear

Master the find command and xargs for safe, efficient bulk file operations: handle spaces in filenames, batch deletes, renames, and parallel processing.

11 min read

Time Series Jun 7, 2026

The forecasting baselines that quietly beat fancy models

Why forecasting baselines like the naive and seasonal-naive forecasts often outperform complex models, and how to pick the right one before you build anything fancy.

11 min read

Business Analytics Jun 7, 2026

Funnel analysis: finding exactly where users drop off

Master funnel analysis to pinpoint conversion leaks, prioritize fixes by impact, and turn step-by-step drop-off data into real growth.

11 min read

Git Jun 7, 2026

git bisect: find the commit that broke it in log(n) steps

Use git bisect to binary-search your history for the exact commit that introduced a regression, in O(log n) steps instead of O(n).

10 min read

Git Jun 7, 2026

Branching strategies that scale: trunk-based vs Git Flow vs GitHub Flow

Compare every major branching strategy: trunk-based development, Git Flow, and GitHub Flow — and know which fits your release cadence.

12 min read

Git Jun 7, 2026

Git merge vs rebase: when to use which (without wrecking history)

Git merge and rebase both integrate branches but produce different histories. Learn when each is right, the golden rule, and how to recover when things go wrong.

12 min read

Git Jun 7, 2026

Git's three trees: the mental model that makes Git click

Understand Git's working directory, staging area, and HEAD so every Git command finally makes sense: no more mystery, no more fear.

11 min read

CLI Jun 7, 2026

grep, sed, awk: the text-processing trio worth mastering

grep, sed, and awk for command-line text processing, explained: when to use each tool, real recipes, regex fundamentals, and how pipelines compose all three.

13 min read

Business Analytics Jun 7, 2026

LTV and CAC: the unit economics every analyst should model

Master LTV and CAC unit economics: correct formulas, cohort methods, payback periods, and the pitfalls that make most models wrong.

12 min read

Business Analytics Jun 7, 2026

North Star metrics: the one number that actually moves a business

What a North Star metric is, how to choose one, and why a single well-chosen number beats a dashboard of 40 KPIs.

11 min read

Time Series Jun 7, 2026

Stationarity, differencing, and why ARIMA needs a flat series

Understand stationarity, why differencing transforms a trending series, and how the d in ARIMA(p,d,q) bridges raw data to a forecastable model.

13 min read

Time Series Jun 7, 2026

Decomposition: reading trend, seasonality, and residual

A practical guide to time series decomposition — separating trend, seasonality, and residual to reveal what a signal is actually doing.

11 min read

CLI Jun 7, 2026

Understanding $PATH: how the shell actually finds your commands

Demystify the PATH environment variable and 'command not found': how the shell searches, why order matters, and how to manage it safely.

10 min read

Archive 320 earlier posts