Question 1

How do you choose between batch and real-time inference for a model?

Accepted Answer

Decide based on how fresh the prediction must be versus the cost and complexity of serving live. Use batch when results are needed every few hours or days, like daily churn lists, because it is cheap, simple, and can use spot or scheduled compute. Use real-time when a late or stale decision causes immediate loss, like fraud or ad auctions needing sub-100ms responses, accepting higher cost and complexity. Most production systems are hybrid: precompute heavy signals offline and do lightweight re-ranking online.

Question 2

Why isn't a git commit enough to reproduce an ML training run?

Accepted Answer

A git commit captures code, but an ML run also depends on the exact training data, hyperparameters, environment, and randomness, which Git does not capture unless you deliberately record them. Datasets are too large for Git and change independently of code, so you need a data-versioning tool like DVC or lakeFS to pin a content hash or data commit to the code commit. Full reproducibility means versioning code, data, config, environment, and seeds together and linking them.
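One way to picture "linking them" is a small run manifest written next to every trained model. The sketch below is only an illustration: the field names, the `run_id` scheme, and the commit string are assumptions, and a real setup would also record an environment lockfile hash and would normally delegate data hashing to a tool like DVC.

```python
import hashlib
import json


def content_hash(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying the exact bytes of a dataset."""
    return hashlib.sha256(data).hexdigest()


def build_run_manifest(git_commit: str, dataset: bytes, config: dict, seed: int) -> dict:
    """Link code, data, config, and seed in one record (hypothetical layout)."""
    manifest = {
        "git_commit": git_commit,
        "data_sha256": content_hash(dataset),
        "config": config,
        "seed": seed,
    }
    # Canonical JSON (sorted keys) so identical inputs always give the same id.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["run_id"] = hashlib.sha256(canonical).hexdigest()[:12]
    return manifest


# Same commit, config, and seed, but one label differs in the data.
run_a = build_run_manifest("abc1234", b"id,label\n1,0\n", {"lr": 0.01}, seed=42)
run_b = build_run_manifest("abc1234", b"id,label\n1,1\n", {"lr": 0.01}, seed=42)
print(run_a["run_id"], run_b["run_id"])
```

The two runs share a commit yet get different ids, which is exactly the case a bare git hash cannot distinguish.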

Question 3

What does experiment tracking solve, and how do MLflow and Weights and Biases differ in practice?

Accepted Answer

Experiment tracking captures the full reproducibility context of a training run (code version, hyperparameters, dataset hash, environment, and metrics) so any result can be reproduced and compared. MLflow is an open-source lifecycle platform that teams typically self-host, though managed offerings exist; Weights and Biases is primarily a hosted, collaboration-first product with richer real-time visualisation.

Question 4

Walk me through the full ML lifecycle from problem definition to model retirement.

Accepted Answer

The ML lifecycle spans eight phases: problem framing, data collection and validation, feature engineering, training and experimentation, offline evaluation, deployment, production monitoring, and retirement or retraining. Each phase has distinct owners, artefacts, and failure modes that an MLOps practice must systematise.

Question 5

What's the difference between experiment tracking and a model registry, and why do you need both?

Accepted Answer

Experiment tracking logs every run, its parameters, metrics, and artifacts, so you can compare and reproduce experiments during development. A model registry is the curated, governed catalog of the few models you actually intend to deploy, with versioning, stage or alias management, approvals, and lineage. You need both because tracking gives breadth for exploration while the registry gives the controlled, auditable path to production.

Question 6

Why use a pipeline orchestrator like Airflow or Kubeflow instead of cron scripts for ML workflows?

Accepted Answer

ML workflows are multi-step DAGs with dependencies, and an orchestrator gives you dependency management, retries, backfills, caching, observability, and lineage that chained cron jobs cannot. Airflow is a general-purpose task orchestrator defining DAGs in Python, while Kubeflow Pipelines is ML-native, passing typed artifacts between containerized steps on Kubernetes with conditional logic like deploy only if accuracy exceeds a threshold. Choosing depends on whether you need generic scheduling or ML-specific, container-based pipelines.

Question 7

How do you decide if a new model is actually better in production?

Accepted Answer

Offline metrics often don't predict business impact, so you run a controlled online experiment: split live traffic between the current champion and the new challenger and compare a pre-registered business metric with a statistical significance test. You size the test for adequate power, watch guardrail metrics like latency and errors, and only ship if the lift is statistically and practically significant. Variance-reduction techniques like CUPED let you reach significance faster.
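For a conversion-style metric, the significance check is often a two-proportion z-test. The sketch below uses only the standard library and assumes independent users, a binary outcome, and samples large enough for the normal approximation; the counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for rate_b - rate_a using a pooled standard error."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: champion converts 1000/10000, challenger 1100/10000.
z, p = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f"z={z:.2f}, p={p:.4f}")
```

Statistical significance is only half of the ship decision: the observed lift still has to clear the practical threshold you pre-registered, and the guardrail metrics have to hold.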

Question 8

How does autoscaling work for ML inference services, and what metrics should drive it?

Accepted Answer

ML inference services should scale on request queue depth or GPU utilization rather than CPU utilization alone, because GPU-bound workloads can leave CPU near-idle even under full load. The Kubernetes Horizontal Pod Autoscaler can be configured with custom metrics. Scale-to-zero saves cost on idle services but introduces cold starts, so pair it with a warm-up buffer of ready replicas to prevent latency spikes.
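The core of the Horizontal Pod Autoscaler calculation is a simple ratio: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that ratio to a per-pod queue-depth metric. It is a simplification: the real controller also applies a tolerance band and stabilization windows, and the replica bounds here are arbitrary assumptions.

```python
from math import ceil


def desired_replicas(current_replicas: int, current_per_pod: float, target_per_pod: float,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Scale so the average per-pod metric moves back toward its target."""
    raw = ceil(current_replicas * current_per_pod / target_per_pod)
    # Clamp to the configured bounds, as an autoscaler would.
    return max(min_replicas, min(max_replicas, raw))


# 4 pods each holding 30 queued requests, with a target of 10 per pod.
print(desired_replicas(4, current_per_pod=30, target_per_pod=10))
```

Keeping `min_replicas` at 1 or more is the simplest form of a warm-up buffer: the service never has to start from zero when traffic returns.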

Question 9

When would you choose AWS Lambda instead of ECS, and when would you choose ECS?

Accepted Answer

Choose Lambda for short, event-driven, bursty work that fits its runtime and packaging constraints and keeps durable state external. Choose ECS for long-running APIs or workers, custom containers, stable processes, and workloads needing explicit CPU, memory, networking, or runtime control. ECS can host stateless or stateful software; critical state should still be externally durable.

Question 10

What are the differences between batch, online, and streaming inference, and when should you use each?

Accepted Answer

Batch inference runs predictions on large datasets on a schedule, optimizing for throughput. Online inference serves individual requests in real time, optimizing for low latency. Streaming inference scores events continuously as they arrive on a stream, with bounded latency requirements that fall between the two extremes.

Question 11

What is training-serving skew, and how does a feature store help prevent it?

Accepted Answer

Training-serving skew is any mismatch between how features are computed during training and how they are computed at serving time, which silently degrades a model that looked fine offline. It arises when offline and online feature logic are implemented separately, for example a rolling average computed over a different window in each path. A feature store prevents it by keeping a single feature definition used for both batch training and online serving, so the same values and logic apply in both, and it supports point-in-time-correct retrieval to avoid leakage.
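Point-in-time-correct retrieval can be sketched as "give me the latest feature value known at or before the label's timestamp". The example below is a toy in-memory version with hypothetical integer timestamps; a real feature store does the same join at scale across many entities.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def point_in_time_value(history: List[Tuple[int, float]], as_of: int) -> Optional[float]:
    """history holds (timestamp, value) pairs sorted by timestamp.

    Returns the most recent value whose timestamp is <= as_of, or None if the
    feature did not exist yet. Values written after as_of are never visible,
    which is what prevents leakage from the future into training rows.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


# Hypothetical rolling-average feature, updated at t=100, 200, 300.
avg_spend = [(100, 1.0), (200, 2.5), (300, 4.0)]
print(point_in_time_value(avg_spend, as_of=250))
```

A training example labelled at t=250 sees the value written at t=200, not the later t=300 update that a naive "latest value" join would pick up.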

Question 12

How does CI/CD for ML differ from standard software CI/CD, and what stages should an ML pipeline include?

Accepted Answer

ML CI/CD must validate not just code correctness but also model quality. On top of the usual build and test stages, it adds data validation, automated retraining triggers, model evaluation gates, and canary checks on model metrics, most of which have no direct equivalent in standard software pipelines. A regression in model AUC is as much a deployment failure as a 500 error.

Question 13

How do you evolve a data schema without breaking downstream ML consumers?

Accepted Answer

Use a schema registry with backward-compatible evolution rules so changes are managed rather than ad hoc: producers can add optional or nullable fields and consumers ignore unknown fields, which keeps existing pipelines working. Breaking changes such as renaming, removing, or retyping a field require versioning, often a new topic or table, with a migration window and deprecation before the old schema is retired. This lets data evolve continuously while ML features and models stay stable.

Question 14

What is a data contract, and how does it prevent ML pipelines from breaking silently?

Accepted Answer

A data contract is an explicit, enforced agreement between a data producer and consumers that specifies schema, types, semantics, and quality or freshness expectations, plus rules for how it can evolve. It prevents silent breakage by validating data at ingestion so violations are caught and quarantined or alerted instead of flowing into the model. Combined with a schema registry and backward-compatible evolution rules, it lets producers change data without unexpectedly corrupting downstream features and predictions.
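Validation at ingestion can be as small as a type check per field with a quarantine list for violations. The contract below is a hypothetical example covering only presence and type; real contracts also cover semantics, ranges, and freshness, and are usually enforced with a dedicated validation library.

```python
# Hypothetical contract: field name -> required Python type.
CONTRACT = {"user_id": int, "amount": float, "country": str}


def validate(record: dict) -> list:
    """Return a list of human-readable violations (empty means the record passes)."""
    errors = []
    for field, expected_type in CONTRACT.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors


def ingest(records: list) -> tuple:
    """Split records into accepted rows and quarantined (row, errors) pairs."""
    accepted, quarantined = [], []
    for record in records:
        errors = validate(record)
        if errors:
            quarantined.append((record, errors))
        else:
            accepted.append(record)
    return accepted, quarantined


accepted, quarantined = ingest([
    {"user_id": 1, "amount": 9.5, "country": "DE"},
    {"user_id": "2", "amount": 3.0, "country": "FR"},   # user_id retyped upstream
    {"user_id": 3, "amount": 1.0},                      # country dropped upstream
])
print(len(accepted), len(quarantined))
```

Extra fields are ignored on purpose, which matches the backward-compatible rule that producers may add fields without breaking consumers.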

Question 15

What is the difference between data drift, concept drift, and label drift — and how do you detect each?

Accepted Answer

Data drift is a change in the statistical distribution of model inputs; concept drift is a change in the relationship between inputs and the target; label drift is a shift in the marginal distribution of the target itself. They require different detectors and carry different business urgency.
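For data drift or label drift on a single numeric column, one common detector is the Population Stability Index (PSI), which compares binned frequencies between a reference sample and a recent sample. The sketch below is a minimal version; the bin edges, the epsilon floor, and the frequently quoted 0.2 alert threshold are conventions rather than fixed rules. Concept drift is not covered here, since detecting it generally requires labels or a proxy for them.

```python
from math import log
from typing import List


def psi(expected: List[float], actual: List[float], edges: List[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two samples over shared bin edges."""

    def fractions(values: List[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value strictly exceeds.
            counts[sum(v > e for e in edges)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    exp_frac, act_frac = fractions(expected), fractions(actual)
    return sum((a - e) * log(a / e) for e, a in zip(exp_frac, act_frac))


reference = [i / 100 for i in range(100)]     # training-time feature values
shifted = [v + 0.3 for v in reference]        # the same feature after a shift
edges = [0.25, 0.5, 0.75]
print(psi(reference, reference, edges), psi(reference, shifted, edges) > 0.2)
```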

Question 16

How does DVC differ from a feature store, and when would you reach for each?

Accepted Answer

DVC versions raw datasets and model artifacts as content-addressed snapshots referenced from Git commits, and lakeFS provides similar Git-like commits over object storage; both give reproducibility and rollback. A feature store manages computed features for training and serving, and its main job is to keep offline and online feature definitions in sync to prevent training-serving skew. They are complementary: DVC answers "what data made this model?", while a feature store answers "how do I serve the same features consistently?".

Question 17

What is the difference between a database and a cache, and how does cache-aside work?

Accepted Answer

A database normally owns durable, authoritative records and transactional constraints. A cache keeps a temporary copy optimized for repeated low-latency reads. In cache-aside, the application checks the cache, reads the database on a miss, then populates the cache with a TTL; correctness still requires invalidation, stampede protection, and database fallback.
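The read path described above fits in a few lines. This is a single-process sketch with an injectable clock so it can be tested; `load_from_db` stands in for a real query, and stampede protection (for example a per-key lock) is deliberately left out.

```python
import time


class CacheAside:
    def __init__(self, load_from_db, ttl_seconds: float, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry and entry[1] > self._clock():
            return entry[0]              # hit: serve the cached copy
        value = self._load(key)          # miss or expired: the database is the authority
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Call after a write so readers do not see the stale copy until TTL expiry."""
        self._entries.pop(key, None)


db_reads = []


def fake_db(key):
    db_reads.append(key)
    return f"row-{key}"


now = [0.0]
cache = CacheAside(fake_db, ttl_seconds=60, clock=lambda: now[0])
cache.get("u1")    # miss: reads the database
cache.get("u1")    # hit: no database read
now[0] = 61.0
cache.get("u1")    # TTL expired: reads the database again
print(len(db_reads))
```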

Question 18

What is LLM model routing and how does an LLM cascade work?

Accepted Answer

Model routing sends each query to the most appropriate model based on difficulty, cost, or capability, instead of always using the largest model. A cascade is a sequential form: try the cheapest or smallest model first and only escalate to a larger model if the answer fails a quality or confidence check, reducing average cost while preserving quality on hard queries.
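A cascade is essentially a loop with an early exit. In the sketch below the two "models" are hypothetical stubs that return an answer and a self-reported confidence; in practice the check might be a verifier model, a log-probability threshold, or a schema validation.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], Tuple[str, float]]  # returns (answer, confidence in [0, 1])


def cascade(query: str, models: List[Tuple[str, Model]], min_confidence: float) -> Tuple[str, str]:
    """Try models cheapest-first and return (model_name, answer) from the first confident one."""
    if not models:
        raise ValueError("need at least one model")
    for name, model in models:
        answer, confidence = model(query)
        if confidence >= min_confidence:
            return name, answer
    # Nobody passed the check: keep the answer from the last (largest) model.
    return name, answer


def small_model(query: str) -> Tuple[str, float]:
    # Hypothetical cheap model: only confident on short queries.
    return "small answer", 0.9 if len(query.split()) <= 5 else 0.3


def large_model(query: str) -> Tuple[str, float]:
    # Hypothetical expensive model.
    return "large answer", 0.95


MODELS = [("small", small_model), ("large", large_model)]
easy = cascade("capital of France?", MODELS, min_confidence=0.8)
hard = cascade("compare these two contracts clause by clause and list conflicts", MODELS, 0.8)
print(easy[0], hard[0])
```

The cost saving comes from how often the loop exits on the first model, so the confidence check needs to be cheap and reasonably well calibrated.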

Question 19

How do you attribute and control ML spend across teams and models (FinOps for ML)?

Accepted Answer

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.

Question 20

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Accepted Answer

Work top-down: start at the model layer with quantization, distillation, or routing cheaper models for easy requests, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

Question 21

What is data poisoning, and why is loading a pickle model file dangerous?

Accepted Answer

Data poisoning is an attack where an adversary injects malicious or mislabeled examples into the training data to bias the model, create backdoors, or degrade it, and it is hard to detect because the model still trains successfully. Loading a pickle model is dangerous because Python's pickle executes arbitrary code on deserialization, so a malicious .pkl or .pt file from an untrusted source can run attacker code the moment you load it. Defenses include trusted data provenance and validation, and using safe formats like safetensors plus scanning model files.

Question 22

How do Docker and ONNX complement each other for packaging and deploying ML models portably?

Accepted Answer

Docker encapsulates the user-space runtime environment (OS libraries, Python version, system packages) so the model runs the same way on any compatible host. ONNX provides a hardware- and framework-agnostic model format so a model trained in PyTorch can be executed by a high-performance runtime like ONNX Runtime without the training framework as a dependency.

Question 23

What is model quantization, and how does it affect quality?

Accepted Answer

Quantization stores weights and sometimes activations in lower-precision formats to cut memory and speed up inference, ranging from 16-bit (FP16 or BF16) down to INT8 and INT4. Lower precision saves more memory but can degrade accuracy; techniques like calibration, GPTQ, AWQ, and keeping sensitive layers higher-precision minimize the loss.
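The basic mechanism can be shown with symmetric per-tensor INT8 quantization in plain Python. This is a teaching sketch, not how GPTQ or AWQ work: those methods choose scales and roundings more carefully to limit accuracy loss.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale for all-zero input
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, [round(r, 3) for r in restored])
```

The small weight 0.003 rounds to zero, which illustrates where the accuracy loss comes from: every value is snapped to the nearest multiple of the scale, so the error per weight is at most half a step.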

Question 24

How do you safely promote a model to production using a model registry?

Accepted Answer

Register every candidate as an immutable, versioned artifact, then move it through environments (dev to staging to prod) gated by automated checks rather than promoting straight to prod. In modern MLflow you use aliases like champion and challenger instead of the deprecated stage labels, and promotion is a governed, auditable action with sign-off and an easy rollback by repointing the alias. Always validate in staging and roll out progressively (canary or shadow) before full traffic.
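The alias mechanics are easy to see in a toy in-memory registry. The class below is not the MLflow API; it is a hypothetical sketch showing why immutable versions plus movable aliases make promotion and rollback the same cheap, auditable operation.

```python
class ToyRegistry:
    def __init__(self):
        self._versions = {}   # version number -> artifact URI, never mutated after register
        self._aliases = {}    # alias -> version number
        self.audit_log = []   # (alias, old_version, new_version, approved_by)

    def register(self, artifact_uri: str) -> int:
        version = len(self._versions) + 1
        self._versions[version] = artifact_uri
        return version

    def set_alias(self, alias: str, version: int, approved_by: str) -> None:
        if version not in self._versions:
            raise KeyError(f"unknown version {version}")
        self.audit_log.append((alias, self._aliases.get(alias), version, approved_by))
        self._aliases[alias] = version

    def resolve(self, alias: str) -> str:
        """What the serving layer calls to find the artifact to load."""
        return self._versions[self._aliases[alias]]


registry = ToyRegistry()
v1 = registry.register("models/churn/v1")   # hypothetical artifact paths
v2 = registry.register("models/churn/v2")
registry.set_alias("champion", v1, approved_by="alice")
registry.set_alias("challenger", v2, approved_by="alice")
registry.set_alias("champion", v2, approved_by="bob")   # promotion after checks pass
registry.set_alias("champion", v1, approved_by="bob")   # rollback: repoint the alias
print(registry.resolve("champion"))
```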

Question 25

What is a model registry, and how does model versioning work in production ML systems?

Accepted Answer

A model registry is a centralised store that tracks every trained model artifact alongside its metadata — hyperparameters, training data version, evaluation metrics, and lineage. Versioning assigns unique identifiers to each artifact and manages lifecycle stages so teams can promote, roll back, and audit models without manual file management.

Question 26

How do you safely roll back a model in production and what triggers a rollback?

Accepted Answer

A rollback reverts serving traffic to a known-good model version when the newly deployed model shows metric regression beyond a tolerance threshold. Safe rollback requires versioned model artifacts, traffic-routing control, and pre-defined automated or manual triggers — not ad hoc decisions under pressure.

Question 27

What are the security and compatibility risks of using pickle for model serialization, and what are the safer alternatives?

Accepted Answer

Unpickling can invoke arbitrary Python callables chosen by whoever created the file, so loading an untrusted pickle file is equivalent to running arbitrary code on your machine. Beyond security, pickle artifacts are tightly coupled to the exact Python and library versions used to create them, making them fragile across environments. Safer alternatives include weights-only formats like safetensors and framework-neutral formats like ONNX.
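A harmless demonstration makes the risk concrete. The class below uses `__reduce__` to tell pickle which callable to run at load time; here it is the built-in `sorted`, but an attacker would name something like `os.system` instead. The class name is hypothetical.

```python
import pickle


class Payload:
    """Stand-in for a malicious object embedded in a model file."""

    def __reduce__(self):
        # Instructs pickle: "to rebuild me, call sorted([3, 1, 2])".
        return (sorted, ([3, 1, 2],))


blob = pickle.dumps(Payload())
loaded = pickle.loads(blob)   # the callable chosen by the file's author runs here
print(type(loaded).__name__, loaded)
```

The loader never gets a `Payload` back; it gets whatever the embedded call returned, and the call has already happened by the time `pickle.loads` returns.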

Question 28

What metrics should you monitor for a production ML model, and at what layer?

Accepted Answer

Production ML monitoring spans four layers: data quality (schema, distributions, null rates), model behaviour (prediction drift, confidence calibration), operational health (latency, error rate, throughput), and business KPIs (conversion, revenue impact). Each layer has different owners and different alert thresholds.

Question 29

What is the difference between monitoring and distributed tracing?

Accepted Answer

Monitoring aggregates health signals such as request rate, error rate and latency percentiles across systems and time, detecting that a population is unhealthy. Distributed tracing follows one request through nested spans across services, localizing where it waited or failed. Metrics trigger investigation; traces explain individual paths.

Question 30

What does it mean for a pipeline task to be idempotent, and why does it matter for backfills and retries?

Accepted Answer

An idempotent task produces the same result whether it runs once or many times, typically by writing to a deterministic partition and overwriting rather than appending. This matters because orchestrators retry failed tasks and run backfills over historical dates, and non-idempotent tasks would double-count or corrupt data on re-runs. Designing tasks to be idempotent and partitioned by execution date makes retries and backfills safe and reproducible.
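The difference between appending and overwriting a partition shows up as soon as a task runs twice. In this sketch a list and a dict stand in for a table; the function names and the date string are illustrative only.

```python
def append_load(table: list, rows: list, run_date: str) -> None:
    """Non-idempotent: every retry adds the same rows again."""
    table.extend({"date": run_date, **row} for row in rows)


def overwrite_partition(table: dict, rows: list, run_date: str) -> None:
    """Idempotent: the partition for run_date is replaced wholesale."""
    table[run_date] = [dict(row) for row in rows]


rows = [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}]

appended = []
partitioned = {}
for _ in range(2):   # simulate the original run plus one orchestrator retry
    append_load(appended, rows, "2024-01-01")
    overwrite_partition(partitioned, rows, "2024-01-01")

print(len(appended), len(partitioned["2024-01-01"]))
```

After the retry the appended table holds four rows and double-counts clicks, while the partitioned table still holds two.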

Question 31

How do you achieve reproducibility in ML training pipelines — covering seeds, environment, and data versioning?

Accepted Answer

Full ML reproducibility requires locking three layers: the random seed across all frameworks, the software environment via pinned dependency manifests or container images, and the training data via content-addressed versioning. Missing any one layer means the same code can produce different models on different runs or machines.

Question 32

What goes in a model card, and how do you provide explainability for production decisions?

Accepted Answer

A model card documents a model's intended use, training data, evaluation results broken down by relevant subgroups, known limitations, and ethical considerations, so stakeholders can judge whether and where it should be used. Explainability is provided through methods like SHAP or LIME for feature attributions, plus logging the inputs and reasons behind each decision so it can be audited or contested. Together they support transparency, oversight, and regulatory requirements for high-risk systems.

Question 33

When would you choose gRPC over REST for model serving, and what are the practical trade-offs?

Accepted Answer

gRPC uses HTTP/2 and Protocol Buffers to deliver lower latency, strongly typed contracts, and built-in streaming, making it the better choice for high-throughput internal model services. REST remains the standard for public-facing APIs where broad client compatibility and human-readable payloads matter more than raw performance.

Question 34

How do you decide when to retrain a model, and how do you do it safely?

Accepted Answer

Choose between scheduled retraining on a fixed cadence and trigger-based retraining fired by monitored drift or a performance drop, picking based on how fast the data distribution changes and how good your monitoring is. Retrain safely by treating it as an automated pipeline that validates data, trains, and gates the new model against the current champion on held-out and business metrics before promotion. Then roll out progressively with shadow or canary so a bad model never fully replaces the champion.

Question 35

What is the difference between shadow deployment and canary deployment for ML models, and when do you use each?

Accepted Answer

Shadow deployment mirrors live traffic to the new model but never returns its predictions to users, so you can compare its outputs against the current model and measure performance under load without any user impact. Canary deployment routes a small real slice of traffic to the new model and uses its predictions, so real user impact is possible but limited and monitored.
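The two modes differ only in whose prediction is returned. The router below is a minimal sketch: the function names, the mode strings, and the choice to append shadow outputs to a list for offline comparison are all assumptions, and a production shadow call would run asynchronously so it cannot slow down the live response.

```python
import hashlib


def in_canary(request_id: str, canary_percent: int) -> bool:
    """Deterministically map an id to a bucket 0..99 so assignment is sticky."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent


def serve(request_id, features, champion, challenger, mode, canary_percent=5, shadow_log=None):
    if mode == "shadow":
        if shadow_log is not None:
            # Recorded for offline comparison, never returned to the caller.
            shadow_log.append((request_id, challenger(features)))
        return champion(features)
    if mode == "canary" and in_canary(request_id, canary_percent):
        return challenger(features)   # a real user sees the new model's output
    return champion(features)


def champion(features):
    return "champion"


def challenger(features):
    return "challenger"


log = []
shadow_result = serve("req-1", {}, champion, challenger, mode="shadow", shadow_log=log)
print(shadow_result, log)
```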

Question 36

What is the difference between a stateless and stateful service?

Accepted Answer

A stateless service instance can handle the next request without relying on its own prior request history; durable state may still live in an external database or session store. A stateful component owns evolving information needed for correctness, so replacement requires replication, checkpoint recovery, replay, or reassignment.

Question 37

What are behavioral tests for ML models (invariance, directional, and minimum-functionality tests)?

Accepted Answer

Behavioral tests check a model's input-output behavior against expectations rather than just aggregate accuracy, an idea popularized by the CheckList framework. Invariance tests assert that label-preserving perturbations do not change the prediction, directional tests assert a change moves the output the expected way, and minimum-functionality tests are simple cases the model must get right. They catch real-world failures that high overall accuracy can hide.
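The three test types look like ordinary unit tests once you have a model to call. The keyword-counting "model" below is a hypothetical stand-in so the example runs on its own; with a real model the assertions would usually allow a small tolerance rather than exact equality.

```python
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def toy_sentiment(text: str) -> float:
    """Hypothetical stand-in model: score in [-1, 1] from keyword counts."""
    words = text.lower().replace(".", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return max(-1.0, min(1.0, score / 2))


def test_minimum_functionality():
    # Simple cases the model must get right.
    assert toy_sentiment("I love this") > 0
    assert toy_sentiment("This is awful") < 0


def test_invariance():
    # Swapping a name preserves the label, so the prediction must not change.
    assert toy_sentiment("Alice had a great flight") == toy_sentiment("Bob had a great flight")


def test_directional():
    # Appending a negative clause must not raise the score.
    base = "The food was good"
    assert toy_sentiment(base + " but the service was bad") <= toy_sentiment(base)


for test in (test_minimum_functionality, test_invariance, test_directional):
    test()
print("all behavioral tests passed")
```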

Question 38

How do you test an ML system, and what is the ML Test Score?

Accepted Answer

Unlike traditional software, ML systems need tests across four areas: the data, the model and training, the infrastructure and pipeline, and ongoing monitoring, because behavior depends on data, not just code. Google's ML Test Score is a rubric of 28 actionable tests across those four categories that scores a system's production readiness and technical debt. A low score flags fragile, hard-to-maintain systems even if offline accuracy looks good.

Question 39

What is a feature store and why is it critical for production ML systems?

Accepted Answer

A feature store is a shared data platform that computes, stores, and serves ML features consistently for both training and serving. It guards against training-serving skew by defining each transformation once and using it in both contexts, and it reduces duplicated work by letting teams share and discover features across models.

Question 40

When and how should you trigger model retraining — scheduled vs. event-driven?

Accepted Answer

Scheduled retraining is simple and predictable but wastes compute when nothing has shifted and reacts slowly when drift is sudden. Event-driven retraining ties compute to evidence — a drift alarm, a performance threshold breach, or a data volume trigger — and is more efficient at scale. Most mature systems combine both.

Question 41

Why does a model that performed well in offline evaluation degrade in production?

Accepted Answer

Production degradation stems from distributional shift between training and serving data, upstream pipeline changes, feedback loops, and the static nature of a trained model against a changing world. Offline evaluation on a held-out slice of historical data cannot simulate these dynamics.

Question 42

When would you use a multi-armed bandit or shadow deployment instead of a fixed A/B test?

Accepted Answer

A fixed A/B test holds traffic splits constant to get a clean, statistically powered comparison, which is ideal when you need a trustworthy ship decision. A multi-armed bandit dynamically shifts traffic toward the better-performing model, reducing regret when you can't run long enough for significance or when the best arm may change. Shadow deployment sends real traffic to the new model without serving its outputs, so you validate behavior and latency risk-free before any user is exposed.
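The "dynamically shifts traffic" behaviour can be sketched with Thompson sampling over Bernoulli rewards. This is one bandit algorithm among several, and the true conversion rates below are simulated values chosen for illustration; in production the reward comes from real user outcomes.

```python
import random
from typing import List


def thompson_sampling(true_rates: List[float], rounds: int, seed: int = 0) -> List[int]:
    """Simulate a Bernoulli bandit and return how often each arm was served."""
    rng = random.Random(seed)
    wins = [1] * len(true_rates)     # Beta(1, 1) uniform prior per arm
    losses = [1] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Sample a plausible conversion rate per arm and serve the best sample.
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(len(true_rates))]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        # Simulated user outcome for the served arm.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls


pulls = thompson_sampling([0.05, 0.30], rounds=3000)
print(pulls)
```

Unlike a fixed split, most traffic ends up on the stronger arm, which lowers regret but also means the weaker arm gets fewer samples and a less precise estimate.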

Question 43

What is the confused deputy problem in agent systems, and how does it relate to agent-to-agent authentication?

Accepted Answer

A confused deputy occurs when an agent uses its elevated permissions to perform an action on behalf of a less-privileged caller that the caller could not do directly, leading to privilege escalation. The root cause is that a trusted agent acts on natural-language requests, including from other agents, without verifying the originator's authority, so robust systems propagate identity and scope on every hop and enforce access control on agent-to-agent calls.

Question 44

How would you prevent an AI agent from leaking or misusing API credentials?

Accepted Answer

Keep raw credentials outside model context and traces. Let the model propose typed intent, authorize the final action and arguments deterministically, then have a trusted executor inject a short-lived, narrowly scoped, audience-restricted credential for one call. Re-authorize downstream and gate high-impact writes with explicit approval.

Question 45

How do you evaluate an agentic system, and what is the difference between trajectory and outcome evaluation?

Accepted Answer

Outcome evaluation checks whether the agent's final result is correct, while trajectory evaluation inspects the intermediate steps, tool calls, and decisions along the way. You need both because an agent can reach the right answer through a flawed path or fail despite sound reasoning; trajectory metrics catch wrong tool use, redundant steps, and loops that outcome-only metrics miss.

Question 46

How do you optimise GPU utilization for model serving, and what role does dynamic batching play?

Accepted Answer

GPUs run tensor operations efficiently only when each call carries enough parallel work to keep the device busy, which for single-request inference usually means a larger batch dimension. Dynamic batching collects individual requests arriving within a short window and fuses them into a single GPU call, substantially improving throughput and cost efficiency while capping the added per-request latency at the configured wait threshold.
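The batching policy itself can be simulated offline from arrival timestamps. The sketch below is an approximation: a real server flushes on a timer rather than waiting for the next arrival, and the parameter names here are assumptions rather than any serving framework's configuration keys.

```python
from typing import List


def form_batches(arrivals_ms: List[int], max_batch_size: int, max_wait_ms: int) -> List[List[int]]:
    """Group sorted arrival times into batches.

    A batch closes when it is full, or when the next request arrives after the
    first request in the batch has already waited longer than max_wait_ms.
    """
    batches, current = [], []
    for t in arrivals_ms:
        if current and (len(current) == max_batch_size or t - current[0] > max_wait_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# A burst of four requests, a pair, then a straggler.
batches = form_batches([0, 1, 2, 3, 20, 21, 50], max_batch_size=3, max_wait_ms=5)
print(batches)
```

Seven requests become four GPU calls instead of seven. Raising `max_wait_ms` would merge more of them at the cost of extra queuing delay, which is the latency versus throughput dial.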

Question 47

How do you monitor a model when ground-truth labels are delayed or never arrive?

Accepted Answer

When true labels are unavailable or arrive weeks late, you monitor leading indicators instead: input distribution drift, output score distribution shift, proxy business metrics, and inter-model disagreement. These act as early-warning signals before any labelled evaluation becomes possible.

Question 48

How do you balance latency and throughput trade-offs when designing a model serving system?

Accepted Answer

Latency is the time to serve a single request; throughput is the number of requests served per second. They are in tension because batching requests improves GPU utilization and throughput but adds queuing delay. The design goal is to meet the latency SLA at the highest possible throughput.

Question 49

How does LLMOps differ from classical MLOps, and what new operational challenges do LLMs introduce?

Accepted Answer

LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.

Question 50

What are the major security risks of deploying autonomous agents?

Accepted Answer

Key risks include prompt injection that hijacks the agent (especially indirect injection via tool or retrieval outputs), excessive tool permissions enabling damaging actions, data exfiltration, confused-deputy privilege escalation, and unbounded loops driving cost or harm. Mitigations include least-privilege tools, sandboxing, input and output guardrails, human-in-the-loop approval for sensitive actions, and audit logging.