datarekha
MLOps | Easy | Asked at Google, Amazon, Microsoft, Palantir

Walk me through the full ML lifecycle from problem definition to model retirement.

The short answer

The ML lifecycle spans eight phases: problem framing, data collection and validation, feature engineering, training and experimentation, offline evaluation, deployment, production monitoring, and retirement or retraining. Each phase has distinct owners, artefacts, and failure modes that an MLOps practice must systematise.

How to think about it

The lifecycle question is a systems-thinking test. Interviewers want to see that you understand handoffs, failure modes, and feedback loops — not just training code.

Phase 1: Problem framing

Define the business metric, the proxy ML objective, constraints (latency, interpretability, fairness), and the minimum acceptable offline performance that justifies deployment cost. A badly framed objective (optimising click probability when the business cares about revenue per session) wastes all subsequent effort.
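
One way to make the framing concrete is to write it down as a small, reviewable record before any data work starts. The sketch below assumes a hypothetical `ProblemSpec` dataclass. The field names, the NDCG@10 metric, and the 0.42 threshold are illustrative placeholders, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemSpec:
    """Hypothetical framing record, agreed before any data work starts."""
    business_metric: str          # what the business actually cares about
    ml_objective: str             # the proxy the model optimises
    offline_metric: str           # metric used for the deployment gate
    min_offline_score: float      # below this, deployment cost is not justified
    max_latency_ms: float         # serving constraint
    constraints: tuple[str, ...] = ()  # e.g. interpretability, fairness

    def justifies_deployment(self, offline_score: float) -> bool:
        """True only if a candidate clears the agreed offline bar."""
        return offline_score >= self.min_offline_score


# Illustrative values only.
spec = ProblemSpec(
    business_metric="revenue per session",
    ml_objective="expected revenue per ranked item",
    offline_metric="NDCG@10",
    min_offline_score=0.42,
    max_latency_ms=50.0,
    constraints=("interpretability", "fairness"),
)
```

Writing the proxy objective next to the business metric makes a mismatch such as click probability versus revenue per session visible at review time.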

Phase 2: Data collection and validation

Source raw data, define a labelling strategy, and audit for bias and leakage. Label a representative sample; establish a data contract (schema, freshness SLAs) with upstream owners. Document lineage.
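
A data contract is easiest to enforce when it is executable. The following is a minimal sketch. It assumes a hypothetical `orders` feed with three columns and a 6-hour freshness SLA. `CONTRACT` and `validate_batch` are illustrative names. In practice, a validation library or the upstream platform's own checks would usually play this role.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract agreed with the owner of an "orders" feed.
CONTRACT = {
    "schema": {"order_id": str, "amount": float, "created_at": datetime},
    "max_staleness": timedelta(hours=6),  # freshness SLA
}


def validate_batch(rows, contract, now):
    """Return a list of contract violations; an empty list means the batch passes."""
    errors = []
    schema = contract["schema"]
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
        for col, expected in schema.items():
            if col in row and not isinstance(row[col], expected):
                errors.append(
                    f"row {i}: {col} is {type(row[col]).__name__}, "
                    f"expected {expected.__name__}"
                )
    # Freshness: the newest record must fall inside the SLA window.
    stamps = [r["created_at"] for r in rows if isinstance(r.get("created_at"), datetime)]
    if not stamps:
        errors.append("no valid timestamps; cannot check freshness")
    elif now - max(stamps) > contract["max_staleness"]:
        errors.append(f"stale data: newest record is {now - max(stamps)} old")
    return errors


now = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)
good = [{"order_id": "a1", "amount": 9.5, "created_at": now - timedelta(hours=1)}]
bad = [{"order_id": "a2", "amount": "9.5", "created_at": now - timedelta(days=2)}]
```

Here `validate_batch(good, CONTRACT, now)` returns an empty list, while the `bad` batch reports both a type violation and a freshness violation.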

Phase 3: Feature engineering

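Build features using point-in-time correct joins, so that each training example only sees feature values that were available when the event happened; this avoids leakage. Store features in a shared feature store if the organisation has one, so that training and serving use the same feature definitions. Version the feature set as an artefact alongside the model.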
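
Point-in-time correctness can be illustrated with pandas `merge_asof`. This function attaches to each label event the most recent feature value known at or before the event time. The tables and column names (`user_id`, `event_ts`, `feature_ts`, `purchases_30d`) are hypothetical. A feature store would typically perform the equivalent join for you.

```python
import pandas as pd

# Label events: when each prediction would have been made.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-01-05", "2024-01-10", "2024-01-07"]),
    "label": [0, 1, 1],
})

# Feature snapshots: when each feature value became known.
features = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "feature_ts": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-03", "2024-01-09"]),
    "purchases_30d": [2, 5, 1, 7],
})

# merge_asof requires both frames to be sorted by their time keys.
training = pd.merge_asof(
    events.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",  # only values known at or before the event
)
```

For user 2, the snapshot dated 2024-01-09 is ignored for the 2024-01-07 event because it did not exist yet. A plain join on the latest snapshot would have leaked it into training.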

Phase 4: Training and experimentation

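Run tracked experiments (for example with MLflow or W&B), logging the code version, data hashes, and hyperparameters for every run. Use cross-validation when the dataset is small or noisy enough that a single validation split gives an unstable estimate. Document why the champion model was selected.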
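
The sketch below uses only the standard library to show what a tracked run needs to capture: a code version, a deterministic data hash, hyperparameters, metrics, and a recorded reason for the champion choice. The function names and the `abc123` commit are hypothetical. Tools such as MLflow or W&B store the same information for you.

```python
import hashlib
import json


def data_fingerprint(rows):
    """SHA-256 of a canonical JSON encoding: identical data gives an identical hash."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_run_record(code_version, data_hash, params, metrics):
    """Everything needed to reproduce and compare one run."""
    return {
        "code_version": code_version,  # e.g. a git commit SHA
        "data_hash": data_hash,
        "params": dict(params),
        "metrics": dict(metrics),
    }


def pick_champion(runs, metric):
    """Highest value of `metric` wins; the rationale is returned with the choice."""
    best = max(runs, key=lambda r: r["metrics"][metric])
    return best, f"selected on highest {metric}={best['metrics'][metric]}"


train_rows = [{"x": 1.0, "y": 0}, {"x": 2.5, "y": 1}]
data_hash = data_fingerprint(train_rows)
runs = [
    make_run_record("abc123", data_hash, {"max_depth": 4}, {"val_auc": 0.81}),
    make_run_record("abc123", data_hash, {"max_depth": 8}, {"val_auc": 0.84}),
]
champion, rationale = pick_champion(runs, "val_auc")
```

Sorting keys before hashing means that reordering the fields inside a record does not change the fingerprint. However, reordering or editing rows does change it.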

Phase 5: Offline evaluation

Evaluate on a held-out test set drawn from a later time period than the training window (temporal split, not random split). Measure disaggregated performance across demographic or segment slices. Run fairness audits where applicable.
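
Here is a minimal sketch of both ideas. It assumes each row carries a timestamp `ts`, a segment column (here a hypothetical `region`), the true `label`, and the model's `pred`. The data is split by time rather than at random, and accuracy is reported per slice instead of as a single average.

```python
from collections import defaultdict
from datetime import date


def temporal_split(rows, cutoff):
    """Train on everything before `cutoff`, test on everything at or after it."""
    train = [r for r in rows if r["ts"] < cutoff]
    test = [r for r in rows if r["ts"] >= cutoff]
    return train, test


def sliced_accuracy(rows, slice_key):
    """Accuracy per segment, so a weak slice is not hidden by the overall average."""
    hits, counts = defaultdict(int), defaultdict(int)
    for r in rows:
        counts[r[slice_key]] += 1
        hits[r[slice_key]] += int(r["pred"] == r["label"])
    return {s: hits[s] / counts[s] for s in counts}


rows = [
    {"ts": date(2024, 1, 10), "region": "EU", "label": 1, "pred": 1},
    {"ts": date(2024, 2, 3), "region": "EU", "label": 0, "pred": 0},
    {"ts": date(2024, 2, 5), "region": "US", "label": 1, "pred": 0},
    {"ts": date(2024, 2, 9), "region": "US", "label": 1, "pred": 1},
]
train, test = temporal_split(rows, cutoff=date(2024, 2, 1))
print(sliced_accuracy(test, "region"))  # {'EU': 1.0, 'US': 0.5}
```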

Phase 6: Deployment

Package the model (container, ONNX, or framework-native format). Shadow-deploy or canary the new model, and gate promotion on latency SLAs and business KPIs. Register the promoted model in a model registry together with its metadata: code version, training data and feature set versions, and evaluation results.
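
The promotion gate can be expressed as an explicit function so that the decision is reproducible. This sketch rests on assumed policy choices: a nearest-rank p99 latency is compared against the SLA, and the canary's KPI (higher is better) may not fall more than 1% below the champion's. Both the function names and the thresholds are hypothetical.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile, using integer ceiling arithmetic."""
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def canary_gate(canary_latencies_ms, latency_sla_ms, champion_kpi, canary_kpi,
                max_kpi_drop=0.01):
    """Return (promote, reasons). `max_kpi_drop` is a relative tolerance."""
    reasons = []
    observed = p99(canary_latencies_ms)
    if observed > latency_sla_ms:
        reasons.append(f"p99 latency {observed} ms exceeds SLA {latency_sla_ms} ms")
    if canary_kpi < champion_kpi * (1 - max_kpi_drop):
        reasons.append(
            f"KPI {canary_kpi} is more than {max_kpi_drop:.0%} below champion {champion_kpi}"
        )
    return (not reasons, reasons)


# Illustrative canary: 95 fast requests, 5 slower ones, KPI slightly up.
promote, reasons = canary_gate([20] * 95 + [45] * 5, latency_sla_ms=50,
                               champion_kpi=0.120, canary_kpi=0.121)
```

Returning the reasons alongside the decision gives the registry entry a record of why a candidate was, or was not, promoted.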

Phase 7: Production monitoring

Monitor data quality, prediction drift, operational metrics, and business KPIs continuously. Set alert thresholds. Maintain an inference log for retrospective evaluation when labels arrive.
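
Prediction and feature drift are often summarised with the Population Stability Index (PSI). PSI compares binned proportions in a reference sample against a live sample. The following is a minimal sketch. It assumes bin edges fixed on the reference data and uses a small `eps` to avoid log(0). The 0.1 and 0.25 cut-offs often quoted for PSI are rules of thumb, not standards.

```python
import bisect
import math


def psi(reference, live, edges, eps=1e-6):
    """Population Stability Index of `live` against `reference`.

    Bins are [edges[k], edges[k + 1]); values outside the outer edges are
    clipped into the first or last bin.
    """
    n_bins = len(edges) - 1

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            k = bisect.bisect_right(edges, v) - 1
            counts[min(max(k, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = proportions(reference), proportions(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


edges = [0, 10, 20, 30, 40]
reference = [5, 15, 25, 35] * 25
live_same = [5, 15, 25, 35] * 25
live_shifted = [25, 35, 35, 35] * 25
print(psi(reference, live_same, edges))  # 0.0
```

Logging the live sample alongside each prediction (the inference log) is what makes a check like this possible to rerun later.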

Phase 8: Retraining and retirement

Retrain on a defined trigger (scheduled or event-driven). Run the same offline evaluation and deployment gate before promotion. Retire the model when a replacement is fully promoted, or when the problem itself no longer exists — document the decision.
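
A retraining trigger can combine both kinds of condition. The sketch below assumes a hypothetical 30-day maximum model age (the scheduled trigger) and a threshold on a drift score from monitoring, such as PSI (the event-driven trigger). A `True` result should only start a new candidate, which must then pass the same offline evaluation and deployment gate.

```python
from datetime import datetime, timedelta


def should_retrain(last_trained, now, drift_score,
                   max_age=timedelta(days=30), drift_threshold=0.25):
    """Scheduled trigger (model too old) OR event-driven trigger (drift too high)."""
    reasons = []
    if now - last_trained >= max_age:
        reasons.append("scheduled: model age exceeds max_age")
    if drift_score > drift_threshold:
        reasons.append(f"event: drift {drift_score:.2f} > {drift_threshold}")
    return bool(reasons), reasons


trained = datetime(2024, 1, 1)
fresh_and_stable = should_retrain(trained, datetime(2024, 1, 10), drift_score=0.05)
drifted = should_retrain(trained, datetime(2024, 1, 10), drift_score=0.40)
too_old = should_retrain(trained, datetime(2024, 2, 15), drift_score=0.05)
```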
