Walk me through the full ML lifecycle from problem definition to model retirement.
The ML lifecycle spans eight phases: problem framing, data collection and validation, feature engineering, training and experimentation, offline evaluation, deployment, production monitoring, and retraining and retirement. Each phase has distinct owners, artefacts, and failure modes that an MLOps practice must systematise.
How to think about it
The lifecycle question is a systems-thinking test. Interviewers want to see that you understand handoffs, failure modes, and feedback loops — not just training code.
Phase 1: Problem framing
Define the business metric, the proxy ML objective, constraints (latency, interpretability, fairness), and the minimum acceptable offline performance that justifies deployment cost. A badly framed objective (optimising click probability when the business cares about revenue per session) wastes all subsequent effort.
Phase 2: Data collection and validation
Source raw data, define a labelling strategy, and audit for bias and leakage. Label a representative sample, and establish a data contract (schema, freshness SLAs) with upstream owners. Document lineage so that every training row can be traced back to its source.
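A data contract is easier to discuss with a concrete check in hand. The sketch below is a minimal illustration, assuming a contract consists only of expected column types and a maximum staleness; `DataContract`, `validate_batch`, and the column names are hypothetical, not part of any particular library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: expected column types plus a freshness SLA."""
    schema: dict  # column name -> expected Python type
    max_staleness: timedelta


def validate_batch(rows, latest_event_time, contract, now):
    """Return a list of violations; an empty list means the batch passes."""
    violations = []
    for i, row in enumerate(rows):
        missing = contract.schema.keys() - row.keys()
        if missing:
            violations.append(f"row {i}: missing columns {sorted(missing)}")
        for col, expected in contract.schema.items():
            if col in row and not isinstance(row[col], expected):
                violations.append(f"row {i}: {col} is not {expected.__name__}")
    # Freshness: the newest event in the batch must be recent enough.
    if now - latest_event_time > contract.max_staleness:
        violations.append("freshness SLA breached")
    return violations


contract = DataContract(schema={"user_id": int, "amount": float},
                        max_staleness=timedelta(hours=24))
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [{"user_id": 1, "amount": 9.5}, {"user_id": "2", "amount": 3.0}]

# The second row has a string user_id and the batch is 30 hours old.
violations = validate_batch(rows, now - timedelta(hours=30), contract, now)
print(violations)
```

In practice a check like this would run at ingestion time and block the training pipeline when it reports violations, which is what makes the contract enforceable rather than aspirational.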
Phase 3: Feature engineering
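Build features using point-in-time correct joins, so that each training row sees only the feature values that were available at its label timestamp; this avoids leakage. Register features in a shared feature store if the organisation has one. Version the feature set as an artefact alongside the model.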
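A point-in-time join can be sketched in a few lines. The example below is a simplified illustration using only the standard library: for each label event it looks up the latest feature value recorded at or before the event time. The helper `point_in_time_value` and the sample data are hypothetical, and it assumes a feature written at time t is already usable at time t; if features arrive with a delay, the comparison should be stricter.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest feature value with timestamp <= as_of, else None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical feature history for one user: (day, purchases_so_far).
purchase_count_history = [(1, 0), (5, 2), (9, 7)]

# Label events observed on different days.
label_events = [{"day": 0}, {"day": 5}, {"day": 8}]

training_rows = [
    {**event,
     "purchase_count": point_in_time_value(purchase_count_history, event["day"])}
    for event in label_events
]
print(training_rows)
```

The event on day 8 receives the value recorded on day 5, not the one from day 9. A naive join on the latest value would have used the day 9 figure and leaked future information into training.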
Phase 4: Training and experimentation
Run tracked experiments (MLflow, W&B) with versioned code, data hashes, and hyperparameters logged. Use cross-validation when the dataset is small enough that a single train/validation split gives a high-variance performance estimate. Document the champion selection rationale.
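To make "versioned code, data hashes, and hyperparameters" concrete, here is a tool-agnostic sketch of the record a tracker would store for one run. It deliberately does not use the MLflow or W&B APIs; `data_fingerprint`, `build_run_record`, the commit string, and the metric value are all illustrative assumptions.

```python
import hashlib
import json


def data_fingerprint(rows):
    """SHA-256 over a canonical JSON encoding of the training rows."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_run_record(code_version, rows, hyperparameters, metrics):
    """Assemble the metadata a tracker would store for one training run."""
    return {
        "code_version": code_version,
        "data_sha256": data_fingerprint(rows),
        "hyperparameters": hyperparameters,
        "metrics": metrics,
    }


rows = [{"x": 1.0, "y": 0}, {"x": 2.0, "y": 1}]
record = build_run_record(
    code_version="abc1234",  # hypothetical git commit
    rows=rows,
    hyperparameters={"learning_rate": 0.1, "max_depth": 4},
    metrics={"val_auc": 0.81},  # illustrative value, not a real result
)
print(json.dumps(record, indent=2))
```

Hashing a canonical encoding means the fingerprint changes whenever the data changes but not when dictionary key order does, so two runs with the same hash can be compared knowing they saw the same rows.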
Phase 5: Offline evaluation
Evaluate on a held-out test set drawn from a later time period than the training window (temporal split, not random split). Measure disaggregated performance across demographic or segment slices. Run fairness audits where applicable.
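The two ideas in this phase, a temporal split and disaggregated metrics, can be sketched together. The records, the `segment` field, and the helper names below are made up for illustration; a real evaluation would use the project's own metric and slice definitions.

```python
from collections import defaultdict


def temporal_split(records, cutoff):
    """Train on records strictly before `cutoff`; test on records at or after it."""
    train = [r for r in records if r["ts"] < cutoff]
    test = [r for r in records if r["ts"] >= cutoff]
    return train, test


def accuracy_by_slice(records, slice_key):
    """Accuracy per slice value, so a weak segment is not hidden by the average."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r[slice_key]] += 1
        correct[r[slice_key]] += int(r["label"] == r["pred"])
    return {k: correct[k] / total[k] for k in total}


records = [
    {"ts": 1, "segment": "new", "label": 1, "pred": 1},
    {"ts": 2, "segment": "returning", "label": 0, "pred": 0},
    {"ts": 3, "segment": "new", "label": 1, "pred": 0},
    {"ts": 4, "segment": "returning", "label": 1, "pred": 1},
    {"ts": 5, "segment": "new", "label": 0, "pred": 1},
    {"ts": 6, "segment": "returning", "label": 0, "pred": 0},
]
train, test = temporal_split(records, cutoff=3)
slice_accuracy = accuracy_by_slice(test, "segment")
print(slice_accuracy)
```

On this toy test window the overall accuracy is 0.5, yet the "returning" slice is entirely correct and the "new" slice is entirely wrong, which is exactly the kind of gap an aggregate number hides.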
Phase 6: Deployment
Package the model (container, ONNX, or framework-native format). Shadow deploy or canary the new model; gate on latency SLAs and business KPIs. Register the promoted model in a model registry with all metadata.
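A promotion gate for a canary can be expressed as a small pure function. This is a sketch under stated assumptions: the gate checks a nearest-rank p95 latency against an SLA and allows a bounded relative drop in one business KPI. The function names, thresholds, and sample numbers are hypothetical.

```python
def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # ceil(0.95 * n)
    return ordered[rank - 1]


def canary_gate(canary_latencies_ms, canary_kpi, baseline_kpi,
                latency_sla_ms, max_kpi_drop):
    """Promote only if the canary meets the latency SLA and the KPI has not regressed too far."""
    latency_ok = p95(canary_latencies_ms) <= latency_sla_ms
    kpi_ok = canary_kpi >= baseline_kpi * (1 - max_kpi_drop)
    return {"latency_ok": latency_ok, "kpi_ok": kpi_ok,
            "promote": latency_ok and kpi_ok}


latencies = [40, 42, 45, 47, 50, 52, 55, 58, 60, 180]  # ms, illustrative
decision = canary_gate(latencies, canary_kpi=0.118, baseline_kpi=0.120,
                       latency_sla_ms=100, max_kpi_drop=0.02)
print(decision)
```

Here the KPI is within tolerance but a single slow request pushes the p95 over the SLA, so the gate blocks promotion. With real traffic volumes the KPI comparison would also need a significance test, which this sketch omits.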
Phase 7: Production monitoring
Monitor data quality, prediction drift, operational metrics, and business KPIs continuously. Set alert thresholds. Maintain an inference log for retrospective evaluation when labels arrive.
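One common way to put a number on prediction or feature drift is the population stability index (PSI), which compares the binned distribution of a live sample against a reference sample. The sketch below is one possible implementation, not the only drift metric in use; the bin edges, the sample scores, and the 0.25 alert threshold are assumptions chosen for illustration and should be tuned per feature.

```python
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population stability index between a reference sample and a live sample."""
    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            idx = sum(v >= edge for edge in bin_edges)  # bucket index
            counts[idx] += 1
        # Floor empty buckets at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


edges = [0.25, 0.5, 0.75]
reference_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
live_scores = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9]

drift_score = psi(reference_scores, live_scores, edges)
ALERT_THRESHOLD = 0.25  # illustrative; tune per feature and traffic volume
if drift_score > ALERT_THRESHOLD:
    print(f"drift alert: PSI={drift_score:.2f}")
```

Because PSI needs no labels, it can raise an alert well before ground truth arrives; the inference log then lets you confirm later whether the drift actually hurt accuracy.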
Phase 8: Retraining and retirement
Retrain on a defined trigger (scheduled or event-driven). Run the same offline evaluation and deployment gate before promotion. Retire the model when a replacement is fully promoted, or when the problem itself no longer exists, and document the decision in either case.
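The two trigger types can be combined in one decision function. This is a minimal sketch, assuming a drift score is already produced by monitoring and that model age is the only scheduled criterion; `should_retrain` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta


def should_retrain(last_trained, now, max_age, drift_score, drift_threshold):
    """Return (decision, reason) for a scheduled or event-driven trigger."""
    if drift_score > drift_threshold:
        return True, "event: drift threshold exceeded"
    if now - last_trained >= max_age:
        return True, "schedule: model older than max_age"
    return False, "no trigger"


now = datetime(2024, 3, 1)
decision, reason = should_retrain(
    last_trained=datetime(2024, 2, 20),
    now=now,
    max_age=timedelta(days=30),
    drift_score=0.31,      # illustrative value from monitoring
    drift_threshold=0.25,  # illustrative threshold
)
print(decision, reason)
```

Returning the reason alongside the decision gives the retraining run an audit trail, and the retrained candidate still has to pass the same evaluation and deployment gates as any other model.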