Machine Learning | Hard | Asked at: Google, Meta, Airbnb, Uber, Netflix, Stripe

Your model performs well offline but degrades in production. How do you diagnose and fix it?

For: ML Engineer, Data Scientist, AI / LLM Engineer

The short answer

The most common cause is training-serving skew: the features the model sees at serving time do not match the features it was trained on. To fix it, instrument the pipeline to log serving inputs, compare their distribution with the training data, and determine whether the gap comes from data drift, feature engineering bugs, label leakage, or infrastructure inconsistencies.

How to think about it

This is one of the most practical ML questions you will face in a senior interview. Interviewers want a structured diagnostic framework, not a list of buzzwords.

Step 1: confirm the gap is real

Check whether the offline and online metrics measure the same thing. Offline evaluation on a held-out split uses explicit labels collected in batch; online evaluation often relies on implicit signals (clicks, conversions, churn). If the metrics themselves differ, the "gap" may not be a model problem at all.
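
One way to make the comparison like-for-like is to recompute the offline metric on logged production predictions once their labels arrive. The sketch below assumes a binary classifier scored with log loss; the `log_loss` helper and the sample values are illustrative, not taken from any real system.

```python
import math

def log_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy; lower is better."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# Hypothetical data: a held-out split, and production logs joined to delayed labels.
offline_labels, offline_probs = [1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]
online_labels, online_probs = [1, 0, 1, 0], [0.6, 0.4, 0.3, 0.2]

offline_loss = log_loss(offline_labels, offline_probs)
online_loss = log_loss(online_labels, online_probs)
gap = online_loss - offline_loss  # positive: production is worse on the same metric
```

If the gap disappears when both sides use the same metric and label definition, the problem is in the measurement rather than the model.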

Step 2: log and compare feature distributions

Capture the actual feature vectors being sent to the model at serving time. Compare their statistical distribution (mean, std, percentiles, null rates) against the training set.
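
A minimal sketch of that comparison for a single feature, using only the standard library. The `summarize` and `compare` helpers and the sample columns are hypothetical; a real pipeline would run this per feature over logged serving data, and would also need to handle columns that are entirely null.

```python
import math
import statistics

def summarize(values):
    """Null rate plus mean, std and median of the non-null values."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    return {
        "null_rate": 1 - len(present) / len(values),
        "mean": statistics.fmean(present),
        "std": statistics.pstdev(present),
        "p50": statistics.median(present),
    }

def compare(train_values, serving_values):
    """Per-statistic difference (serving minus training) for one feature."""
    train, serving = summarize(train_values), summarize(serving_values)
    return {name: serving[name] - train[name] for name in train}

# Hypothetical feature column: serving has a missing value and a shifted mean.
train_col = [10.0, 12.0, 11.0, 13.0]
serving_col = [20.0, None, 22.0, 21.0]
diff = compare(train_col, serving_col)
```

Here `diff` shows a jump in both null rate and mean, which is the kind of signal that points to the root-cause analysis in the next step.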

Tools: EvidentlyAI, whylogs, or a simple per-feature check with KL divergence or the Population Stability Index (PSI). A common rule of thumb treats PSI above 0.2 for a feature as significant drift.

Step 3: identify the root cause

Training-serving skew — The most common culprit. A feature is computed differently in the training pipeline versus the serving pipeline. Classic example: average transaction value in training is computed over the full history; at serving time it is computed only over the last 30 days. Fix by unifying the feature computation code (feature stores solve this structurally).
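
The transaction example can be sketched as one feature function that both pipelines call with the same window. The name `avg_transaction_value`, its signature, and the sample history are assumptions made for illustration; a feature store would play this role in practice.

```python
from datetime import datetime, timedelta

def avg_transaction_value(transactions, as_of, window_days=30):
    """Mean amount of the transactions in the window_days before as_of.

    transactions: iterable of (timestamp, amount) pairs.
    """
    start = as_of - timedelta(days=window_days)
    amounts = [amt for ts, amt in transactions if start <= ts < as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0

history = [
    (datetime(2024, 1, 1), 100.0),
    (datetime(2024, 3, 1), 20.0),
    (datetime(2024, 3, 10), 40.0),
]
as_of = datetime(2024, 3, 15)

# Training backfill and serving both call the same function with the same window.
training_value = avg_transaction_value(history, as_of)
serving_value = avg_transaction_value(history, as_of)

# The skewed variant from the example: full history instead of the last 30 days.
full_history_value = sum(amt for _, amt in history) / len(history)
```

With a single implementation the two values agree by construction, while the full-history variant gives a visibly different number for the same user.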

Data drift — The world changed. User behaviour shifted, a new product launched, a seasonal pattern appeared. Fix by retraining on recent data or adding recency weighting.
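
Recency weighting can be as simple as an exponential decay on row age. The sketch below assumes a 30-day half-life, which is an arbitrary illustrative choice; `recency_weights` is a hypothetical helper.

```python
def recency_weights(ages_days, half_life_days=30.0):
    """Exponential-decay sample weights: a row half_life_days old counts half."""
    return [0.5 ** (age / half_life_days) for age in ages_days]

# Rows that are 0, 30, 60 and 90 days old.
weights = recency_weights([0, 30, 60, 90])
```

Weights like these are typically passed to the training routine as per-row sample weights, where the library supports them.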

Label leakage — A feature used in training incorporates future information. Offline metrics look inflated; production metrics reveal the true, lower performance. Fix by auditing feature timestamps relative to the label event.
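
A timestamp audit of this kind might look like the following sketch. It assumes each feature value carries the time at which it became known; the feature names and timestamps are made up for illustration.

```python
from datetime import datetime

def leaky_features(feature_times, label_time):
    """Names of features whose values became known at or after the label event."""
    return sorted(name for name, ts in feature_times.items() if ts >= label_time)

label_time = datetime(2024, 5, 1, 12, 0)
feature_times = {
    "avg_basket_7d": datetime(2024, 5, 1, 9, 0),
    "refund_issued": datetime(2024, 5, 3, 8, 0),  # known only after the label event
}
flagged = leaky_features(feature_times, label_time)
```

Any flagged feature should be dropped or recomputed as of a point strictly before the label event.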

Feedback loops — Model predictions influence future training data (e.g., a ranking model trained on clicks that it itself generated). Fix with counterfactual logging or randomisation.
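
One common form of randomisation is epsilon-greedy selection with the display probability (propensity) logged next to each impression, so later training can reweight by its inverse. This is a minimal sketch under that assumption; `propensities` and `choose_and_log` are hypothetical names.

```python
import random

def propensities(scores, epsilon=0.1):
    """Probability that each item is shown under epsilon-greedy selection."""
    n = len(scores)
    greedy = max(range(n), key=lambda i: scores[i])
    return [epsilon / n + (1.0 - epsilon) * (i == greedy) for i in range(n)]

def choose_and_log(scores, epsilon=0.1, rng=random):
    """Sample one item and return a log record that includes its propensity."""
    probs = propensities(scores, epsilon)
    chosen = rng.choices(range(len(scores)), weights=probs, k=1)[0]
    return {"item": chosen, "propensity": probs[chosen]}

scores = [0.2, 0.9, 0.5]  # model scores for three candidate items
record = choose_and_log(scores, epsilon=0.1, rng=random.Random(0))
```

Because every item has a non-zero propensity, clicks on items the model would not have chosen still reach the training data.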

Infrastructure bugs — Feature type mismatch (int vs float), unexpected nulls, different null-filling logic, wrong model version loaded. Check model version logs and add input validation (pydantic schemas, Great Expectations).
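
Input validation can be sketched without any dependency. The example below is a standard-library stand-in for what a pydantic schema or a Great Expectations suite would enforce; the schema contents and the `validate_features` helper are hypothetical.

```python
import math

# Hypothetical schema: feature name -> exact Python type expected at serving time.
EXPECTED_SCHEMA = {"age": int, "avg_transaction_value": float}

def validate_features(row, schema=EXPECTED_SCHEMA):
    """Return a list of problems; an empty list means the row is valid."""
    problems = []
    for name, expected_type in schema.items():
        value = row.get(name)
        if value is None:
            problems.append(f"{name}: missing or null")
        elif type(value) is not expected_type:
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
        elif isinstance(value, float) and math.isnan(value):
            problems.append(f"{name}: NaN")
    for name in sorted(row.keys() - schema.keys()):
        problems.append(f"{name}: unexpected feature")
    return problems

ok = validate_features({"age": 31, "avg_transaction_value": 42.5})
bad = validate_features({"age": 31.0, "avg_transaction_value": None})
```

Rejecting or logging rows with a non-empty problem list catches int-versus-float mismatches and unexpected nulls before they reach the model.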

Step 4: monitor continuously

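# Minimal drift detection with PSI (Population Stability Index)
import numpy as np

def psi(expected, actual, buckets=10):
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Drop nulls here; track null rates as a separate signal.
    expected, actual = expected[~np.isnan(expected)], actual[~np.isnan(actual)]
    # Bucket edges are quantiles of the training data. Duplicate edges (from tied
    # values) are merged, and the outer buckets are open-ended so serving values
    # outside the training range are still counted.
    inner = np.percentile(expected, np.linspace(0, 100, buckets + 1)[1:-1])
    edges = np.concatenate(([-np.inf], np.unique(inner), [np.inf]))
    def bucket(x):
        counts, _ = np.histogram(x, bins=edges)
        return (counts + 1e-6) / len(x)   # avoid log(0)
    e, a = bucket(expected), bucket(actual)
    return float(np.sum((a - e) * np.log(a / e)))

# Assumes train_df and serving_df are DataFrames sharing the columns in feature_names.
for feature in feature_names:
    score = psi(train_df[feature], serving_df[feature])
    if score > 0.2:
        print(f"Drift alert: {feature}  PSI={score:.3f}")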
