How do you choose between batch and real-time inference for a model?
Decide based on how fresh the prediction must be versus the cost and complexity of serving live. Use batch when results are needed every few hours or days, like daily churn lists, because it is cheap, simple, and can use spot or scheduled compute. Use real-time when a late or stale decision causes immediate loss, like fraud or ad auctions needing sub-100ms responses, accepting higher cost and complexity. Most production systems are hybrid: precompute heavy signals offline and do lightweight re-ranking online.
How to think about it
The short answer
It comes down to one trade-off: the marginal value of freshness versus the marginal cost and complexity of serving live. If a prediction can be hours or days old, use batch. If a late or stale decision causes immediate loss, use real-time. In practice, most large systems are hybrid.
When to use batch
Batch (offline) inference scores data in bulk on a schedule, typically as hourly or daily jobs. It is cheap and simple: you can run on spot instances or scheduled compute, and there is no always-on, latency-critical service to operate. Choose it when acceptable freshness is measured in hours or days, e.g., a daily or weekly churn-propensity list or daily recommendation candidate generation.
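To make the batch pattern concrete, here is a minimal sketch of a scheduled scoring job. The names (`run_batch_job`, `churn_score`, the `days_inactive` field) are hypothetical, and the "model" is a stand-in heuristic rather than a trained model. In practice the output would be written to a table or key-value store instead of being held in memory.

```python
from typing import Callable, Dict, Iterator, List


def chunked(rows: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield fixed-size chunks so memory use stays bounded on large inputs."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def churn_score(row: dict) -> float:
    """Hypothetical stand-in model: longer inactivity means higher churn risk."""
    return min(1.0, row["days_inactive"] / 30.0)


def run_batch_job(
    rows: List[dict],
    model: Callable[[dict], float],
    chunk_size: int = 2,
) -> Dict[str, float]:
    """Score every row and return a lookup table keyed by user_id."""
    predictions: Dict[str, float] = {}
    for chunk in chunked(rows, chunk_size):
        for row in chunk:
            predictions[row["user_id"]] = model(row)
    return predictions


users = [
    {"user_id": "u1", "days_inactive": 3},
    {"user_id": "u2", "days_inactive": 15},
    {"user_id": "u3", "days_inactive": 45},
]
scores = run_batch_job(users, churn_score)
```

Nothing here is latency-sensitive: the job can be retried, run on interruptible compute, and finish any time before the results are needed.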
When to use real-time
Real-time (online) inference responds on demand, typically within milliseconds. It optimizes for tail latency and freshness, but always-on capacity makes each prediction far more expensive (roughly 10–100x is a common rule of thumb, not a measured constant), and it is operationally more complex. Choose it when per-interaction value is high and a wrong or late decision costs you immediately: fraud gating that must decide in ~50 ms, or ad auctions with a 100 ms budget.
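Because online serving is judged on tail latency rather than the average, a quick check against the budget can look like the sketch below. It uses a nearest-rank percentile. The helper names and the sample latencies are made up for illustration, and the 50 ms budget mirrors the fraud example above.

```python
import math
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def within_budget(latencies_ms: List[float], budget_ms: float, pct: float = 99.0) -> bool:
    """True if the chosen percentile of observed latency fits inside the budget."""
    return percentile(latencies_ms, pct) <= budget_ms


# Illustrative data: 98 fast requests, one slower request, one very slow outlier.
latencies_ms = [12.0] * 98 + [40.0, 180.0]
mean_ms = sum(latencies_ms) / len(latencies_ms)

p99_ok = within_budget(latencies_ms, budget_ms=50.0)             # p99 is 40 ms
max_ok = within_budget(latencies_ms, budget_ms=50.0, pct=100.0)  # worst case is 180 ms
```

The mean here is under 14 ms, which hides the 180 ms outlier. That is why the budget is stated against a high percentile rather than the average.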
The hybrid pattern (most common)
Most production ML systems precompute the heavy, slow-changing signals offline (embeddings, candidate sets) and do lightweight contextualization or re-ranking online. YouTube recommendations and LinkedIn feed ranking are commonly cited examples of this split: an earlier stage generates candidates, and an online re-ranker adds real-time context within the latency budget.
Concrete example
A recommender precomputes per-user candidate items nightly (batch), then at request time a small online model re-ranks them using the current session — cheap where it can be, fresh where it must be.
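The example above can be sketched in a few lines. Everything here is hypothetical: `NIGHTLY_CANDIDATES` stands in for the batch job's output, `ITEM_CATEGORY` for item metadata, and the "online model" is just a fixed boost for items that match the categories seen in the current session.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical output of the nightly batch job: user_id -> [(item_id, offline_score)].
NIGHTLY_CANDIDATES: Dict[str, List[Tuple[str, float]]] = {
    "u1": [("hiking_boots", 0.9), ("tent", 0.8), ("laptop", 0.7)],
}

# Hypothetical item metadata available at request time.
ITEM_CATEGORY: Dict[str, str] = {
    "hiking_boots": "outdoor",
    "tent": "outdoor",
    "laptop": "electronics",
}


def rerank(
    user_id: str,
    session_categories: Set[str],
    boost: float = 0.5,
    top_k: int = 2,
) -> List[str]:
    """Re-rank precomputed candidates using only cheap, request-time context."""
    candidates = NIGHTLY_CANDIDATES.get(user_id, [])
    scored = []
    for item_id, offline_score in candidates:
        # Boost items whose category matches what the user is browsing right now.
        bonus = boost if ITEM_CATEGORY.get(item_id) in session_categories else 0.0
        scored.append((offline_score + bonus, item_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:top_k]]


no_context = rerank("u1", set())
browsing_electronics = rerank("u1", {"electronics"})
```

The online path does a dictionary lookup and a sort over a handful of items, so it stays well inside a tight latency budget. The expensive work of choosing candidates happened offline.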
Common follow-up / trap
The big trap is training-serving skew: if your batch training features are computed differently from your online serving features, the model degrades in production. The mitigation is a unified pipeline / feature store so the same feature logic runs offline and online. Interviewers love to follow “batch or real-time?” with “…and how do you keep features consistent across them?”
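One way to picture the mitigation: define each feature once, call that definition from both paths, and run a parity check that compares offline and online values on the same rows. The sketch below is an assumption-laden illustration, not a feature-store API. The function names are hypothetical, and the "divergent" version simulates an online reimplementation that rounds up instead of down.

```python
import math
from datetime import datetime, timezone
from typing import Callable, List

FeatureFn = Callable[[datetime, datetime], int]


def days_since_last_login(last_login: datetime, as_of: datetime) -> int:
    """Shared definition: whole days elapsed, rounded down."""
    return (as_of - last_login).days


def divergent_days_since_last_login(last_login: datetime, as_of: datetime) -> int:
    """Hypothetical online reimplementation that rounds up, which causes skew."""
    return math.ceil((as_of - last_login).total_seconds() / 86400)


def parity_mismatches(
    rows: List[dict],
    offline_fn: FeatureFn,
    online_fn: FeatureFn,
    as_of: datetime,
) -> List[str]:
    """Return user_ids whose offline and online feature values disagree."""
    return [
        row["user_id"]
        for row in rows
        if offline_fn(row["last_login"], as_of) != online_fn(row["last_login"], as_of)
    ]


AS_OF = datetime(2024, 1, 11, 12, 0, tzinfo=timezone.utc)
ROWS = [
    # Exactly 10 days before AS_OF: both definitions agree.
    {"user_id": "u1", "last_login": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)},
    # 10.5 days before AS_OF: rounding down gives 10, rounding up gives 11.
    {"user_id": "u2", "last_login": datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)},
]

shared = parity_mismatches(ROWS, days_since_last_login, days_since_last_login, AS_OF)
skewed = parity_mismatches(ROWS, days_since_last_login, divergent_days_since_last_login, AS_OF)
```

With the shared function there is nothing to reconcile. With two implementations, the check flags `u2`, the kind of silent off-by-one that degrades a model in production without raising any error.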