How do you choose between batch and real-time inference for a model?

The short answer

Decide based on how fresh the prediction must be versus the cost and complexity of serving live. Use batch when results are needed every few hours or days, like daily churn lists, because it is cheap, simple, and can use spot or scheduled compute. Use real-time when a late or stale decision causes immediate loss, like fraud or ad auctions needing sub-100ms responses, accepting higher cost and complexity. Most production systems are hybrid: precompute heavy signals offline and do lightweight re-ranking online.

How to think about it

It comes down to one trade-off: the marginal value of freshness versus the marginal cost and complexity of serving live. If a prediction can be hours or days old, use batch. If a late or stale decision causes immediate loss, use real-time. In practice, most large systems are hybrid.
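The trade-off above can be written as a small decision rule. This is only a sketch: the `UseCase` fields, the `choose_serving_mode` name, and the daily batch interval are assumptions made for illustration, not an established formula.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    max_staleness_s: float    # how old a prediction may be before it loses its value
    inputs_known_ahead: bool  # can the inputs be enumerated before the request arrives?


def choose_serving_mode(uc: UseCase, batch_interval_s: float = 24 * 3600) -> str:
    """Return 'batch', 'hybrid', or 'real-time' (illustrative heuristic only)."""
    # Predictions may be as old as one batch cycle and can be precomputed: batch.
    if uc.inputs_known_ahead and uc.max_staleness_s >= batch_interval_s:
        return "batch"
    # Inputs can be precomputed but must be refreshed at request time: hybrid.
    if uc.inputs_known_ahead:
        return "hybrid"
    # Inputs only exist at request time (e.g. a new transaction): real-time.
    return "real-time"


churn = UseCase("weekly churn list", max_staleness_s=7 * 24 * 3600, inputs_known_ahead=True)
feed = UseCase("feed ranking", max_staleness_s=60, inputs_known_ahead=True)
fraud = UseCase("fraud gating", max_staleness_s=0.05, inputs_known_ahead=False)

for uc in (churn, feed, fraud):
    print(uc.name, "->", choose_serving_mode(uc))
```

In a real decision the cost side (always-on capacity, on-call burden) would also be weighed; the rule here only encodes the freshness side.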

When to use batch

Batch (offline) inference runs predictions over bulk data on a schedule, for example as hourly or daily jobs. It is cheap and simple: you can use spot instances and scheduled compute, and there is no always-on, latency-critical service to operate. Choose it when acceptable freshness is measured in hours or days, for example a weekly churn-propensity list or daily candidate generation for recommendations.
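A scheduled batch job of this kind might look roughly like the sketch below. The `churn_score` function is a hypothetical stand-in for a trained model, and the column names and chunk size are assumptions; in practice a scheduler such as cron would trigger `run_batch_job`.

```python
import csv
import io
from datetime import datetime, timezone
from itertools import islice
from typing import Iterable, Iterator, TextIO


def churn_score(row: dict) -> float:
    """Hypothetical stand-in model: longer inactivity gives a higher score."""
    return min(1.0, row["days_since_login"] / 30.0)


def chunks(rows: Iterable[dict], size: int) -> Iterator[list]:
    """Yield fixed-size lists so a large table never has to fit in memory at once."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_batch_job(rows: Iterable[dict], out: TextIO, chunk_size: int = 1000) -> int:
    """Score every row and write the results; returns the number of rows scored."""
    writer = csv.writer(out)
    writer.writerow(["user_id", "score", "scored_at"])
    # One timestamp per run, so consumers can tell how stale the scores are.
    scored_at = datetime.now(timezone.utc).isoformat()
    n = 0
    for chunk in chunks(rows, chunk_size):
        for row in chunk:
            writer.writerow([row["user_id"], f"{churn_score(row):.3f}", scored_at])
            n += 1
    return n


rows = [
    {"user_id": "u1", "days_since_login": 15},
    {"user_id": "u2", "days_since_login": 60},
]
buf = io.StringIO()
count = run_batch_job(rows, buf, chunk_size=1)
print(buf.getvalue())
```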

When to use real-time

Real-time (online) inference responds on demand, typically within milliseconds. It optimizes for tail latency and freshness, but it is operationally more complex and can cost far more per prediction (on the order of 10–100x, depending on traffic and utilization) because capacity must stay provisioned at all times. Choose it when per-interaction value is high and a wrong or late decision costs you immediately, for example fraud gating that must decide in ~50ms, or ad auctions with a 100ms budget.
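Because the budget applies to tail latency rather than the average, it helps to check a high percentile against it. The following sketch uses a nearest-rank percentile on made-up latency samples; the function names and the sample values are assumptions for illustration.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(latencies_ms: Sequence[float], budget_ms: float, p: float = 99.0) -> bool:
    """True if the p-th percentile latency fits inside the budget."""
    return percentile(latencies_ms, p) <= budget_ms


# Made-up samples: most requests are fast, a few are very slow.
latencies_ms = [10] * 97 + [40, 120, 130]

print("mean:", mean(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
print("meets 50ms budget at p99:", within_budget(latencies_ms, 50))
```

Here the mean looks comfortable while the p99 breaks a 50ms budget, which is why online services are usually judged on the tail.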

The hybrid pattern (most common)

Most production ML precomputes the heavy, slow-changing signals offline (embeddings, candidate sets) and does lightweight contextualization or re-ranking online. YouTube and LinkedIn feed ranking work this way: batch generates candidates; an online re-ranker adds real-time context within the latency budget.

Concrete example

A recommender precomputes per-user candidate items nightly (batch), then at request time a small online model re-ranks them using the current session — cheap where it can be, fresh where it must be.
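A minimal sketch of that recommender, assuming the nightly job has already written a per-user candidate table. `NIGHTLY_CANDIDATES`, the category-based session boost, and the `rerank` signature are all hypothetical; a real online re-ranker would typically be a small learned model.

```python
from typing import Dict, List, Set, Tuple

# Output of the nightly batch job (hypothetical data):
# user_id -> list of (item_id, category, offline_score)
NIGHTLY_CANDIDATES: Dict[str, List[Tuple[str, str, float]]] = {
    "u1": [("a", "news", 0.9), ("b", "sports", 0.8), ("c", "music", 0.5)],
}


def rerank(user_id: str, session_categories: Set[str], k: int = 2, boost: float = 0.5) -> List[str]:
    """Re-rank precomputed candidates using what the user did in the current session."""
    candidates = NIGHTLY_CANDIDATES.get(user_id, [])

    def online_score(candidate: Tuple[str, str, float]) -> float:
        _, category, offline_score = candidate
        # Cheap online adjustment: boost categories seen in this session.
        return offline_score + (boost if category in session_categories else 0.0)

    ranked = sorted(candidates, key=online_score, reverse=True)
    return [item_id for item_id, _, _ in ranked[:k]]


print(rerank("u1", set()))       # no session context: offline order wins
print(rerank("u1", {"music"}))   # session context promotes the music item
```

The expensive work (candidate generation) happens once per night, while the request path only does a dictionary lookup and a sort over a short list.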

Common follow-up / trap

The big trap is training-serving skew: if your batch training features are computed differently from your online serving features, the model degrades in production. The mitigation is a unified pipeline / feature store so the same feature logic runs offline and online. Interviewers love to follow “batch or real-time?” with “…and how do you keep features consistent across them?”
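One way to picture the mitigation is a single feature function imported by both the batch job and the online service, plus a check that compares the two paths. This is a sketch under stated assumptions: the feature names, `build_features`, and `skew_report` are invented for illustration and do not correspond to any particular feature-store API.

```python
from typing import Dict, List


def build_features(raw: Dict[str, float], now_ts: float) -> Dict[str, float]:
    """Single feature definition shared by offline training and online serving."""
    return {
        "days_since_login": max(0.0, (now_ts - raw["last_login_ts"]) / 86400.0),
        "orders_per_week": raw["orders_30d"] * 7.0 / 30.0,
    }


def skew_report(offline: Dict[str, float], online: Dict[str, float], tol: float = 1e-6) -> List[str]:
    """Return the names of features that are missing on one side or differ by more than tol."""
    skewed = []
    for name in sorted(offline.keys() | online.keys()):
        if name not in offline or name not in online:
            skewed.append(name)
        elif abs(offline[name] - online[name]) > tol:
            skewed.append(name)
    return skewed


raw = {"last_login_ts": 0.0, "orders_30d": 30.0}
now_ts = 3 * 86400.0

offline_features = build_features(raw, now_ts)
online_features = build_features(raw, now_ts)
print(skew_report(offline_features, online_features))   # same code path: no skew

# Simulate a separately written online path that forgot the weekly normalization.
drifted = dict(online_features, orders_per_week=raw["orders_30d"])
print(skew_report(offline_features, drifted))
```

Running a check like this on a sample of logged online requests is one way to catch skew before it shows up as degraded model quality.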
