Batch vs real-time inference
Most predictions don't need to be real-time. This lesson covers how to choose a serving pattern (batch, real-time, async, or streaming) by latency SLA and volume, and the hybrid that precomputes in batch and then serves live.
What you'll learn
- The serving patterns — batch, real-time, async, streaming
- Choosing by latency SLA and request volume (and cost)
- The hybrid precompute pattern that gets the best of both
Before you start
When people picture “serving a model,” they imagine a live API answering each request in milliseconds. But most predictions in production are not made that way — and choosing the wrong pattern either wastes a fortune or misses your latency target. The decision is driven by two questions: how fresh must the prediction be, and how many do you make?
The four patterns
- Batch: score a whole table on a schedule (nightly, hourly) and write the results to a database the app reads. It is the cheapest option, and the right one whenever predictions don’t need to reflect this instant: churn scores, lead rankings, recommendation precomputes, risk models. As a rough rule of thumb, around 90% of use cases can be batch.
- Real-time (online): a live API scores each request in milliseconds. It is necessary when the input exists only at request time and the answer is needed immediately: fraud checks on a transaction, search ranking, dynamic pricing. It is the most expensive and operationally demanding pattern.
- Async: the request is queued and the result is returned seconds to minutes later, for example when the model is slow or a whole document must be processed. This decouples a heavy job from the user’s request.
- Streaming: continuous scoring over an event stream (Kafka/Flink) for always-on use cases such as real-time anomaly detection on telemetry.
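The two driving questions (how fresh, and whether the input exists ahead of time) can be turned into a rough decision helper. This is a minimal sketch, not a rule from the lesson: the name `choose_pattern`, its parameters, and both numeric thresholds are illustrative assumptions you would tune to your own SLA and cost constraints.

```python
from enum import Enum


class Pattern(Enum):
    BATCH = "batch"
    REAL_TIME = "real-time"
    ASYNC = "async"
    STREAMING = "streaming"


# Assumed thresholds, chosen only for illustration.
BATCH_MIN_BUDGET_S = 3600.0    # an hourly schedule is fresh enough
REAL_TIME_MAX_BUDGET_S = 1.0   # the caller cannot wait longer than this


def choose_pattern(
    freshness_budget_s: float,
    input_known_ahead: bool,
    continuous_events: bool = False,
) -> Pattern:
    """Map freshness needs to a serving pattern (hypothetical heuristic).

    freshness_budget_s: how old or how delayed a prediction may be, in seconds.
    input_known_ahead:  True if the model inputs exist before any request.
    continuous_events:  True if scoring is driven by an always-on event stream.
    """
    if continuous_events:
        return Pattern.STREAMING
    if input_known_ahead and freshness_budget_s >= BATCH_MIN_BUDGET_S:
        return Pattern.BATCH
    if freshness_budget_s <= REAL_TIME_MAX_BUDGET_S:
        return Pattern.REAL_TIME
    # Input arrives with the request, but the caller can wait a while.
    return Pattern.ASYNC
```

Under these assumed thresholds, a nightly churn score (`choose_pattern(86400, True)`) maps to batch, a fraud check (`choose_pattern(0.1, False)`) maps to real-time, and a two-minute document job (`choose_pattern(120, False)`) maps to async.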
The hybrid that wins
You don’t always have to choose. A common and powerful pattern is to precompute expensive features (or even whole predictions) in batch, store them in a fast key-value store such as Redis, and serve them live through a thin real-time layer. This is what a feature store does for the online path: lookups return at real-time latency, while the values are only as fresh as the last batch run.
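The hybrid can be sketched in a few lines. The example below is an illustration under stated assumptions: a plain Python dict stands in for the key-value store so the code runs on its own, and `toy_churn_score`, `batch_precompute`, and `serve` are hypothetical names, with a hand-written formula in place of a trained model.

```python
from typing import Dict, Iterable, Tuple

# Stand-in for a fast key-value store such as Redis (assumption: a dict keeps
# the sketch self-contained; a real deployment would use a store client).
ScoreStore = Dict[str, float]


def toy_churn_score(days_since_login: float, monthly_spend: float) -> float:
    """Hypothetical model: a fixed formula standing in for a trained one."""
    raw = 0.02 * days_since_login - 0.001 * monthly_spend
    return min(1.0, max(0.0, raw))  # clamp to [0, 1]


def batch_precompute(
    rows: Iterable[Tuple[str, float, float]], store: ScoreStore
) -> None:
    """Scheduled job: score every known user and write results to the store."""
    for user_id, days_since_login, monthly_spend in rows:
        store[user_id] = toy_churn_score(days_since_login, monthly_spend)


def serve(user_id: str, store: ScoreStore, default: float = 0.5) -> float:
    """Thin real-time layer: one key lookup, with a fallback for unseen users."""
    return store.get(user_id, default)
```

A nightly job would call `batch_precompute(rows, store)`, and the request path would only call `serve(user_id, store)`. The fallback value matters in practice: users created after the last batch run have no stored score, so the serving layer needs a default or a slower live path for them.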
Next
That completes the MLOps serving picture. Pair this with feature stores for the online path and FinOps for the cost side.
Practice this in an interview
Decide based on how fresh the prediction must be versus the cost and complexity of serving live. Use batch when results are needed every few hours or days, like daily churn lists, because it is cheap, simple, and can use spot or scheduled compute. Use real-time when a late or stale decision causes immediate loss, like fraud or ad auctions needing sub-100ms responses, accepting higher cost and complexity. Most production systems are hybrid: precompute heavy signals offline and do lightweight re-ranking online.
Batch inference runs predictions on large datasets on a schedule, optimizing for throughput. Online inference serves individual requests in real time, optimizing for low latency. Streaming inference processes continuous event streams with bounded latency requirements between the two extremes.
Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.
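To make late arrivals and watermarks concrete, here is a minimal sketch of a tumbling-window counter. It is not how any particular engine implements them: the window size, the allowed lateness, and the name `windowed_counts` are assumptions, and the watermark is simply the largest event time seen so far minus the allowed lateness.

```python
from typing import Dict, Iterable, List, Tuple

WINDOW_S = 60            # tumbling window size in seconds (assumed)
ALLOWED_LATENESS_S = 30  # how long to wait for stragglers (assumed)


def windowed_counts(
    events: Iterable[Tuple[int, str]],
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, str]]]:
    """Count events per window; return (closed windows, dropped late events).

    `events` yields (event_time_s, payload) in arrival order, which may
    differ from event-time order. Windows still open when the input ends
    are not emitted, as in an unbounded stream.
    """
    open_windows: Dict[int, int] = {}
    closed: List[Tuple[int, int]] = []
    dropped: List[Tuple[int, str]] = []
    watermark = float("-inf")

    for event_time, payload in events:
        # Watermark: we assume nothing older than this will still arrive.
        watermark = max(watermark, event_time - ALLOWED_LATENESS_S)
        window_start = event_time - event_time % WINDOW_S

        if window_start + WINDOW_S <= watermark:
            # The event's window was already finalized: too late to count.
            dropped.append((event_time, payload))
        else:
            open_windows[window_start] = open_windows.get(window_start, 0) + 1

        # Finalize every open window that ends at or before the watermark.
        ready = sorted(s for s in open_windows if s + WINDOW_S <= watermark)
        for start in ready:
            closed.append((start, open_windows.pop(start)))

    return closed, dropped
```

With arrivals at event times 5, 50, 70, 55, 100, 20, the event at 55 arrives out of order but is still counted because the watermark has not passed its window, while the event at 20 arrives after its window closed and is dropped. A batch job over the same data would never face this choice, which is the operational simplicity the comparison above refers to.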
GPUs execute tensor operations efficiently only when the batch dimension is large enough to keep their compute units busy. Dynamic batching collects individual requests that arrive within a short window and fuses them into a single GPU call, which substantially improves throughput and cost efficiency while adding no more than the configured wait threshold to each request's queueing delay.
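The collection step of dynamic batching can be sketched with the standard library alone. This is an illustrative sketch, not the batcher of any specific serving framework: `collect_batch`, `max_batch_size`, `max_wait_s`, and `fake_model` are hypothetical names, and `fake_model` merely stands in for one fused GPU call.

```python
import queue
import time
from typing import List, Sequence


def collect_batch(
    requests: "queue.Queue[float]", max_batch_size: int, max_wait_s: float
) -> List[float]:
    """Block for the first request, then gather more until the batch is
    full or the wait window (started at the first request) has elapsed."""
    batch = [requests.get()]  # waits indefinitely for the first request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # window elapsed with no further requests
    return batch


def fake_model(batch: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a single batched forward pass."""
    return [2.0 * x for x in batch]
```

A serving loop would repeatedly call `collect_batch` and pass the result to the model. The two knobs trade off against each other: a larger `max_wait_s` fills batches more often but adds delay for the first request in each batch, so it is usually set well below the latency SLA.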