datarekha

Batch vs real-time inference

Most predictions don't need to be real-time. This lesson covers how to choose a serving pattern (batch, real-time, async, or streaming) by latency SLA and volume, and the hybrid that precomputes in batch and then serves live.

6 min read Intermediate MLOps Lesson 14 of 28

What you'll learn

  • The serving patterns — batch, real-time, async, streaming
  • Choosing by latency SLA and request volume (and cost)
  • The hybrid precompute pattern that gets the best of both

When people picture “serving a model,” they imagine a live API answering each request in milliseconds. But most predictions in production are not made that way — and choosing the wrong pattern either wastes a fortune or misses your latency target. The decision is driven by two questions: how fresh must the prediction be, and how many do you make?

The four patterns

[Figure: the four serving patterns arranged along two axes, latency need (high → low) and tolerable delay (hours → milliseconds). Batch: nightly, cheapest. Async: seconds to minutes, queued. Real-time: per request, milliseconds. Streaming: continuous, on events.]
Caption: Serving patterns by how fresh the prediction must be, from cheap nightly batch to per-request real-time.
  • Batch: score a whole table on a schedule (nightly, hourly) and write the results to a database the app reads. It is the cheapest option, and the right one whenever predictions don’t need to reflect this instant: churn scores, lead rankings, recommendation precomputes, risk models. As a rough rule of thumb, about 90% of use cases can be batch.
  • Real-time (online): a live API scores each request in milliseconds. It is necessary when the input only exists at request time and the answer is needed now: fraud checks on a transaction, search ranking, dynamic pricing. It is the most expensive and operationally demanding pattern.
  • Async: the request is queued and the result is returned seconds to minutes later, for example when the model is slow or a whole document has to be processed. This decouples a heavy job from the user’s request.
  • Streaming: continuous scoring over an event stream (Kafka/Flink) for always-on use cases such as real-time anomaly detection on telemetry.
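
The choice among the four patterns can be sketched as a small decision function. This is a minimal illustration, not a definitive rule: the `Workload` fields, the `choose_pattern` name, and the one-hour and one-second thresholds are all assumptions made for the example, and request volume and cost are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a prediction workload."""
    max_delay_seconds: float     # how stale or slow a prediction may be
    input_known_ahead: bool      # can inputs be scored before a request arrives?
    event_driven: bool = False   # is scoring triggered by a continuous event stream?


def choose_pattern(w: Workload) -> str:
    """Pick a serving pattern. The thresholds are illustrative assumptions."""
    if w.event_driven:
        return "streaming"
    # Inputs exist ahead of time and an hour or more of delay is fine: batch.
    if w.input_known_ahead and w.max_delay_seconds >= 3600:
        return "batch"
    # The caller can wait at least a second: queue the job.
    if w.max_delay_seconds >= 1:
        return "async"
    return "real-time"


examples = {
    "churn score": Workload(max_delay_seconds=86_400, input_known_ahead=True),
    "document processing": Workload(max_delay_seconds=120, input_known_ahead=False),
    "card fraud check": Workload(max_delay_seconds=0.1, input_known_ahead=False),
    "telemetry anomalies": Workload(
        max_delay_seconds=5, input_known_ahead=False, event_driven=True
    ),
}

for name, workload in examples.items():
    print(f"{name}: {choose_pattern(workload)}")
```

In practice the thresholds come from the product's latency SLA, and cost per prediction should be weighed alongside them.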

The hybrid that wins

You don’t always have to choose. A common, powerful pattern is to precompute expensive features (or even predictions) in batch, store them in a fast key-value store such as Redis, and serve them live through a thin real-time layer. This is what a feature store does for the online path: values are only as fresh as the last batch run, but they are served at real-time latency.
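
A minimal sketch of the precompute-then-serve idea follows. It assumes a plain dictionary as a stand-in for the key-value store so the example runs anywhere; the `score` function, the `churn:` key prefix, and the feature names are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, Optional

# Stand-in for a fast key-value store such as Redis (assumption: a dict
# keeps the sketch self-contained).
kv_store: Dict[str, float] = {}


def score(features: dict) -> float:
    """Placeholder model: a hypothetical weighted sum, not a real churn model."""
    return round(0.5 * features["days_inactive"] / 30 + 0.5 * features["tickets"] / 10, 3)


def nightly_batch_job(rows: Iterable[dict]) -> int:
    """Batch path: score every row and write the result under a per-user key."""
    written = 0
    for row in rows:
        kv_store[f"churn:{row['user_id']}"] = score(row)
        written += 1
    return written


def serve(user_id: str, default: Optional[float] = None) -> Optional[float]:
    """Online path: a key lookup only, so no model runs at request time."""
    return kv_store.get(f"churn:{user_id}", default)


table = [
    {"user_id": "u1", "days_inactive": 15, "tickets": 2},
    {"user_id": "u2", "days_inactive": 30, "tickets": 10},
]
nightly_batch_job(table)

print(serve("u1"))                    # precomputed score for a known user
print(serve("unknown", default=0.0))  # fallback for a user the batch never saw
```

The fallback matters in practice: any user who appeared after the last batch run has no precomputed value, so the serving layer needs a default or a live scoring path for that case.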

Quick check

Q1. What two factors primarily drive the choice of inference pattern?
Q2. Why prefer batch inference when it's viable?
Q3. What does the hybrid precompute pattern do?

Next

That completes the MLOps serving picture. Pair this with feature stores for the online path and FinOps for the cost side.


Practice this in an interview

How do you choose between batch and real-time inference for a model?

Decide based on how fresh the prediction must be versus the cost and complexity of serving live. Use batch when results are needed every few hours or days, like daily churn lists, because it is cheap, simple, and can use spot or scheduled compute. Use real-time when a late or stale decision causes immediate loss, like fraud or ad auctions needing sub-100ms responses, accepting higher cost and complexity. Most production systems are hybrid: precompute heavy signals offline and do lightweight re-ranking online.

What are the differences between batch, online, and streaming inference, and when should you use each?

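Batch inference runs predictions over large datasets on a schedule and optimizes for throughput; use it when predictions can be computed ahead of time. Online inference serves individual requests in real time and optimizes for low latency; use it when the input only exists at request time and the answer is needed immediately. Streaming inference scores continuous event streams with bounded latency that falls between the two extremes; use it for always-on, event-driven cases such as anomaly detection on telemetry.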

What is the difference between batch and streaming data pipelines, and how do you choose between them?

Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.
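
The difference in correctness concerns can be illustrated with a toy event count per one-minute window. This is a sketch under stated assumptions, not how any particular engine implements it: the window size, the allowed lateness, the function names, and the rule "watermark = largest event time seen minus allowed lateness" are all choices made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_SECONDS = 60      # tumbling window size (assumption)
ALLOWED_LATENESS = 30    # how far the watermark trails the newest event (assumption)


def window_start(event_time: int) -> int:
    return event_time // WINDOW_SECONDS * WINDOW_SECONDS


def batch_counts(event_times: List[int]) -> Dict[int, int]:
    """Batch: all data is present, so arrival order does not matter."""
    counts: Dict[int, int] = defaultdict(int)
    for t in event_times:
        counts[window_start(t)] += 1
    return dict(counts)


def streaming_counts(arrivals: List[int]) -> Tuple[Dict[int, int], Dict[int, int], List[int]]:
    """Streaming: process events in arrival order and close windows by watermark."""
    open_windows: Dict[int, int] = defaultdict(int)
    emitted: Dict[int, int] = {}
    dropped: List[int] = []
    max_seen = 0
    for t in arrivals:
        max_seen = max(max_seen, t)
        watermark = max_seen - ALLOWED_LATENESS
        start = window_start(t)
        if start + WINDOW_SECONDS <= watermark:
            dropped.append(t)          # its window has already been closed
        else:
            open_windows[start] += 1
        # Emit every window whose end is at or before the watermark.
        for s in sorted(open_windows):
            if s + WINDOW_SECONDS <= watermark:
                emitted[s] = open_windows.pop(s)
    return emitted, dict(open_windows), dropped


# Event timestamps in arrival order; 50 and 10 arrive out of order.
arrivals = [5, 20, 70, 50, 95, 10, 130]
print(batch_counts(arrivals))
print(streaming_counts(arrivals))
```

Here the event at time 50 arrives late but before its window closes, so it is counted; the event at time 10 arrives after the watermark has passed its window and is dropped, which is why the streaming count for the first window differs from the batch count.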

How do you optimise GPU utilization for model serving, and what role does dynamic batching play?

GPUs run tensor operations efficiently only when each call carries enough work, typically a large enough batch dimension, to keep the compute units busy. Dynamic batching collects individual requests that arrive within a short window and fuses them into a single GPU call. This greatly improves throughput and cost efficiency, and the added per-request latency is capped by the configured wait threshold.
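
The batching rule itself can be sketched as an offline simulation over request arrival times. This is a simplified illustration, not a serving framework's implementation: `dynamic_batches`, `max_batch_size`, and `max_wait_ms` are hypothetical names, and a real server would run this logic concurrently against a live queue.

```python
from typing import List


def dynamic_batches(
    arrivals_ms: List[float], max_batch_size: int, max_wait_ms: float
) -> List[List[float]]:
    """Group request arrival times into batches.

    A batch opens with the first waiting request and is dispatched when it
    is full, or when a later request arrives more than max_wait_ms after
    the batch opened.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in sorted(arrivals_ms):
        # The wait window of the open batch has expired: dispatch it first.
        if current and t - current[0] > max_wait_ms:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: dispatch immediately without waiting.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 1, 2, 3, 4, 20, 21, 40]
print(dynamic_batches(arrivals, max_batch_size=4, max_wait_ms=5))
# → [[0, 1, 2, 3], [4], [20, 21], [40]]
```

Under bursty traffic the batches fill up and throughput rises; under sparse traffic requests go out alone after at most the wait threshold, which is the latency cost being traded.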
