How do you choose between batch and real-time inference for a model?

Decide based on how fresh the prediction must be versus the cost and complexity of serving live. Use batch when results are needed every few hours or days, like daily churn lists, because it is cheap, simple, and can use spot or scheduled compute. Use real-time when a late or stale decision causes immediate loss, like fraud or ad auctions needing sub-100ms responses, accepting higher cost and complexity. Most production systems are hybrid: precompute heavy signals offline and do lightweight re-ranking online.

What is training-serving skew, and how does a feature store help prevent it?

Training-serving skew is any mismatch between how features are computed during training and how they are computed at serving time, which silently degrades a model that looked fine offline. It arises when offline and online feature logic are implemented separately, for example a rolling average computed over a different window in each path. A feature store prevents it by keeping a single feature definition used for both batch training and online serving, so the same values and logic apply in both, and it supports point-in-time-correct retrieval to avoid leakage.
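
To make the "single feature definition" idea concrete, here is a minimal sketch in plain Python. It is not a real feature store API: the function name `spend_avg_7d`, the event layout, and the 7-day window are all assumptions made for illustration. The point it shows is that the training path and the serving path call the same function, and that the `as_of` cutoff keeps historical rows point-in-time correct.

```python
from datetime import datetime, timedelta

def spend_avg_7d(events, as_of):
    """Mean amount over the 7 days ending at as_of (hypothetical feature).

    events: iterable of (timestamp, amount) pairs for one user.
    Only events at or before as_of are used, so a training row never
    sees data from after its label timestamp.
    """
    window_start = as_of - timedelta(days=7)
    amounts = [amt for ts, amt in events if window_start < ts <= as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0

events = [
    (datetime(2024, 1, 1), 10.0),
    (datetime(2024, 1, 5), 30.0),
    (datetime(2024, 1, 9), 50.0),
]

# Training path: feature value as of a historical label timestamp.
train_value = spend_avg_7d(events, as_of=datetime(2024, 1, 6))

# Serving path: the same function, evaluated as of the request time.
serve_value = spend_avg_7d(events, as_of=datetime(2024, 1, 10))
```

Because both paths share one definition, a change to the window length applies to training and serving at once, which is the property that prevents the mismatched-window example above.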

What are the differences between batch, online, and streaming inference, and when should you use each?

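Batch inference runs predictions over large datasets on a schedule and optimizes for throughput; use it when predictions do not need to reflect the current instant, such as churn scores or lead rankings. Online inference serves individual requests in real time and optimizes for low latency; use it when the input only exists at request time and the answer is needed immediately, such as a fraud check on a transaction. Streaming inference scores a continuous event stream with bounded latency that falls between the two; use it for always-on cases such as anomaly detection on telemetry.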

What is the difference between batch and streaming data pipelines, and how do you choose between them?

Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.
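
The extra correctness work in streaming can be seen in a small sketch. The code below is a simplified, pure-Python model of tumbling-window counts with a watermark; it is not the API of Flink, Kafka Streams, or any other engine, and the `WINDOW` and `LATENESS` values are assumptions chosen for illustration. Events are plain integer timestamps in seconds.

```python
from collections import defaultdict

WINDOW = 60    # tumbling window size in seconds (assumed)
LATENESS = 30  # how far the watermark trails the newest event (assumed)

def batch_counts(events):
    """Batch: all data is already present, so just group by window."""
    counts = defaultdict(int)
    for event_time in events:
        counts[event_time // WINDOW * WINDOW] += 1
    return dict(counts)

def stream_counts(events):
    """Streaming: events arrive in list order, possibly out of order.

    The watermark trails the largest event time seen by LATENESS seconds.
    A window is emitted once the watermark passes its end; events that
    arrive for an already closed window are dropped as too late.
    """
    open_windows = defaultdict(int)
    emitted, dropped = {}, []
    watermark = float("-inf")
    for event_time in events:
        start = event_time // WINDOW * WINDOW
        if start + WINDOW <= watermark:
            dropped.append(event_time)  # its window has already closed
            continue
        open_windows[start] += 1
        watermark = max(watermark, event_time - LATENESS)
        for w in sorted(open_windows):  # sorted() copies the keys first
            if w + WINDOW <= watermark:
                emitted[w] = open_windows.pop(w)
    return emitted, dict(open_windows), dropped

events = [5, 50, 65, 40, 100, 20, 130]  # 40 is late but accepted, 20 is too late
emitted, still_open, dropped = stream_counts(events)
```

On this input the batch job counts four events in the first window, while the streaming job emits three and drops the event at `20`, which arrived after the watermark had closed that window. Choosing the lateness bound is the trade between freshness and completeness that batch never has to make.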

Batch vs real-time inference — MLOps

We just spent two lessons optimizing the live request path — FastAPI’s lifespan loads, BentoML’s adaptive batching, GPU schedulers — all to shave milliseconds off a prediction made while a user waits. And then we noticed the assumption underneath all of it: that the prediction needs to be made while the user waits at all. For most predictions, it doesn’t. This lesson is the cheaper question we skipped.

When people picture “serving a model,” they imagine a live API answering each request in milliseconds. But most predictions in production are not made that way — and choosing the wrong pattern either wastes a fortune or misses your latency target. The decision is driven by two questions: how fresh must the prediction be, and how many do you make?

The four patterns

Serving patterns by how fresh the prediction must be — from cheap nightly batch to per-request real-time.

Batch — score a whole table on a schedule (nightly, hourly) and write results to a database the app reads. The cheapest option, and the right one whenever predictions don’t need to reflect this instant: churn scores, lead rankings, recommendation precomputes, risk models. ~90% of use cases can be batch.
Real-time (online) — a live API scores each request in milliseconds. Necessary when the input only exists at request time and the answer is needed now: fraud on a transaction, search ranking, dynamic pricing. The most expensive and operationally demanding.
Async — the request is queued and the result returned in seconds-to-minutes (a slow model, a document to process). Decouples a heavy job from the user’s request.
Streaming — continuous scoring over an event stream (Kafka/Flink) for always-on use cases like real-time anomaly detection on telemetry.
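
The four patterns above can be read as a small decision procedure. The sketch below is one possible encoding, not a rule from the lesson: the function name `choose_pattern`, its parameters, and the 1-second cutoff between real-time and async are all illustrative assumptions.

```python
def choose_pattern(*, input_known_ahead, max_wait_seconds, continuous_stream=False):
    """Map a workload's needs to one of the four serving patterns.

    input_known_ahead: can the inputs be scored before anyone asks?
    max_wait_seconds:  how long the consumer can wait for a result.
    continuous_stream: is the workload an unbounded event stream?
    The 1-second cutoff is an assumed threshold for illustration only.
    """
    if continuous_stream:
        return "streaming"
    if input_known_ahead:
        return "batch"
    if max_wait_seconds < 1.0:
        return "real-time"
    return "async"

# The example workloads named in the list above.
examples = {
    "churn scores": choose_pattern(
        input_known_ahead=True, max_wait_seconds=86_400),
    "fraud on a transaction": choose_pattern(
        input_known_ahead=False, max_wait_seconds=0.1),
    "document processing": choose_pattern(
        input_known_ahead=False, max_wait_seconds=120),
    "telemetry anomalies": choose_pattern(
        input_known_ahead=False, max_wait_seconds=1, continuous_stream=True),
}
```

The ordering of the checks reflects the cost argument: batch is tried before real-time, so a workload only lands on the expensive live path when its input truly does not exist ahead of the request.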

The hybrid that wins

You don’t always have to choose. A common, powerful pattern: precompute expensive features (or even predictions) in batch, store them in a fast key-value store (Redis), and serve them live with a thin real-time layer. This is exactly what a feature store does for the online path — batch freshness, real-time latency.
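
A minimal sketch of the hybrid pattern follows. To stay self-contained it uses an in-memory class, `KeyValueStore`, as a stand-in for Redis; that class, the `score:<user_id>` key scheme, and the toy `expensive_score` formula are all assumptions for illustration, not a real client API or a real model.

```python
class KeyValueStore:
    """In-memory stand-in for a fast key-value store such as Redis."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

def expensive_score(user):
    """Placeholder for a heavy model; a toy integer formula here."""
    return user["orders"] * 2 + user["days_active"]

def nightly_batch_job(users, store):
    """Batch path: score every known user and write results to the store."""
    for user_id, user in users.items():
        store.set(f"score:{user_id}", expensive_score(user))

def serve(user_id, store, default_score=0.0):
    """Online path: a cheap lookup, with a fallback for unseen users."""
    return store.get(f"score:{user_id}", default_score)

users = {
    "u1": {"orders": 3, "days_active": 40},
    "u2": {"orders": 0, "days_active": 5},
}
store = KeyValueStore()
nightly_batch_job(users, store)   # runs on a schedule
known = serve("u1", store)        # runs per request, no model call
unknown = serve("u999", store)    # user created after the last batch run
```

The online path never calls the model, so its latency is the latency of one key lookup. The fallback for unseen users is the design choice to watch: anything that appears between batch runs gets the default until the next run.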

In one breath

The serving pattern follows from two questions — how fresh must the prediction be, and how many do you make — across four options: batch (score a whole table on a schedule, write to a DB the app reads — cheapest, and right for ~90% of cases like churn scores and recommendations), real-time (a live API, milliseconds per request, only when the input exists at request time and the answer can’t wait), async (queued, seconds to minutes, for heavy jobs), and streaming (continuous, on an event stream); and the hybrid that often wins is to precompute in batch, store in Redis, serve from a thin real-time layer — batch freshness at real-time latency.

Practice

Before the quiz, sort four workloads into patterns and justify each: (a) nightly churn scores shown on an internal dashboard, (b) fraud check on a card swipe, (c) a 30-second document-summarization model, (d) anomaly detection on a continuous sensor feed. Then the cost argument: the lesson says ~90% of use cases can be batch and that batch is ~10× cheaper — so why is “default to batch, earn your way to real-time” the financially correct instinct, and what does the hybrid precompute pattern let you avoid paying for?

Quick check

Q1. What two factors primarily drive the choice of inference pattern?

Q2. Why prefer batch inference when it's viable?

Q3. What does the hybrid precompute pattern do?

A question to carry forward

Whichever pattern you picked — and especially if it was real-time — notice the thing we have quietly never done in this whole chapter: change the model. We have a live service, batched and scaled and pattern-matched to its latency need, serving version 1. But models don’t stay at version 1. Retraining produces version 2, and it has to take over from version 1 while the service is up, answering real traffic, with no window to go dark.

That is its own discipline, and getting it wrong is how a “better” model takes down production at peak. So the question to carry forward is: when you have a new model version and a live service that can’t stop, how do you swap one for the other safely — testing the new one on real traffic before you trust it, and pulling it instantly if it misbehaves? Those are deployment strategies — canary, blue-green, shadow — and they are the next lesson.
