What are the differences between batch, online, and streaming inference, and when should you use each?
Batch inference runs predictions on large datasets on a schedule, optimizing for throughput. Online inference serves individual requests in real time, optimizing for low latency. Streaming inference scores events from a continuous stream as they arrive, with latency and throughput that fall between the two.
How to think about it
Batch inference runs a model over a large dataset on a schedule (hourly, nightly). Predictions are stored and looked up later. Ideal when results are not needed at request time — recommendation pre-computation, fraud scoring on daily transactions, churn propensity scores.
Online inference serves a single request synchronously under a strict latency SLA (p99 under 100 ms is a common target). Use it when the model needs context that is only available at request time: real-time fraud detection, search ranking, autocomplete.
Streaming inference consumes from a message queue (Kafka, Kinesis) and scores records as they arrive. It sits between batch and online: latency is typically sub-second, and throughput is higher than pure request/response because records can be scored in small groups. Common for click-stream feature aggregation or IoT anomaly detection.
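# Batch: score a DataFrame offline
import joblib
import pandas as pd

# Placeholder column names; use the columns the model was trained on.
FEATURE_COLS = ["feature_a", "feature_b"]
model = joblib.load("model.pkl")
# Reading s3:// paths with pandas requires the s3fs package.
df = pd.read_parquet("s3://data/features/2026-06-06.parquet")
# Column 1 of predict_proba is the positive-class probability.
df["score"] = model.predict_proba(df[FEATURE_COLS])[:, 1]
df.to_parquet("s3://data/scores/2026-06-06.parquet")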
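Batch predictions are "stored and looked up later", so the serving side reduces to a key lookup. A minimal sketch of that lookup follows; the `user_id` key column, the `DEFAULT_SCORE` fallback, and both helper names are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

# Assumption: fallback for entities that were missing from the last batch run.
DEFAULT_SCORE = 0.0


def build_score_index(scored: pd.DataFrame, key: str = "user_id") -> dict:
    """Map each entity key to its precomputed score for constant-time lookups."""
    return scored.set_index(key)["score"].to_dict()


def lookup_score(index: dict, entity_id, default: float = DEFAULT_SCORE) -> float:
    """Return the stored score, or the default if the entity was never scored."""
    return float(index.get(entity_id, default))


# In practice `scored` would be read from the batch job's output file.
scored = pd.DataFrame({"user_id": ["u1", "u2"], "score": [0.91, 0.12]})
index = build_score_index(scored)
print(lookup_score(index, "u1"))  # entity scored by the batch job
print(lookup_score(index, "u3"))  # unseen entity falls back to the default
```

The fallback matters because a batch job only covers entities that existed when it ran; anything newer has no stored score until the next run.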
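# Online: FastAPI endpoint
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")  # load once at startup, not per request

class Request(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: Request):
    # Wrap the single feature vector in a list to form a one-row batch.
    score = model.predict_proba([req.features])[0, 1]
    return {"score": float(score)}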
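The streaming case can be sketched in the same spirit. The example below is a broker-agnostic illustration, not a Kafka or Kinesis client: `events` is any iterable of dicts, where a real deployment would iterate over a consumer. The `id` and `features` field names, the helper names, and the `ConstantModel` stand-in are all assumptions.

```python
from itertools import islice
from typing import Iterable, Iterator


def micro_batches(events: Iterable, size: int) -> Iterator[list]:
    """Group an event stream into lists of at most `size` events."""
    it = iter(events)
    while batch := list(islice(it, size)):
        yield batch


def score_stream(events: Iterable[dict], model, batch_size: int = 32):
    """Yield (event_id, score) pairs, scoring events in small groups."""
    for batch in micro_batches(events, batch_size):
        probs = model.predict_proba([e["features"] for e in batch])
        for event, row in zip(batch, probs):
            # Column 1 is the positive-class probability.
            yield event["id"], float(row[1])


class ConstantModel:
    """Hypothetical stand-in for a trained classifier."""

    def predict_proba(self, rows):
        return [[0.25, 0.75] for _ in rows]


if __name__ == "__main__":
    stream = ({"id": i, "features": [float(i), 1.0]} for i in range(5))
    for event_id, score in score_stream(stream, ConstantModel(), batch_size=2):
        print(event_id, score)
```

Scoring in small groups is what buys throughput over one-request-at-a-time serving. Note that this sketch waits until a group is full or the stream ends, so a production consumer would also flush on a timeout to keep latency bounded when traffic is quiet.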