datarekha
Data Engineering | Medium | Asked at: Kafka, Google, Meta, Uber, LinkedIn

What is the difference between batch and streaming data pipelines, and how do you choose between them?

The short answer

Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.

How to think about it

The choice between batch and streaming is fundamentally a trade-off between acceptable latency and operational complexity, not a question of which is technically superior.

Batch pipelines

A batch job reads a bounded dataset (yesterday’s files, the last hour of database rows), transforms it, and writes results. Tools: Spark, dbt, Airflow-scheduled SQL.

Strengths:

  • Deterministic, easy to test locally with a sample file.
  • Failures are obvious: the job either succeeds or fails; rerunning a date partition is straightforward.
  • Cheap — compute clusters can be spun down between runs.
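The rerun property can be sketched in plain Python. This is a minimal illustration, not any particular tool's API: the function name `run_daily_job`, the `<date>.jsonl` input layout, and the `user_id` / `amount` fields are all assumptions made for the example. The job reads one bounded partition and overwrites its output, so running it twice for the same date produces the same file.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def run_daily_job(input_dir: Path, output_dir: Path, date: str) -> Path:
    """Aggregate revenue per user for one date partition (hypothetical layout)."""
    totals = defaultdict(float)
    # Bounded input: only the file for the requested date is read.
    with open(input_dir / f"{date}.jsonl") as f:
        for line in f:
            event = json.loads(line)
            totals[event["user_id"]] += event["amount"]

    out_path = output_dir / f"revenue_{date}.json"
    # Write to a temp file, then rename over the target, so a crash
    # mid-write does not leave a half-written partition behind.
    tmp_path = out_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(dict(totals), sort_keys=True))
    tmp_path.replace(out_path)
    return out_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        (root / "2024-01-01.jsonl").write_text(
            '{"user_id": "a", "amount": 10.0}\n'
            '{"user_id": "b", "amount": 2.0}\n'
        )
        first = run_daily_job(root, root, "2024-01-01").read_text()
        second = run_daily_job(root, root, "2024-01-01").read_text()
        print(first == second)  # a rerun overwrites the partition with identical content
```

Because the output is keyed by date and overwritten rather than appended to, a failed run can simply be retried. A job that appends would need deduplication to get the same guarantee.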

Weaknesses:

  • Data is stale by definition. With a nightly batch, dashboards can lag by up to 24 hours, plus the job’s own runtime.
  • Large windows mean a large blast radius when a job fails.
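To make the staleness point concrete, here is a small worked calculation. The function `dashboard_lag` and its parameters are hypothetical names for this sketch, and it assumes each run processes exactly one schedule interval of data and publishes only when it finishes.

```python
from datetime import timedelta


def dashboard_lag(interval: timedelta, start_delay: timedelta, runtime: timedelta):
    """Return (best_case, worst_case) age of the freshest data on a dashboard.

    Best case: the moment a run publishes, the newest data is as old as the
    time between the end of the interval and the end of the job.
    Worst case: just before the next run publishes, a full interval has been
    added on top of that.
    """
    best = start_delay + runtime
    worst = interval + start_delay + runtime
    return best, worst


if __name__ == "__main__":
    best, worst = dashboard_lag(
        interval=timedelta(hours=24),     # nightly schedule
        start_delay=timedelta(hours=1),   # job starts at 01:00 for the previous day
        runtime=timedelta(hours=2),       # job takes two hours
    )
    print(best, worst)
```

Under these assumed numbers the freshest data on the dashboard is between 3 and 27 hours old, which is why shortening the schedule interval is usually the first lever for reducing batch staleness.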

Streaming pipelines

Records are processed continuously, typically within milliseconds to seconds of arrival. Tools: Apache Kafka + Apache Flink, Spark Structured Streaming, Google Cloud Dataflow (Apache Beam).

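# Flink-style windowed aggregation (PyFlink-flavoured pseudo-code).
# env, kafka_source, RevenueAggregator and sink are assumed to be defined elsewhere;
# the 30-second out-of-orderness bound is an illustrative value.
watermarks = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(30))
stream = env.from_source(kafka_source, watermarks, "events")
(
    stream
    .key_by(lambda e: e["user_id"])                        # separate window state per user
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))  # 5-minute event-time windows
    .aggregate(RevenueAggregator())                        # incremental aggregation per window
    .add_sink(sink)
)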

Strengths:

  • Sub-second latency enables real-time fraud detection, live dashboards, and fresher recommendations.
  • Smaller, continuous writes reduce end-of-day spikes.

Weaknesses:

  • Late-arriving events require watermarks and allowances for out-of-order data.
  • Stateful operators need durable checkpoints; failure recovery is more complex than batch retry.
  • Harder to test: timing-dependent bugs (out-of-order events, backpressure) often reproduce only against a Kafka topic with realistic event rates.
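The interaction between watermarks, late events, and window state can be illustrated with a small pure-Python simulation. This is a simplified model under stated assumptions, not Flink's exact semantics: the names `windowed_revenue`, `WINDOW_SIZE`, and `MAX_OUT_OF_ORDER` are made up for the sketch, the watermark simply trails the largest event time seen by a fixed bound, and events for an already-closed window are dropped rather than routed to a side output.

```python
from collections import defaultdict

WINDOW_SIZE = 300       # 5-minute tumbling windows, in seconds
MAX_OUT_OF_ORDER = 30   # assumed bound on how far events may arrive out of order


def windowed_revenue(events):
    """Sum amounts per event-time window, closing windows as the watermark advances.

    `events` is an iterable of (event_time_seconds, amount) in arrival order.
    Returns (closed_windows, dropped_count). Windows still open when the input
    ends are not emitted, mirroring an unbounded stream.
    """
    open_windows = defaultdict(float)  # window_start -> running total (operator state)
    results, dropped = [], 0
    watermark = float("-inf")
    for event_time, amount in events:
        window_start = event_time - event_time % WINDOW_SIZE
        if window_start + WINDOW_SIZE <= watermark:
            dropped += 1  # too late: its window was already closed
        else:
            open_windows[window_start] += amount
        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, event_time - MAX_OUT_OF_ORDER)
        # Emit every window whose end the watermark has passed.
        for start in sorted(w for w in open_windows if w + WINDOW_SIZE <= watermark):
            results.append((start, open_windows.pop(start)))
    return results, dropped


if __name__ == "__main__":
    arrivals = [(10, 5.0), (290, 3.0), (310, 1.0), (280, 2.0),
                (340, 4.0), (295, 9.0), (650, 1.0)]
    print(windowed_revenue(arrivals))
    # → ([(0, 10.0), (300, 5.0)], 1)
```

The event at time 280 arrives out of order but before the watermark passes the end of its window, so it is counted. The event at time 295 arrives after the window has closed and is dropped. The `open_windows` dictionary is the kind of state a real engine would need to checkpoint durably in order to recover after a failure.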

Micro-batch as a middle ground

Spark Structured Streaming and Delta Live Tables can process data in micro-batches, commonly triggered every 30–300 seconds. You get near-real-time freshness with batch-like semantics, which makes this a reasonable choice for most analytics workloads that don’t need sub-second latency.
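The idea can be sketched without any framework. The helper below is a hypothetical illustration (the name `micro_batches` and the 60-second default are assumptions, not Spark's API): it slices an arrival-ordered stream into consecutive trigger intervals, and each slice is then handled as an ordinary small batch.

```python
from itertools import groupby


def micro_batches(records, interval_seconds=60):
    """Group (arrival_time_seconds, payload) records into consecutive batches.

    Assumes `records` is already in arrival order, so records of the same
    trigger interval are adjacent. Intervals with no records yield nothing.
    """
    def interval_of(record):
        return int(record[0] // interval_seconds)

    for batch_id, group in groupby(records, key=interval_of):
        yield batch_id, [payload for _, payload in group]


if __name__ == "__main__":
    arrivals = [(5, 10), (59, 20), (60, 5), (185, 1)]
    for batch_id, batch in micro_batches(arrivals):
        # Each batch is bounded, so plain batch logic (here: a sum) applies.
        print(batch_id, sum(batch))
```

In this model a record waits at most one trigger interval before its batch is processed, so end-to-end freshness is roughly the interval plus the processing time of one batch.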
