What is the difference between batch and streaming data pipelines, and how do you choose between them?

Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.

Why use a pipeline orchestrator like Airflow or Kubeflow instead of cron scripts for ML workflows?

ML workflows are multi-step DAGs with dependencies, and an orchestrator gives you dependency management, retries, backfills, caching, observability, and lineage that chained cron jobs cannot. Airflow is a general-purpose task orchestrator defining DAGs in Python, while Kubeflow Pipelines is ML-native, passing typed artifacts between containerized steps on Kubernetes with conditional logic like deploy only if accuracy exceeds a threshold. Choosing depends on whether you need generic scheduling or ML-specific, container-based pipelines.

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing cheaper models for easy requests, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

What chunking strategies exist for RAG and how do you choose between them?

Chunking splits source documents into retrievable units before embedding. The right strategy depends on document structure, query style, and the model's context window. Fixed-size chunks are simple but break mid-sentence; semantic or structural chunking preserves coherence; hierarchical chunking enables parent-document retrieval for richer context.

Queues & batch pipelines — Generative AI

The naive approach fails immediately

Your first instinct might be to write a loop: iterate over 50,000 doc IDs in a single HTTP request handler, call the LLM for each one, and return when done. That fails hard:

Timeout — HTTP requests time out in seconds to minutes; 50k LLM calls take hours.
No partial recovery — if the server crashes on doc 30,000 you restart from zero.
No progress visibility — the client hangs with zero feedback.
Rate limits — firing all 50k calls in a tight loop immediately hits the provider’s tokens-per-minute cap.

The solution is to make the work asynchronous and durable by routing it through a queue.

The kitchen analogy

Imagine a busy restaurant. Customers hand orders to a cashier — the producer — who clips them to a rail above the kitchen — the queue. Cooks — the workers — pull tickets at whatever pace the kitchen can handle. The cashier doesn’t wait for the food; the customer gets a buzzer and comes back when it’s ready. If a cook burns a dish (worker crash), the ticket is still on the rail for another cook to retry.

That rail is a message queue. It decouples the rate at which work arrives from the rate at which it gets done.

Queue or stream? Start with the consumer contract

A queue is the clearest fit when one worker pool should own and complete each job. Workers receive or lease messages, acknowledge success, retry failure and eventually send poison messages to a DLQ.

A retained event stream is the clearer fit when several independent applications must read the same ordered history, maintain their own offsets and replay old events after code changes. Product capabilities overlap, so choose by retention, replay, ordering and consumer-group semantics—not by whether a vendor calls its product a queue or stream.

The pipeline

In software terms:

Producer — your API endpoint or script receives the 50k doc IDs and enqueues one job per doc (SQS, RabbitMQ, Redis Streams, Kafka, or Celery — all follow the same model). This takes seconds, not hours.
Queue — durable storage for jobs; survives restarts; buffers bursts.
Worker pool — N processes (W1 … W4) each pull one job, call the LLM, and write the result to a results store (database, object storage).
Results store — where finished summaries land; your client polls here or receives a webhook.

Your worker count is the concurrency knob that determines throughput. Add workers to go faster; reduce workers to respect the LLM’s rate limit. The queue absorbs the difference.

The async pipeline. The producer returns instantly; workers drain the queue at the rate your LLM tier allows.

Why this architecture wins

Property	Synchronous loop	Queue + workers
Survives server crash	No — restart from zero	Yes — unacked jobs reappear
Scales processing	Vertical only	Add workers horizontally
Rate-limit control	Hard	Worker count = concurrency knob
Client feedback	None until done	Poll / webhook on partial progress
Burst handling	OOMs or times out	Queue absorbs the burst

Delivery semantics and idempotency

Most queues (SQS, RabbitMQ, Redis) give you at-least-once delivery: a message is guaranteed to be delivered, but it may be delivered more than once.

Here is when that happens: a worker receives a job (the message becomes invisible for a configurable visibility timeout), calls the LLM, writes the result — and then crashes before sending the ack (acknowledgement / delete). The visibility timeout lapses, the message reappears, and another worker processes it again.

The fix is to make your worker idempotent: keyed on the document ID, processing a doc twice produces one result entry, not two. The pattern is dead simple:

def process_job(doc_id: str):
    # Idempotency check first
    if results_store.exists(doc_id):
        return  # already done — safe to ack and skip
    summary = call_llm(doc_id)
    results_store.set(doc_id, summary)
    # ack the message only after writing
    queue.ack(doc_id)

The worker flow in full:

Receive — message is fetched; visibility timeout starts.
Check idempotency — is this doc already in the results store?
Process — call the LLM.
Write result — persist to the store before acking.
Ack / delete — remove the message from the queue; it will never reappear.
Worker dies at step 3 or 4? — timeout lapses, message reappears, another worker retries from step 1 safely.

Retries, backoff, and the dead-letter queue

Not every failure is transient. A doc might be corrupted, too long for the context window, or trigger a content filter. If you retry blindly and forever, one bad doc ties up a worker slot endlessly and blocks the rest of the queue — the poison message problem.

The standard pattern:

On a transient error (rate limit, network blip): retry with exponential backoff — wait 2 s, 4 s, 8 s… Most queue systems let you configure this per-queue or handle it in your worker.
After N attempts (3–5 is common): route the message to the dead-letter queue (DLQ). The DLQ is a separate queue you inspect later. It captures everything that couldn’t be processed without blocking the main pipeline.

Success acks immediately. Transient failures retry with exponential backoff. After 3 failures the job moves to the DLQ — out of the main pipeline so other jobs can proceed.

The DLQ is your safety valve. You inspect it on a schedule, fix the underlying issue (too-long document, bad encoding, content filter), and re-drive the messages back to the main queue.

Backpressure

A subtle but critical concept: what happens when your producer enqueues jobs faster than workers can process them? If you store all pending work in an in-memory list, the producer eventually runs out of RAM — a classic OOM crash.

The fix is backpressure: the queue is bounded and the producer blocks (or slows) when it is full. Workers pull work at their own pace; they never have work pushed onto them faster than they can handle. The queue acts as the buffer that absorbs bursts without letting them cascade into crashes.

Backpressure (left): the bounded queue signals the producer to slow when full, keeping memory under control. No backpressure (right): an unbounded in-memory list grows without limit until the process crashes.

In practice: use a proper queue library (Celery with Redis, SQS, etc.) and let it manage the bounded buffer. Never accumulate all 50k job objects in a Python list in memory before dispatching.

Throughput and cost: the Batch API

For non-realtime bulk work, check whether your LLM provider offers a Batch API. Both Anthropic and OpenAI offer asynchronous batch endpoints:

Roughly 50% cheaper than the synchronous API.
Higher rate limits (the provider schedules your batch across off-peak capacity).
Results returned within 24 hours via a file you download.

The workflow: package all 50k prompts into a single batch file, submit, poll for completion, download results. You still want a queue layer for progress tracking, retries on failed items, and chunk-level observability — but the actual LLM calls go through the batch endpoint.

# Pseudocode: submit a batch of summarization requests
batch = client.batches.create(
    requests=[
        {"custom_id": doc_id, "method": "POST", "url": "/v1/messages",
         "body": {"model": "claude-sonnet-4-5", "max_tokens": 256,
                  "messages": [{"role": "user",
                                "content": f"Summarize: {doc_text}"}]}}
        for doc_id, doc_text in chunk_of_docs
    ]
)
# Poll until batch.processing_status == "ended"
# Then: results = client.batches.results(batch.id)

Use Batch API for overnight runs, reporting pipelines, and any processing where a 24-hour window is acceptable. Use the synchronous queue pattern when you need lower latency or interactive progress.

Putting it all together

Here is what the full reliable pipeline looks like:

1. Producer: iterate doc IDs → enqueue job per doc (SQS/Redis/etc.)
2. Worker pool: N workers each:
     a. receive (visibility timeout starts)
     b. idempotency check (skip if already in store)
     c. call LLM (or Batch API for off-peak)
     d. write result to store
     e. ack → message deleted
     on transient error: exponential backoff, re-enqueue
     after 3 failures: route to DLQ
3. Progress tracker: count done / failed; expose endpoint for polling
4. DLQ monitor: alert on DLQ depth; re-drive after fixing root cause

The key numbers to tune for your pipeline: worker count (throughput vs rate limit), visibility timeout (longer than your 99th-percentile LLM call), max retry count (3–5), and DLQ alarm threshold.

Glossary

Producer — the code that enqueues work onto the queue.
Consumer / Worker — the process that pulls jobs from the queue and processes them.
Ack (acknowledgement) — the signal a worker sends to permanently delete a message after successful processing.
Visibility timeout — how long a received message stays invisible to other workers; if the worker doesn’t ack within this window, the message reappears.
At-least-once delivery — the queue guarantees delivery, but may deliver the same message more than once.
Idempotent — an operation that produces the same result no matter how many times it runs. Required when delivery is at-least-once.
Dead-letter queue (DLQ) — a separate queue that receives messages that have exceeded the maximum retry count.
Backpressure — the mechanism by which a slow consumer signals a fast producer to slow down, preventing memory overflow.

Quick check

0/3

Q1A worker receives a doc, calls the LLM, writes the summary — then crashes before sending the ack. What happens next, and why does idempotency matter?

Q2Why do you route a message to the dead-letter queue after N failures instead of retrying forever?

Q3When should you prefer the provider's Batch API over a synchronous worker-per-doc approach?

Queues & batch pipelines

What you'll learn

Before you start

The naive approach fails immediately

The kitchen analogy

Queue or stream? Start with the consumer contract

The pipeline

Why this architecture wins

Delivery semantics and idempotency

Retries, backoff, and the dead-letter queue

Backpressure

Throughput and cost: the Batch API

Putting it all together

Glossary

Quick check

Quick check

Sign in to track your progress

Practice this in an interview

Related lessons

Explore further