Queues & batch pipelines
50,000 documents won't go through your summarization pipeline in one request. How queues, workers, retries, and backpressure get the job done.
What you'll learn
- Why large/slow jobs are enqueued, not handled in the request
- Producer to queue to worker pool: decoupling and independent scaling
- At-least-once delivery, idempotency, acks and visibility timeouts
- Retries with backoff and the dead-letter queue for poison messages
- Backpressure and the Batch API for cheap, high-throughput processing
Before you start
The naive approach fails immediately
Your first instinct might be to write a loop: iterate over 50,000 doc IDs in a single HTTP request handler, call the LLM for each one, and return when done. That fails hard:
- Timeout — HTTP requests time out in seconds to minutes; 50k LLM calls take hours.
- No partial recovery — if the server crashes on doc 30,000 you restart from zero.
- No progress visibility — the client hangs with zero feedback.
- Rate limits — firing all 50k calls in a tight loop immediately hits the provider’s tokens-per-minute cap.
The solution is to make the work asynchronous and durable by routing it through a queue.
The kitchen analogy
Imagine a busy restaurant. Customers hand orders to a cashier — the producer — who clips them to a rail above the kitchen — the queue. Cooks — the workers — pull tickets at whatever pace the kitchen can handle. The cashier doesn’t wait for the food; the customer gets a buzzer and comes back when it’s ready. If a cook burns a dish (worker crash), the ticket is still on the rail for another cook to retry.
That rail is a message queue. It decouples the rate at which work arrives from the rate at which it gets done.
The pipeline
In software terms:
- Producer — your API endpoint or script receives the 50k doc IDs and enqueues one job per doc (SQS, RabbitMQ, Redis Streams, Kafka, or Celery — all follow the same model). This takes seconds, not hours.
- Queue — durable storage for jobs; survives restarts; buffers bursts.
- Worker pool — N processes (W1 … W4) each pull one job, call the LLM, and write the result to a results store (database, object storage).
- Results store — where finished summaries land; your client polls here or receives a webhook.
Your worker count is the concurrency knob that determines throughput. Add workers to go faster; reduce workers to respect the LLM’s rate limit. The queue absorbs the difference.
Why this architecture wins
| Property | Synchronous loop | Queue + workers |
|---|---|---|
| Survives server crash | No — restart from zero | Yes — unacked jobs reappear |
| Scales processing | Vertical only | Add workers horizontally |
| Rate-limit control | Hard | Worker count = concurrency knob |
| Client feedback | None until done | Poll / webhook on partial progress |
| Burst handling | OOMs or times out | Queue absorbs the burst |
Delivery semantics and idempotency
Most queues (SQS, RabbitMQ, Redis) give you at-least-once delivery: a message is guaranteed to be delivered, but it may be delivered more than once.
Here is when that happens: a worker receives a job (the message becomes invisible for a configurable visibility timeout), calls the LLM, writes the result — and then crashes before sending the ack (acknowledgement / delete). The visibility timeout lapses, the message reappears, and another worker processes it again.
The fix is to make your worker idempotent: keyed on the document ID, processing a doc twice produces one result entry, not two. The pattern is dead simple:
def process_job(doc_id: str):
# Idempotency check first
if results_store.exists(doc_id):
return # already done — safe to ack and skip
summary = call_llm(doc_id)
results_store.set(doc_id, summary)
# ack the message only after writing
queue.ack(doc_id)
The worker flow in full:
- Receive — message is fetched; visibility timeout starts.
- Check idempotency — is this doc already in the results store?
- Process — call the LLM.
- Write result — persist to the store before acking.
- Ack / delete — remove the message from the queue; it will never reappear.
- Worker dies at step 3 or 4? — timeout lapses, message reappears, another worker retries from step 1 safely.
Retries, backoff, and the dead-letter queue
Not every failure is transient. A doc might be corrupted, too long for the context window, or trigger a content filter. If you retry blindly and forever, one bad doc ties up a worker slot endlessly and blocks the rest of the queue — the poison message problem.
The standard pattern:
- On a transient error (rate limit, network blip): retry with exponential backoff — wait 2 s, 4 s, 8 s… Most queue systems let you configure this per-queue or handle it in your worker.
- After N attempts (3–5 is common): route the message to the dead-letter queue (DLQ). The DLQ is a separate queue you inspect later. It captures everything that couldn’t be processed without blocking the main pipeline.
The DLQ is your safety valve. You inspect it on a schedule, fix the underlying issue (too-long document, bad encoding, content filter), and re-drive the messages back to the main queue.
Backpressure
A subtle but critical concept: what happens when your producer enqueues jobs faster than workers can process them? If you store all pending work in an in-memory list, the producer eventually runs out of RAM — a classic OOM crash.
The fix is backpressure: the queue is bounded and the producer blocks (or slows) when it is full. Workers pull work at their own pace; they never have work pushed onto them faster than they can handle. The queue acts as the buffer that absorbs bursts without letting them cascade into crashes.
In practice: use a proper queue library (Celery with Redis, SQS, etc.) and let it manage the bounded buffer. Never accumulate all 50k job objects in a Python list in memory before dispatching.
Throughput and cost: the Batch API
For non-realtime bulk work, check whether your LLM provider offers a Batch API. Both Anthropic and OpenAI offer asynchronous batch endpoints:
- Roughly 50% cheaper than the synchronous API.
- Higher rate limits (the provider schedules your batch across off-peak capacity).
- Results returned within 24 hours via a file you download.
The workflow: package all 50k prompts into a single batch file, submit, poll for completion, download results. You still want a queue layer for progress tracking, retries on failed items, and chunk-level observability — but the actual LLM calls go through the batch endpoint.
# Pseudocode: submit a batch of summarization requests
batch = client.batches.create(
requests=[
{"custom_id": doc_id, "method": "POST", "url": "/v1/messages",
"body": {"model": "claude-sonnet-4-5", "max_tokens": 256,
"messages": [{"role": "user",
"content": f"Summarize: {doc_text}"}]}}
for doc_id, doc_text in chunk_of_docs
]
)
# Poll until batch.processing_status == "ended"
# Then: results = client.batches.results(batch.id)
Use Batch API for overnight runs, reporting pipelines, and any processing where a 24-hour window is acceptable. Use the synchronous queue pattern when you need lower latency or interactive progress.
Putting it all together
Here is what the full reliable pipeline looks like:
1. Producer: iterate doc IDs → enqueue job per doc (SQS/Redis/etc.)
2. Worker pool: N workers each:
a. receive (visibility timeout starts)
b. idempotency check (skip if already in store)
c. call LLM (or Batch API for off-peak)
d. write result to store
e. ack → message deleted
on transient error: exponential backoff, re-enqueue
after 3 failures: route to DLQ
3. Progress tracker: count done / failed; expose endpoint for polling
4. DLQ monitor: alert on DLQ depth; re-drive after fixing root cause
The key numbers to tune for your pipeline: worker count (throughput vs rate limit), visibility timeout (longer than your 99th-percentile LLM call), max retry count (3–5), and DLQ alarm threshold.
Glossary
- Producer — the code that enqueues work onto the queue.
- Consumer / Worker — the process that pulls jobs from the queue and processes them.
- Ack (acknowledgement) — the signal a worker sends to permanently delete a message after successful processing.
- Visibility timeout — how long a received message stays invisible to other workers; if the worker doesn’t ack within this window, the message reappears.
- At-least-once delivery — the queue guarantees delivery, but may deliver the same message more than once.
- Idempotent — an operation that produces the same result no matter how many times it runs. Required when delivery is at-least-once.
- Dead-letter queue (DLQ) — a separate queue that receives messages that have exceeded the maximum retry count.
- Backpressure — the mechanism by which a slow consumer signals a fast producer to slow down, preventing memory overflow.
Quick check
Quick check
Practice this in an interview
All questionsBatch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.
Chunking splits source documents into retrievable units before embedding. The right strategy depends on document structure, query style, and the model's context window. Fixed-size chunks are simple but break mid-sentence; semantic or structural chunking preserves coherence; hierarchical chunking enables parent-document retrieval for richer context.
Airflow models pipelines as Directed Acyclic Graphs (DAGs) of tasks, each with defined dependencies. The scheduler triggers DAG runs based on a cron schedule, passing each run a logical execution date rather than the wall-clock time. A backfill re-runs a DAG over a historical date range, allowing you to populate data for past periods after adding a new pipeline or fixing a bug — as long as tasks are idempotent.
Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.