datarekha

Queues & batch pipelines

50,000 documents won't go through your summarization pipeline in one request. How queues, workers, retries, and backpressure get the job done.

9 min read Advanced Generative AI Lesson 23 of 24

What you'll learn

  • Why large/slow jobs are enqueued, not handled in the request
  • Producer to queue to worker pool: decoupling and independent scaling
  • At-least-once delivery, idempotency, acks and visibility timeouts
  • Retries with backoff and the dead-letter queue for poison messages
  • Backpressure and the Batch API for cheap, high-throughput processing

Before you start

The naive approach fails immediately

Your first instinct might be to write a loop: iterate over 50,000 doc IDs in a single HTTP request handler, call the LLM for each one, and return when done. That fails hard:

  • Timeout — HTTP requests time out in seconds to minutes; 50k LLM calls take hours.
  • No partial recovery — if the server crashes on doc 30,000 you restart from zero.
  • No progress visibility — the client hangs with zero feedback.
  • Rate limits — firing all 50k calls in a tight loop immediately hits the provider’s tokens-per-minute cap.

The solution is to make the work asynchronous and durable by routing it through a queue.

The kitchen analogy

Imagine a busy restaurant. Customers hand orders to a cashier — the producer — who clips them to a rail above the kitchen — the queue. Cooks — the workers — pull tickets at whatever pace the kitchen can handle. The cashier doesn’t wait for the food; the customer gets a buzzer and comes back when it’s ready. If a cook burns a dish (worker crash), the ticket is still on the rail for another cook to retry.

That rail is a message queue. It decouples the rate at which work arrives from the rate at which it gets done.

The pipeline

In software terms:

  1. Producer — your API endpoint or script receives the 50k doc IDs and enqueues one job per doc (SQS, RabbitMQ, Redis Streams, Kafka, or Celery — all follow the same model). This takes seconds, not hours.
  2. Queue — durable storage for jobs; survives restarts; buffers bursts.
  3. Worker pool — N processes (W1 … W4) each pull one job, call the LLM, and write the result to a results store (database, object storage).
  4. Results store — where finished summaries land; your client polls here or receives a webhook.

Your worker count is the concurrency knob that determines throughput. Add workers to go faster; reduce workers to respect the LLM’s rate limit. The queue absorbs the difference.

Producerenqueues 50k jobs(returns immediately)Queuejob: doc_49999job: doc_49998job: doc_49997… 50k totalpullWorker poolconcurrency = rate-limit knobW1W2W3W4each: receive → LLM call → ackwriteResultsstore✓ summaries
The async pipeline. The producer returns instantly; workers drain the queue at the rate your LLM tier allows.

Why this architecture wins

PropertySynchronous loopQueue + workers
Survives server crashNo — restart from zeroYes — unacked jobs reappear
Scales processingVertical onlyAdd workers horizontally
Rate-limit controlHardWorker count = concurrency knob
Client feedbackNone until donePoll / webhook on partial progress
Burst handlingOOMs or times outQueue absorbs the burst

Delivery semantics and idempotency

Most queues (SQS, RabbitMQ, Redis) give you at-least-once delivery: a message is guaranteed to be delivered, but it may be delivered more than once.

Here is when that happens: a worker receives a job (the message becomes invisible for a configurable visibility timeout), calls the LLM, writes the result — and then crashes before sending the ack (acknowledgement / delete). The visibility timeout lapses, the message reappears, and another worker processes it again.

The fix is to make your worker idempotent: keyed on the document ID, processing a doc twice produces one result entry, not two. The pattern is dead simple:

def process_job(doc_id: str):
    # Idempotency check first
    if results_store.exists(doc_id):
        return  # already done — safe to ack and skip
    summary = call_llm(doc_id)
    results_store.set(doc_id, summary)
    # ack the message only after writing
    queue.ack(doc_id)

The worker flow in full:

  1. Receive — message is fetched; visibility timeout starts.
  2. Check idempotency — is this doc already in the results store?
  3. Process — call the LLM.
  4. Write result — persist to the store before acking.
  5. Ack / delete — remove the message from the queue; it will never reappear.
  6. Worker dies at step 3 or 4? — timeout lapses, message reappears, another worker retries from step 1 safely.

Retries, backoff, and the dead-letter queue

Not every failure is transient. A doc might be corrupted, too long for the context window, or trigger a content filter. If you retry blindly and forever, one bad doc ties up a worker slot endlessly and blocks the rest of the queue — the poison message problem.

The standard pattern:

  • On a transient error (rate limit, network blip): retry with exponential backoff — wait 2 s, 4 s, 8 s… Most queue systems let you configure this per-queue or handle it in your worker.
  • After N attempts (3–5 is common): route the message to the dead-letter queue (DLQ). The DLQ is a separate queue you inspect later. It captures everything that couldn’t be processed without blocking the main pipeline.
Jobfrom queueProcesscall LLMwrite resultsuccessAck✓ deletedfailRetry w/ backoff2 s → 4 s → 8 sattempt 1 / 2 / 3re-queue× 3 failsDead-LetterQueue (DLQ)inspect later
Success acks immediately. Transient failures retry with exponential backoff. After 3 failures the job moves to the DLQ — out of the main pipeline so other jobs can proceed.

The DLQ is your safety valve. You inspect it on a schedule, fix the underlying issue (too-long document, bad encoding, content filter), and re-drive the messages back to the main queue.

Backpressure

A subtle but critical concept: what happens when your producer enqueues jobs faster than workers can process them? If you store all pending work in an in-memory list, the producer eventually runs out of RAM — a classic OOM crash.

The fix is backpressure: the queue is bounded and the producer blocks (or slows) when it is full. Workers pull work at their own pace; they never have work pushed onto them faster than they can handle. The queue acts as the buffer that absorbs bursts without letting them cascade into crashes.

✓ Bounded queue (good)Producerfast burstQueue(bounded)absorbs burstback-pressure(slow down)pullWorkerssteady pace✗ Unbounded list (bad)Producerfast burstIn-memorylist↑ growing fastOOMcrashQueue is the buffer. Workers pull at their pace.Producer slows when queue is full.No buffer = unbounded growth.Producer never slows = OOM.
Backpressure (left): the bounded queue signals the producer to slow when full, keeping memory under control. No backpressure (right): an unbounded in-memory list grows without limit until the process crashes.

In practice: use a proper queue library (Celery with Redis, SQS, etc.) and let it manage the bounded buffer. Never accumulate all 50k job objects in a Python list in memory before dispatching.

Throughput and cost: the Batch API

For non-realtime bulk work, check whether your LLM provider offers a Batch API. Both Anthropic and OpenAI offer asynchronous batch endpoints:

  • Roughly 50% cheaper than the synchronous API.
  • Higher rate limits (the provider schedules your batch across off-peak capacity).
  • Results returned within 24 hours via a file you download.

The workflow: package all 50k prompts into a single batch file, submit, poll for completion, download results. You still want a queue layer for progress tracking, retries on failed items, and chunk-level observability — but the actual LLM calls go through the batch endpoint.

# Pseudocode: submit a batch of summarization requests
batch = client.batches.create(
    requests=[
        {"custom_id": doc_id, "method": "POST", "url": "/v1/messages",
         "body": {"model": "claude-sonnet-4-5", "max_tokens": 256,
                  "messages": [{"role": "user",
                                "content": f"Summarize: {doc_text}"}]}}
        for doc_id, doc_text in chunk_of_docs
    ]
)
# Poll until batch.processing_status == "ended"
# Then: results = client.batches.results(batch.id)

Use Batch API for overnight runs, reporting pipelines, and any processing where a 24-hour window is acceptable. Use the synchronous queue pattern when you need lower latency or interactive progress.

Putting it all together

Here is what the full reliable pipeline looks like:

1. Producer: iterate doc IDs → enqueue job per doc (SQS/Redis/etc.)
2. Worker pool: N workers each:
     a. receive (visibility timeout starts)
     b. idempotency check (skip if already in store)
     c. call LLM (or Batch API for off-peak)
     d. write result to store
     e. ack → message deleted
     on transient error: exponential backoff, re-enqueue
     after 3 failures: route to DLQ
3. Progress tracker: count done / failed; expose endpoint for polling
4. DLQ monitor: alert on DLQ depth; re-drive after fixing root cause

The key numbers to tune for your pipeline: worker count (throughput vs rate limit), visibility timeout (longer than your 99th-percentile LLM call), max retry count (3–5), and DLQ alarm threshold.

Glossary

  • Producer — the code that enqueues work onto the queue.
  • Consumer / Worker — the process that pulls jobs from the queue and processes them.
  • Ack (acknowledgement) — the signal a worker sends to permanently delete a message after successful processing.
  • Visibility timeout — how long a received message stays invisible to other workers; if the worker doesn’t ack within this window, the message reappears.
  • At-least-once delivery — the queue guarantees delivery, but may deliver the same message more than once.
  • Idempotent — an operation that produces the same result no matter how many times it runs. Required when delivery is at-least-once.
  • Dead-letter queue (DLQ) — a separate queue that receives messages that have exceeded the maximum retry count.
  • Backpressure — the mechanism by which a slow consumer signals a fast producer to slow down, preventing memory overflow.

Quick check

Quick check

0/3
Q1A worker receives a doc, calls the LLM, writes the summary — then crashes before sending the ack. What happens next, and why does idempotency matter?
Q2Why do you route a message to the dead-letter queue after N failures instead of retrying forever?
Q3When should you prefer the provider's Batch API over a synchronous worker-per-doc approach?

Practice this in an interview

All questions
What is the difference between batch and streaming data pipelines, and how do you choose between them?

Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.

What chunking strategies exist for RAG and how do you choose between them?

Chunking splits source documents into retrievable units before embedding. The right strategy depends on document structure, query style, and the model's context window. Fixed-size chunks are simple but break mid-sentence; semantic or structural chunking preserves coherence; hierarchical chunking enables parent-document retrieval for richer context.

How does Apache Airflow work, and what is a DAG backfill?

Airflow models pipelines as Directed Acyclic Graphs (DAGs) of tasks, each with defined dependencies. The scheduler triggers DAG runs based on a cron schedule, passing each run a logical execution date rather than the wall-clock time. A backfill re-runs a DAG over a historical date range, allowing you to populate data for past periods after adding a new pipeline or fixing a bug — as long as tasks are idempotent.

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content