datarekha
Infrastructure May 25, 2026

50,000 documents, one summarization pipeline

You can't loop 50k LLM calls inside a request. The shape that works — a queue, a pool of workers, and the boring reliability details that decide whether it finishes.

8 min read · by datarekha · llm, pipelines, queues, batch, infrastructure

The ask was simple. “Just summarize all our documents.” The engineer who got the ticket wrote the obvious script: iterate the corpus, call the LLM, collect results. Forty-seven lines of Python. It ran fine against the first hundred docs in staging.

Then it hit production — 50,000 documents, real network, a rate-limited API key — and it died in every way a naive loop can die.

First death: the upstream service that triggered the job had a 30-second timeout. The script was still processing doc number 12 when the connection closed. The caller got a 504, assumed failure, and retried. Nothing had failed. The original run was still going, so fifty thousand documents were now being processed twice, racing against each other, writing half-results to the same output table.

Second death: the script crashed at doc 31,402. A single malformed PDF threw an unhandled exception. The process exited. There was no record of which documents had been processed. Starting over meant re-running all 31,000 completed documents — at full API cost — or manually reconstructing state from partial outputs. The team chose the latter and spent a weekend doing it.

Third death: the rate limiter. The loop had no throttle. It sent requests as fast as Python could iterate a list, which is very fast. By doc 800 the API was returning 429s. Once exponential backoff was added to the retry logic, the script spent more time sleeping than working, and the total estimated runtime ballooned to 19 hours.

Nobody measured how far it got. There was no progress counter, no success/failure log, no way to answer the question “how many docs do we have left?”

This is the obligatory naive-loop autopsy. Every team runs it once. The lesson is not “be more careful.” The lesson is that the shape of the solution is wrong.

The shape that actually works

The correct architecture for 50,000 independent LLM calls has three parts: a producer, a queue, and a pool of workers.

The producer’s job is narrow: read the list of documents, push one job per document onto the queue, and stop. It runs for a few seconds. It does not call the LLM. It does not wait for results. When the client asks “start the summarization run,” the producer enqueues the work and immediately returns a job ID. The client polls that ID — or gets a webhook — to learn when the run completes. The HTTP request that triggered everything resolves in milliseconds.
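A minimal sketch of that producer, assuming an in-process deque as a stand-in for the durable queue. The names start_run, run_id, and the message fields are hypothetical; a real producer would call your queue client instead of append.

```python
import uuid
from collections import deque


def start_run(doc_ids, job_queue):
    """Enqueue one lightweight job per document and return a run ID at once.

    The producer never calls the LLM and never waits for results.
    """
    run_id = str(uuid.uuid4())
    for doc_id in doc_ids:
        # Each message carries only identifiers, not document contents.
        job_queue.append({"run_id": run_id, "doc_id": doc_id})
    return run_id


job_queue = deque()  # stand-in for SQS, RabbitMQ, Kafka, or Redis Streams
run_id = start_run(["doc-1", "doc-2", "doc-3"], job_queue)
```

The caller gets run_id back as soon as the loop finishes and uses it to poll for completion.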

The queue is a durable buffer. SQS, RabbitMQ, Kafka, Redis Streams — the technology choice matters less than the property you’re buying: the queue outlives any single process. Messages persist. If every worker dies simultaneously, the jobs are still there when workers come back. The queue is the spine of the whole operation.

The workers are the only things that touch the LLM. Each worker pulls one job off the queue, calls the API, writes the result to a store (a database keyed by document ID), and acknowledges the message. Worker count is your concurrency knob. Want to respect a 100-requests-per-minute rate limit? Run five workers, each pulling one job every three seconds. Want to go faster once you’ve negotiated a higher limit? Add workers. The throttle is mechanical and easy to reason about — it’s not a sleep call buried inside a retry block.
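The rate arithmetic above can be sketched as follows. This is an illustration under assumptions: seconds_per_job and run_worker are hypothetical names, summarize stands in for the real LLM call, and the standard-library queue.Queue stands in for the durable queue.

```python
import queue
import time


def seconds_per_job(requests_per_minute, worker_count):
    """How long each worker should spend per job to stay under the limit."""
    return 60.0 * worker_count / requests_per_minute


def run_worker(jobs, results, summarize, interval, sleep=time.sleep):
    """Pull jobs until the queue is empty, pacing to one job per interval."""
    while True:
        try:
            doc_id = jobs.get_nowait()
        except queue.Empty:
            return
        started = time.monotonic()
        results[doc_id] = summarize(doc_id)  # the only place the LLM is called
        jobs.task_done()
        # Wait out whatever is left of this worker's time slot.
        remaining = interval - (time.monotonic() - started)
        if remaining > 0:
            sleep(remaining)


interval = seconds_per_job(requests_per_minute=100, worker_count=5)  # 3.0
```

Changing worker_count and recomputing the interval is the whole throttle; nothing else in the worker needs to know about the rate limit.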

[Diagram: Producer (enqueues 50k jobs) → Queue (SQS / RabbitMQ; durable, bounded) → Worker Pool (pull → call LLM, ack on success; N workers = rate knob) → Results (keyed by doc id). Failed jobs retry with backoff, then go to the Dead-Letter Queue (triage later).]
Producer → Queue → Worker Pool → Results. Failures retry with backoff; after N attempts the job routes to the dead-letter queue instead of looping forever.

The reliability lessons, told as scars

At-least-once delivery. This is the property that surprises new queue users. When a worker pulls a job, the queue does not immediately delete the message. It hides it for a configurable visibility timeout — say, 90 seconds. If the worker processes the job successfully and sends an acknowledgment, the message is deleted. If the worker dies mid-flight, the timeout expires and the message becomes visible again; another worker picks it up. No work is lost.

The flip side: a worker can successfully process a job and then crash before it sends the ack. The queue will re-deliver the message. That document will be processed twice. This is not a bug in the queue — it is the explicit trade-off queues make for durability. Your code must account for it.
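To make the visibility timeout concrete, here is a toy in-memory queue. It is not how any real broker is implemented; ToyQueue and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class ToyQueue:
    """A toy at-least-once queue with a visibility timeout."""

    def __init__(self, visibility_timeout=90.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # msg_id -> [body, invisible_until]
        self._next_id = 0

    def send(self, body):
        self._messages[self._next_id] = [body, 0.0]
        self._next_id += 1

    def receive(self, now):
        """Return (msg_id, body) for a visible message and hide it, else None."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return msg_id, entry[0]
        return None

    def ack(self, msg_id):
        """Delete the message; only now is it gone for good."""
        self._messages.pop(msg_id, None)


q = ToyQueue(visibility_timeout=90.0)
q.send("doc-42")
first = q.receive(now=0.0)    # (0, "doc-42"), worker A takes the job
hidden = q.receive(now=10.0)  # None, the message is hidden from others
# Worker A crashes here and never acks.
second = q.receive(now=91.0)  # (0, "doc-42"), redelivered to worker B
q.ack(second[0])
```

The second delivery is the same message the first worker may already have finished, which is why the worker's side effects have to tolerate repeats.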

The fix is idempotency: key your results by document ID, have the worker skip the LLM call when a result already exists, and make writing the result a no-op in that case. Processing the same document twice produces one record, not two. No corrupted state, and in the common case no double charge. This is a five-line change that makes the entire system safe to retry.
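One way to get an idempotent write, sketched with the standard-library sqlite3 module. The table name, column names, and helper functions are assumptions for illustration; the same idea works with any store that enforces a unique key. Checking for an existing row before the call is what avoids paying for the LLM twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real pipeline would use a durable database
conn.execute(
    "CREATE TABLE summaries (doc_id TEXT PRIMARY KEY, summary TEXT NOT NULL)"
)


def has_summary(conn, doc_id):
    row = conn.execute(
        "SELECT 1 FROM summaries WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return row is not None


def save_summary(conn, doc_id, summary):
    """Write the result once. Returns True if written, False if it existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO summaries (doc_id, summary) VALUES (?, ?)",
        (doc_id, summary),
    )
    conn.commit()
    return cur.rowcount == 1


def summarize_once(conn, doc_id, call_llm):
    """Skip the LLM call entirely when a result is already stored."""
    if has_summary(conn, doc_id):
        return False
    return save_summary(conn, doc_id, call_llm(doc_id))
```

The primary key does the real work: even if two workers race past the has_summary check, only one insert lands.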

Failure handling. Transient errors (a momentary 503, a network blip, a provider throttle) should trigger a retry with exponential backoff. Many queue systems and job frameworks let you configure this without custom code: retry after 10 seconds, then 30, then 90, up to N attempts. This handles the 429 problem without any application-level sleep loops.
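The 10, 30, 90 schedule is a base delay multiplied by three on each attempt. A small sketch of that policy, where the function names, the cap, the attempt budget, and the exact set of status codes treated as transient are all assumptions:

```python
# HTTP statuses treated as transient here; adjust for your provider.
TRANSIENT_STATUS = {429, 500, 502, 503, 504}


def retry_delay(attempt, base=10.0, factor=3.0, cap=600.0):
    """Seconds to wait before retry number `attempt` (1-based), capped."""
    return min(cap, base * factor ** (attempt - 1))


def should_retry(status, attempt, max_attempts=5):
    """Retry only transient failures, and only while attempts remain."""
    return status in TRANSIENT_STATUS and attempt < max_attempts


schedule = [retry_delay(n) for n in (1, 2, 3)]  # [10.0, 30.0, 90.0]
```

With a queue, the delay is usually applied by making the message invisible for that long rather than by sleeping in the worker. Production setups often add random jitter so retries from many workers do not land at the same instant.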

But some documents will never succeed. A PDF that the parser cannot decode. A document so long it exceeds the model’s context window. A file that is actually a renamed .exe. After N retry attempts, these jobs must stop retrying — otherwise they loop forever, consuming retries and delaying real work. Route them to a dead-letter queue: a separate queue that catches exhausted jobs for human inspection. Your pipeline finishes. You triage the dead-letter queue later, on your own schedule.
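The routing rule can be sketched in a few lines. This is an in-memory illustration, assuming a retry budget of three and hypothetical names throughout; real brokers typically track the attempt count for you and move the message once it exceeds a configured maximum.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget


def run_until_drained(doc_ids, work, max_attempts=MAX_ATTEMPTS):
    """Process every job; exhausted jobs land in dead_letters, not back in line."""
    main_queue = deque({"doc_id": d, "attempts": 0} for d in doc_ids)
    done, dead_letters = [], []
    while main_queue:
        job = main_queue.popleft()
        try:
            work(job["doc_id"])
        except Exception as exc:
            job["attempts"] += 1
            job["last_error"] = repr(exc)  # keep context for later triage
            if job["attempts"] >= max_attempts:
                dead_letters.append(job)
            else:
                main_queue.append(job)
        else:
            done.append(job["doc_id"])
    return done, dead_letters


def fake_summarize(doc_id):
    """Stand-in for the real work; one input can never succeed."""
    if doc_id.endswith(".exe"):
        raise ValueError("not a document")


done, dead_letters = run_until_drained(
    ["a.pdf", "renamed.exe", "b.pdf"], fake_summarize
)
```

The loop terminates even though one job can never succeed, and the dead-letter entry keeps the last error for whoever triages it.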

Backpressure. An in-memory list of 50,000 document objects is not a queue — it is an OOM waiting to happen. The producer reads the document metadata (IDs, paths) and pushes lightweight messages onto the queue; the actual documents are fetched from storage by each worker at processing time. The queue is bounded: configure a maximum depth and the producer blocks when it fills. This means the producer cannot get arbitrarily far ahead of the workers. Memory usage is flat regardless of corpus size. The system degrades gracefully when workers fall behind.
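The blocking behavior is easiest to see in-process. The sketch below uses the standard-library queue.Queue with a maxsize as a stand-in; a hosted broker may not expose a hard depth limit, in which case the producer would check queue depth itself before sending more. The function names and the None sentinel are assumptions.

```python
import queue
import threading


def produce(doc_ids, jobs):
    """Push IDs only; put() blocks whenever the queue is full."""
    for doc_id in doc_ids:
        jobs.put(doc_id)
    jobs.put(None)  # sentinel: no more work


def consume(jobs):
    """Drain the queue, recording the deepest it ever got."""
    seen, peak_depth = [], 0
    while True:
        peak_depth = max(peak_depth, jobs.qsize())
        doc_id = jobs.get()
        if doc_id is None:
            return seen, peak_depth
        seen.append(doc_id)


jobs = queue.Queue(maxsize=10)  # the bound is the backpressure
doc_ids = (f"doc-{i}" for i in range(1000))  # generated lazily, never a full list
producer = threading.Thread(target=produce, args=(doc_ids, jobs))
producer.start()
seen, peak_depth = consume(jobs)
producer.join()
```

However large the corpus, the queue never holds more than ten IDs at a time, so memory stays flat.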

The cost lever nobody mentions early enough

Once the pipeline is correct, there is one more consideration that changes the economics of bulk work: the provider’s Batch API.

For synchronous, real-time LLM calls, you pay the standard per-token rate. But for bulk work where results can wait until morning, Anthropic, OpenAI, and most major providers offer an asynchronous batch endpoint at roughly half the price. You submit a batch of requests, get back a job ID, and poll for completion. Results are typically available within a few hours, sometimes faster.

For a 50,000-document run, that discount is material. If your summaries average 1,000 output tokens each and the model costs $15 per million output tokens, you are looking at roughly $750 in output tokens for synchronous calls (input tokens are extra), or closer to $375 through the Batch API. Running the job overnight costs half as much and requires no changes to the queue architecture: workers simply call the batch endpoint instead of the synchronous one and track completion asynchronously.
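The arithmetic, spelled out so you can plug in your own numbers. It counts output tokens only, as the example above does, and treats the batch discount as a flat 50%; output_cost_usd is a hypothetical helper, and your provider's actual pricing may differ.

```python
def output_cost_usd(doc_count, output_tokens_per_doc, usd_per_million_tokens,
                    discount=0.0):
    """Output-token cost for a run; input tokens are not included."""
    total_tokens = doc_count * output_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens * (1.0 - discount)


sync_cost = output_cost_usd(50_000, 1_000, 15.0)                 # 750.0
batch_cost = output_cost_usd(50_000, 1_000, 15.0, discount=0.5)  # 375.0
```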

One more piece applies whichever endpoint you use: progress tracking. Track done_count and failed_count separately, persist them to a durable store, and implement resume logic: on restart, skip document IDs that already have results. Combined with idempotency, resume makes the pipeline resilient to any mid-run interruption, whether that is an infrastructure failure, a provider outage, or a deliberate pause to adjust worker count.
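A sketch of progress tracking and resume, assuming a small SQLite status table as the durable store. The schema and function names are hypothetical. Here the counts are derived from per-document status rows rather than kept as separate counters, which keeps them consistent with the results after a crash.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open the status store; pass a file path for durability across restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_status ("
        "doc_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    return conn


def mark(conn, doc_id, status):
    """Record 'done' or 'failed' for a document, overwriting any earlier status."""
    conn.execute(
        "INSERT OR REPLACE INTO doc_status (doc_id, status) VALUES (?, ?)",
        (doc_id, status),
    )
    conn.commit()


def counts(conn):
    """Return (done_count, failed_count)."""
    tally = dict(
        conn.execute("SELECT status, COUNT(*) FROM doc_status GROUP BY status")
    )
    return tally.get("done", 0), tally.get("failed", 0)


def remaining(conn, all_doc_ids):
    """On restart, skip every document that already has a result."""
    done = {
        row[0]
        for row in conn.execute(
            "SELECT doc_id FROM doc_status WHERE status = 'done'"
        )
    }
    return [d for d in all_doc_ids if d not in done]
```

On restart the producer enqueues remaining(conn, all_doc_ids) instead of the full corpus, and "how many docs do we have left?" becomes a single query.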

What the LLM call actually is

Here is the number that recalibrates how people think about this problem: the LLM call itself is roughly 10% of the engineering work. One line of code, one network call, one JSON response. It is the easiest part of the pipeline.

The other 90% — the queue, the worker pool, the idempotency logic, the retry configuration, the dead-letter queue, the backpressure mechanism, the progress tracking, the resume logic, the Batch API integration — is systems design. It is the same infrastructure that handles any high-volume, reliable background job, and it is the part that decides whether 50,000 documents actually finish or whether someone spends a weekend manually reconstructing state from partial outputs.

The “just summarize all our documents” ask is not an LLM problem. It is a distributed systems problem with an LLM inside it. Teams that treat it as the former rewrite the script three times. Teams that treat it as the latter build it once and run it again on the next 50,000 documents without incident.

The pipeline primitives here — producer, queue, worker pool, idempotency, DLQ, batch endpoint — are not novel. They are the same patterns covered in any serious treatment of queues and batch pipelines. The only thing that changes when LLMs enter the picture is the cost curve and the rate-limit math. The shape is the shape it has always been.
