How does asyncio differ from threading, and when would you choose one over the other?

asyncio is cooperative, single-threaded concurrency: coroutines yield control only at await points, so code between two awaits runs without interruption and most shared-state races disappear (state can still change across an await). Threads are preemptive OS-level concurrency: the interpreter can switch threads between bytecode instructions, so shared mutable state needs explicit locking, and the GIL lets only one thread run Python bytecode at a time. Choose asyncio for high-fan-out I/O (thousands of connections); choose threads when you need to call blocking APIs you cannot rewrite.

How do you reliably get structured outputs (JSON, typed objects) from an LLM?

Modern APIs offer constrained decoding: the model's token sampling is restricted to tokens that are valid continuations under a JSON schema. Combined with Pydantic validation in application code, this removes most of the JSON-parsing errors that plagued earlier prompt-only approaches (a response cut off at the token limit can still be incomplete). When constrained decoding is unavailable, few-shot examples plus output parsing with retry is the fallback.

What is the difference between CPU-bound and I/O-bound work, and how does the choice affect concurrency strategy in Python?

CPU-bound work keeps the processor busy the whole time — matrix multiplication, compression, parsing. I/O-bound work spends most of its time waiting for a slow external resource — network, disk, database. The distinction directly determines which concurrency primitive to reach for: multiprocessing for CPU-bound (bypasses the GIL), threading or asyncio for I/O-bound (GIL released during waits).

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.

Async vs sync: handling concurrency — Generative AI

The waiter analogy

Imagine a restaurant where the rule is: one waiter serves one table for the entire meal — taking the order, then standing beside the table staring at the kitchen until the food arrives, then delivering it. That waiter is blocked waiting. For most of the meal they’re doing nothing useful. A 50-table restaurant needs 50 waiters.

Now imagine a competent waiter: takes Table 1’s order, submits it to the kitchen, walks over to Table 2, takes their order, checks in with Table 3, comes back when Table 1’s food is ready. One waiter handles 20 tables because the kitchen prep time (the slow part) frees them up to do other things.

An LLM server works the same way. The “kitchen” is the model — generating tokens takes hundreds of milliseconds to many seconds. Your server is the waiter. Sync = one waiter per table. Async = one competent waiter for the floor.

Why LLM requests are I/O-bound

When your server handles a call to an LLM API, the timeline looks like this:

1. Send the HTTP request: microseconds.
2. Wait while the model generates tokens: typically 200 ms to several seconds.
3. Receive the response bytes: milliseconds.

Steps 1 and 3 are tiny. Step 2 is almost the entire request time, and during that wait your server is doing nothing. No CPU computation, no memory churn — just waiting on bytes from a remote host. This is the definition of I/O-bound: the bottleneck is input/output (network), not the CPU.
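One rough way to see this on your own machine is to compare wall-clock time with CPU time for a simulated call. The sketch below is illustrative only: fake_llm_call is a hypothetical stand-in that uses asyncio.sleep in place of a real network wait, and the 0.2-second delay is an assumed figure.

```python
import asyncio
import time

async def fake_llm_call(wait_s: float) -> str:
    # Hypothetical stand-in for an HTTP call to an LLM API:
    # asyncio.sleep models the time spent waiting on the network.
    await asyncio.sleep(wait_s)
    return "response"

wall_start = time.perf_counter()   # wall-clock time
cpu_start = time.process_time()    # CPU time consumed by this process

result = asyncio.run(fake_llm_call(0.2))

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"wall: {wall_elapsed:.3f}s  cpu: {cpu_elapsed:.3f}s")
```

If the call is I/O-bound, the CPU figure should be a small fraction of the wall-clock figure. For CPU-bound work the two numbers would be close to each other.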

Contrast this with CPU-bound work: computing embeddings locally, running a Torch model on-device, tokenizing a large batch. There, every millisecond of elapsed time is a millisecond of actual computation. The CPU never idles.

The async vs sync choice matters enormously for I/O-bound work. For CPU-bound work, async does nothing helpful — more on that later.

Sync: the blocking thread-per-request model

A classic synchronous web server assigns one OS thread to each incoming request. When that request calls the LLM and waits for a response, the thread blocks — it sits parked, consuming memory, until the bytes arrive. Then it finishes and returns the thread to the pool.

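The cost: each OS thread reserves a stack of roughly 1–8 MB (the default depends on the platform). 10,000 concurrent requests means 10,000 threads, which means roughly 10–80 GB of reserved stack address space, before you’ve written a single line of your application logic. Only the pages a thread actually touches become resident memory, but the reservation, the per-thread kernel bookkeeping, and the scheduler load all grow with every thread. On top of that, the OS must context-switch between those threads thousands of times per second, adding latency and CPU overhead for work that is, again, mostly just waiting.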
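The arithmetic behind that estimate is short enough to write out. This is only a back-of-the-envelope sketch: reserved_stack_gb is a hypothetical helper, and it uses decimal units (1 GB = 1000 MB) to match the rounded figures in the text.

```python
def reserved_stack_gb(threads: int, stack_mb: float) -> float:
    """Total stack reservation in GB for `threads` threads (1 GB = 1000 MB here)."""
    return threads * stack_mb / 1000

low = reserved_stack_gb(10_000, 1)    # 1 MB per thread
high = reserved_stack_gb(10_000, 8)   # 8 MB per thread

print(f"{low:.0f} to {high:.0f} GB of stack for 10,000 threads")
```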

In practice, servers built on a fixed pool of blocking workers (like Gunicorn with sync or gthread workers) are configured with a few dozen to a few hundred workers. Beyond that they either reject requests or exhaust memory. Hitting 10,000 concurrent LLM waits on a sync server means most requests queue and time out.
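The queueing effect is easy to reproduce at small scale. The following is a minimal sketch, not a real server: handle_request is a hypothetical handler whose time.sleep stands in for a blocking LLM call, and the pool of 4 workers and 12 requests are assumed numbers chosen to keep the run short.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(i: int) -> int:
    time.sleep(0.1)   # blocking wait, standing in for a sync LLM call
    return i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:   # the whole "server" has 4 workers
    results = list(pool.map(handle_request, range(12)))
elapsed = time.perf_counter() - start

# 12 requests over 4 workers run in 3 rounds of about 0.1 s each.
print(f"{len(results)} requests in {elapsed:.2f}s")
```

Every request beyond the fourth has to wait for a worker to free up, even though the workers themselves are doing nothing but sleeping.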

Each thread burns memory and an OS slot while blocked. 10,000 concurrent waits = ~10,000 threads = GBs of stack, then the server falls over.

Async: one event loop, thousands in flight

Python’s asyncio (and the event loop inside frameworks like FastAPI/Uvicorn) works differently. There is one thread running an event loop — a scheduler that continuously asks: “who is ready to run?” Requests are coroutines — functions that can pause themselves at a specific suspension point marked await.

When a coroutine hits await llm_client.post(...), it yields control back to the event loop. The loop picks up another coroutine that has data ready, runs it until its next await, and so on. When the network bytes for the first request finally arrive, the event loop resumes that coroutine right where it left off.

The key insight: a paused coroutine is almost free. It’s a few kilobytes of state on the heap — a saved stack frame and a reference to where execution should resume. Holding 10,000 in-flight awaits costs roughly tens of megabytes of heap, not gigabytes of OS thread stacks. No context-switch thrash, no kernel scheduler overhead.

The ceiling shifts from “how many OS threads can the kernel manage” to “how many connections does your upstream LLM provider allow, and how much memory for coroutine state” — both of which are far more generous.
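To get a feel for how cheap parked coroutines are, the sketch below starts 10,000 simulated calls on a single event loop. It is an illustration under stated assumptions: fake_llm_call is hypothetical and uses asyncio.sleep in place of a network wait, so it measures scheduling overhead only, not real connection costs.

```python
import asyncio
import threading
import time

async def fake_llm_call(i: int) -> int:
    await asyncio.sleep(0.1)   # hypothetical stand-in for waiting on the LLM
    return i

async def main(n: int):
    # All n coroutines are started, park at their await, and resume when ready.
    results = await asyncio.gather(*(fake_llm_call(i) for i in range(n)))
    return results, threading.active_count()

threads_before = threading.active_count()
start = time.perf_counter()
results, threads_during = asyncio.run(main(10_000))
elapsed = time.perf_counter() - start

print(f"{len(results)} calls in {elapsed:.2f}s, "
      f"threads before={threads_before}, during={threads_during}")
```

Run one after another, the same calls would take about 1,000 seconds. Here they overlap on one thread, and the thread count should not change while they run.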

The event loop runs active slices back-to-back. While any request awaits the LLM, its coroutine parks for free. The loop never idles unless every coroutine is awaiting simultaneously.

Concurrency vs parallelism

These terms are often confused. They mean different things.

Concurrency is dealing with many things by interleaving their progress, like a single waiter managing 20 tables. Progress happens on multiple tasks, but only one at a time is actively running. asyncio gives you concurrency on a single core.

Parallelism is doing many things at the same instant on multiple cores (or CPUs), like having 4 cooks each preparing a dish at once. True parallel execution needs multiple OS threads or processes.

For I/O-bound LLM work, concurrency is sufficient and cheap. The CPU isn’t the bottleneck; the network wait is. Async gives you thousands of concurrent in-flight requests on one core, which is exactly what you need.

For CPU-bound work (local embedding inference, tokenizing a 100k-token batch, running a Torch model on the host), concurrency does nothing useful — the CPU is fully occupied, so interleaving just switches between tasks that all need CPU time. You need parallelism: multiple processes via multiprocessing, a thread pool (for workloads that release the GIL), or a dedicated GPU inference service.
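The difference shows up clearly if you log when each task starts and ends. In this sketch, cpu_task and io_task are hypothetical examples: one does pure computation with no await, the other awaits a simulated network wait. Both pairs are launched the same way with asyncio.gather.

```python
import asyncio

async def cpu_task(name: str, log: list) -> None:
    log.append(f"{name}:start")
    sum(i * i for i in range(200_000))   # pure computation, no await inside
    log.append(f"{name}:end")

async def io_task(name: str, log: list) -> None:
    log.append(f"{name}:start")
    await asyncio.sleep(0.01)            # stand-in for a network wait
    log.append(f"{name}:end")

async def run_pair(task) -> list:
    log: list = []
    await asyncio.gather(task("a", log), task("b", log))
    return log

cpu_log = asyncio.run(run_pair(cpu_task))
io_log = asyncio.run(run_pair(io_task))

print("cpu-bound:", cpu_log)   # a runs to completion before b even starts
print("i/o-bound:", io_log)    # b starts while a is still waiting
```

The CPU-bound pair never interleaves because neither task reaches an await, so gather gives no speed-up. The I/O-bound pair overlaps its waits.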

The production pattern

Here is the shape of a production-grade async LLM endpoint:

import asyncio
import httpx
from fastapi import FastAPI

app = FastAPI()

# ONE shared async client for the whole process lifetime.
# Connection pooling is built-in — do not construct a new client per request.
_llm_client = httpx.AsyncClient(base_url="https://api.anthropic.com", timeout=120.0)

# A semaphore caps concurrent in-flight requests to the LLM.
# This is backpressure: refuse to pile up more awaits than you can serve.
_sem = asyncio.Semaphore(200)  # tune to your provider's rate limit; httpx's default pool caps at 100 connections, so raise limits= on the client to match

@app.post("/generate")
async def generate(payload: dict):
    async with _sem:                        # acquire a slot; wait (without blocking the loop) if 200 are already in flight
        resp = await _llm_client.post(      # await: yields the event loop while waiting
            "/v1/messages",
            headers={"x-api-key": "..."},
            json=payload,
        )
    return resp.json()

Four things to notice:

1. async def endpoint: FastAPI/Uvicorn runs this as a coroutine on the event loop. It frees the loop at every await, but only if everything slow inside it is actually awaited.
2. Shared httpx.AsyncClient: creates and reuses a connection pool. Never construct httpx.AsyncClient() inside the request handler; that opens a new TCP (and TLS) connection on every call and throws away the benefit of pooling.
3. await on the HTTP call: yields the event loop so other requests proceed during the wait.
4. asyncio.Semaphore: a backpressure valve. A semaphore is a counter that allows at most N coroutines to hold it simultaneously. Without this, under extreme load you pile up thousands of in-flight upstream calls, exhaust file descriptors, and run out of memory. With it, excess requests wait cheaply as suspended coroutines (not as OS threads) and are served in order.
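You can check that a semaphore really caps the number of calls in flight by counting them. This is a small self-contained sketch rather than the endpoint above: fake_llm_call and the stats dictionary are hypothetical, and the cap of 5 with 50 requests is chosen only to keep the run fast.

```python
import asyncio

async def fake_llm_call(sem: asyncio.Semaphore, stats: dict) -> None:
    async with sem:                      # wait here if the cap is reached
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)        # hypothetical stand-in for the LLM wait
        stats["in_flight"] -= 1
        stats["done"] += 1

async def main(total: int, cap: int) -> dict:
    sem = asyncio.Semaphore(cap)
    stats = {"in_flight": 0, "peak": 0, "done": 0}
    await asyncio.gather(*(fake_llm_call(sem, stats) for _ in range(total)))
    return stats

stats = asyncio.run(main(total=50, cap=5))
print(stats)
```

All 50 requests complete, but the recorded peak never rises above the cap. The other coroutines are suspended at the async with line until a slot frees up.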

Fan-out with `asyncio.gather`

When one user request needs multiple LLM calls (e.g., calling a summariser and a classifier concurrently), use asyncio.gather:

summary, label = await asyncio.gather(
    call_llm(summarise_prompt),
    call_llm(classify_prompt),
)

Both calls are in flight simultaneously. Total latency is max(t_summary, t_classify), not the sum. For two calls of similar duration that is close to a 50% wall-clock saving, and the saving shrinks as one call dominates the other.
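A runnable version of the same idea, with the LLM replaced by a timer, makes the latency claim easy to verify. Note the assumptions: this call_llm is a hypothetical stand-in that takes an extra wait_s argument to simulate how long each call takes, and the 0.4 s and 0.2 s durations are made up.

```python
import asyncio
import time

async def call_llm(prompt: str, wait_s: float) -> str:
    # Hypothetical stand-in: asyncio.sleep simulates the model's response time.
    await asyncio.sleep(wait_s)
    return f"result for {prompt!r}"

async def main():
    start = time.perf_counter()
    summary, label = await asyncio.gather(
        call_llm("summarise", 0.4),
        call_llm("classify", 0.2),
    )
    return summary, label, time.perf_counter() - start

summary, label, elapsed = asyncio.run(main())
print(f"both results in {elapsed:.2f}s")   # close to the slower call, not the 0.6 s sum
```

gather returns results in the order the awaitables were passed, regardless of which one finishes first, so unpacking into summary and label is safe.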

The blocking-call trap

The single most common async mistake:

import anthropic

# WRONG: blocks the whole event loop for the duration of the call
@app.post("/bad")
async def bad(payload: dict):
    client = anthropic.Anthropic()          # sync client
    resp = client.messages.create(...)      # blocks: no await, loop frozen
    return resp

# RIGHT: use the async client, created once and shared across requests
_async_client = anthropic.AsyncAnthropic()

@app.post("/good")
async def good(payload: dict):
    resp = await _async_client.messages.create(...)   # await: the loop stays free
    return resp

If you must call a blocking library from an async handler, offload it:

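import asyncio
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=10)

async def call_blocking_lib(args):
    loop = asyncio.get_running_loop()   # the loop that is running this coroutine
    # blocking_fn runs on a pool thread; the event loop keeps serving other coroutines
    return await loop.run_in_executor(_pool, blocking_fn, args)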

The coroutine parks at await, freeing the event loop to serve other requests. When the LLM responds, the loop resumes it. The semaphore caps how many calls are in flight at once; coroutines beyond that cap wait their turn.
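To see the trap and the fix side by side, you can run a background "heartbeat" task and count how often it ticks while a blocking call is in progress. This is a hedged sketch: blocking_fn, heartbeat, and measure are hypothetical names, time.sleep stands in for a sync library call, and asyncio.to_thread (Python 3.9+) is used as a shorthand for run_in_executor with the default executor.

```python
import asyncio
import time

def blocking_fn(seconds: float) -> str:
    time.sleep(seconds)   # stand-in for a sync SDK or driver call
    return "done"

async def heartbeat(ticks: list) -> None:
    while True:
        await asyncio.sleep(0.01)
        ticks.append(time.perf_counter())   # proof that the loop is alive

async def measure(call):
    ticks: list = []
    hb = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)                  # let the heartbeat start
    start = time.perf_counter()
    result = await call()
    end = time.perf_counter()
    hb.cancel()
    return result, len([t for t in ticks if start <= t <= end])

async def direct():
    return blocking_fn(0.2)                 # runs on the event loop thread

async def offloaded():
    return await asyncio.to_thread(blocking_fn, 0.2)   # runs on a worker thread

async def main():
    return await measure(direct), await measure(offloaded)

(direct_result, direct_ticks), (off_result, off_ticks) = asyncio.run(main())
print(f"direct: {direct_ticks} ticks, offloaded: {off_ticks} ticks")
```

With the direct call the heartbeat records no ticks at all for the full 0.2 seconds, which is what every other request on the server would experience. With the offloaded call the heartbeat keeps ticking.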

Backpressure and what comes next

A semaphore is the simplest form of backpressure — pushing back on upstream demand so you don’t overwhelm a downstream resource. Without it, a traffic spike means thousands of coroutines pile up in memory, all hammering the LLM API’s rate limit simultaneously, and you get a cascade of 429s and OOM kills.

The semaphore makes excess requests queue in Python, consuming almost no resources, and drains them in order as capacity frees up. This is the pattern that separates a prototype from a production service.
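Queueing is one policy; another is to turn requests away once every slot is taken, so that callers get a fast failure they can retry. The sketch below shows that variation and is not part of the pattern above: handle and outcomes are hypothetical, and the check relies on asyncio.Semaphore.locked(), which reports whether the semaphore can be acquired immediately.

```python
import asyncio

async def handle(sem: asyncio.Semaphore, i: int, outcomes: dict) -> None:
    if sem.locked():                 # no free slot right now
        outcomes[i] = "rejected"     # e.g. answer 429 or 503 and let the caller retry
        return
    async with sem:
        await asyncio.sleep(0.01)    # hypothetical stand-in for the LLM call
        outcomes[i] = "served"

async def main() -> dict:
    sem = asyncio.Semaphore(2)
    outcomes: dict = {}
    await asyncio.gather(*(handle(sem, i, outcomes) for i in range(5)))
    return outcomes

outcomes = asyncio.run(main())
print(outcomes)
```

With two slots and five simultaneous requests, the first two are served and the rest are rejected immediately. Which policy fits depends on whether your callers prefer to wait or to fail fast.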

From here, the natural next steps are:

A request queue (Redis, SQS) to absorb burst and retry failures — separates receiving from processing.
Rate limiting per tenant so one customer can’t starve others.
Load balancing across multiple async workers (Uvicorn processes) to use all CPU cores — each process has its own event loop; a reverse proxy routes across them.

Quick check

Q1. Why does an LLM request benefit from async but a local matrix multiplication does not?

Q2. What happens if you call a synchronous (blocking) HTTP client inside an async FastAPI endpoint?

Q3. A new endpoint does heavy local CPU work: it runs a 500 ms embedding model on-device for every request. Will moving it to an async def endpoint improve throughput under load?
