Rate limiting & denial-of-wallet
How to stop one client from running up a five-figure token bill on your inference API — the attack that targets your wallet, not your uptime.
What you'll learn
- What a denial-of-wallet attack is and why it differs from DoS
- Why you must limit TOKENS and dollars, not just requests
- The token-bucket algorithm and how its refill math works
- Fixed-window's boundary-burst flaw vs sliding window
- Layered defense: rate limit + token budget + max_tokens cap + cost alerts
Before you start
You built a product on a hosted LLM API. Every token your users send — and every token the model generates — appears on your bill. A classic denial-of-service attack tries to take your service down. A denial-of-wallet attack doesn’t bother. It just keeps sending expensive requests until your cloud bill hits five figures and your credit card declines. The service stays up. Your company goes down.
This lesson is the systems-design blueprint for stopping it.
What makes inference APIs uniquely vulnerable
With a traditional API, every request is roughly the same cost. You can reason: “100 requests per minute is safe.” With an LLM API, one request can cost 100x another. A 200-token chat message and a 128 000-token document summarization are both “one request.” The document costs two or three orders of magnitude more.
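To make the gap concrete, here is the arithmetic for the two requests above. The per-token price is a made-up placeholder, not any provider's real rate; only the ratio matters.

```python
# Hypothetical price, for illustration only; check your provider's price sheet.
PRICE_PER_1K_INPUT_TOKENS_USD = 0.003


def input_cost(tokens: int) -> float:
    """Dollar cost of the input side of one request at the assumed price."""
    return tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS_USD


chat_cost = input_cost(200)          # short chat message
document_cost = input_cost(128_000)  # long document summarization
ratio = document_cost / chat_cost

print(f"chat: ${chat_cost:.4f}  document: ${document_cost:.4f}  ratio: {ratio:.0f}x")
```

Both calls count as one request against an RPM limit, yet the input cost differs by a factor of 640 regardless of the price you plug in.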
This breaks every naive rate-limiting strategy built around request counts.
The three dimensions you must limit
Rate limiting an LLM API requires enforcement on at least three axes, each keyed per API key (or per authenticated user):
| Dimension | What it limits | Why it matters |
|---|---|---|
| RPM — requests per minute | Request count | Stops trivial hammering; baseline hygiene |
| TPM — tokens per minute | Input + output tokens combined | Directly proportional to cost; the key limit |
| Daily $ budget | Estimated spend | Hard financial circuit-breaker |
Most cloud LLM providers expose all three. Your own gateway must enforce all three too, because a single provider-side limit is not enough when you’re aggregating across multiple users or models.
The token-bucket algorithm
The most widely used rate-limiting algorithm is the token bucket. Despite the name, the “tokens” here are capacity units — nothing to do with LLM tokens. The intuition is deliberately mechanical:
- Imagine a bucket that holds up to C units (the capacity or burst limit).
- The bucket refills at a constant rate of r units per second.
- Every incoming request consumes some units from the bucket. A request of weight w is allowed only if the bucket currently holds at least w units; after allowing it, deduct w. If the bucket is empty (or below w), reject with HTTP 429 and a Retry-After header.
The refill math at any moment:
bucket = min(C, current + r × t)
where t is the number of seconds since the last check. Because you only ever compute the bucket on a request, you don’t need a background timer — you just compute how much would have accumulated.
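Concrete example. C = 60, r = 1/s. A client that drained its bucket to zero and then sat idle for 30 seconds arrives with 30 units in the bucket (min(60, 0 + 1×30)), so it can burst 30 requests immediately. After 60 seconds of idleness the bucket is full and the burst ceiling is 60. Either way, once the bucket is empty the client is held to a sustained 1 req/s. A client hammering from the start with a full bucket burns 60 units instantly, then gets 1 req/s after that.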
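The refill rule and the admit-or-reject step can be sketched in a few lines. This is a minimal single-process sketch, not a production limiter: the class name, field names, and the choice to pass the clock in as `now` (so the example is deterministic) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float      # C: maximum units the bucket can hold
    refill_rate: float   # r: units added per second
    level: float         # units currently in the bucket
    last_checked: float  # timestamp (seconds) of the previous check

    def allow(self, weight: float, now: float) -> bool:
        """Refill lazily, then admit the request if enough units remain."""
        elapsed = max(0.0, now - self.last_checked)
        # bucket = min(C, current + r * t)
        self.level = min(self.capacity, self.level + self.refill_rate * elapsed)
        self.last_checked = now
        if self.level >= weight:
            self.level -= weight
            return True
        return False  # caller should answer with HTTP 429


# C = 60, r = 1/s, bucket empty at t = 0, client idle until t = 30.
bucket = TokenBucket(capacity=60, refill_rate=1.0, level=0.0, last_checked=0.0)
allowed = sum(bucket.allow(weight=1, now=30.0) for _ in range(40))
print(allowed)  # 30 of the 40 burst requests are admitted
```

For TPM limiting the same class works unchanged: set `weight` to the request's token count instead of 1.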
Fixed-window vs sliding-window counters
When people first implement rate limiting without a token bucket, they often reach for a fixed-window counter: increment a counter keyed to the current minute (e.g. ratelimit:user123:2026-05-29T14:07) and reject if the counter exceeds the limit.
It’s simple and fast. It also has a critical flaw.
The boundary-burst problem
A fixed-window limit of 100 requests/minute does not guarantee at most 100 requests in every 60-second span. A client can:
- Send 100 requests at 14:07:58 (end of window 14:07 — all pass).
- Immediately send 100 more requests at 14:08:01 (start of window 14:08 — all pass).
Result: 200 requests in 3 seconds. The attacker effectively doubles your limit at every window boundary.
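The flaw is easy to reproduce. Below is a minimal in-memory sketch of a fixed-window counter, fed the two bursts on either side of the 14:08:00 boundary. The class name and the use of seconds-since-midnight as the clock are assumptions for illustration.

```python
from collections import defaultdict


class FixedWindowCounter:
    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key: str, now: float) -> bool:
        window = int(now // self.window_seconds)  # which minute we are in
        if self.counts[(key, window)] >= self.limit:
            return False
        self.counts[(key, window)] += 1
        return True


limiter = FixedWindowCounter(limit=100)
t1 = 14 * 3600 + 7 * 60 + 58  # 14:07:58, end of window 14:07
t2 = 14 * 3600 + 8 * 60 + 1   # 14:08:01, start of window 14:08

passed = sum(limiter.allow("user123", t1) for _ in range(100))
passed += sum(limiter.allow("user123", t2) for _ in range(100))
print(passed)  # every request in both bursts is admitted
```

Each burst lands in a different counter, so neither one ever sees more than 100.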
A sliding-window approach removes this artifact by keeping the count continuously up to date. It comes in two forms: a log-based window (store a timestamp for each request, count those within the last 60 s), which is exact, and a weighted approximation (blend the current-window count with a fraction of the previous-window count), which needs far less memory but is only approximate.
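Both sliding-window variants can be sketched briefly. This is an illustrative single-key, in-memory version; the names `SlidingWindowLog` and `weighted_estimate` are hypothetical, and the weighted formula assumes the previous window's requests were spread evenly across it.

```python
from collections import deque


class SlidingWindowLog:
    """Exact variant: keep one timestamp per admitted request."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def allow(self, now: float) -> bool:
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()  # drop requests older than the window
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


def weighted_estimate(prev_count: int, curr_count: int, now: float,
                      window_seconds: float = 60.0) -> float:
    """Approximate variant: blend the previous and current fixed windows."""
    elapsed_fraction = (now % window_seconds) / window_seconds
    return prev_count * (1.0 - elapsed_fraction) + curr_count


t1 = 14 * 3600 + 7 * 60 + 58  # 14:07:58
t2 = 14 * 3600 + 8 * 60 + 1   # 14:08:01

log_limiter = SlidingWindowLog(limit=100)
passed = sum(log_limiter.allow(t1) for _ in range(100))
passed += sum(log_limiter.allow(t2) for _ in range(100))
print(passed)  # the second burst is rejected in full

estimate = weighted_estimate(prev_count=100, curr_count=0, now=t2)
print(round(estimate, 2))  # previous window still counts almost fully
```

The log costs one stored timestamp per request, which adds up at high limits. The weighted form stores two integers per key and, one second into the new window, still attributes roughly 98 of the previous 100 requests to the client.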
Leaky bucket — the third variant
The leaky bucket is the inverse of the token bucket: instead of accumulating capacity that requests drain, incoming requests fill a queue and are processed at a constant output rate. This smooths bursty input into a steady stream. It’s useful for outbound call shaping (e.g. your service calling a downstream API at a capped rate) but less common for admission control, where the token bucket’s burst allowance is usually preferable.
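For outbound shaping, a leaky bucket can be modeled as a schedule: each request is assigned the earliest send time that keeps the output rate constant, and requests are refused once the queue is full. This is a minimal sketch under that assumption; the class and method names are hypothetical.

```python
from typing import Optional


class LeakyBucketShaper:
    def __init__(self, rate_per_second: float, max_queue: int):
        self.interval = 1.0 / rate_per_second  # spacing between sends
        self.max_queue = max_queue             # how many requests may wait
        self.next_free = 0.0                   # earliest time the next send may happen

    def schedule(self, now: float) -> Optional[float]:
        """Return the time this request may be sent, or None if the queue is full."""
        start = max(now, self.next_free)
        waiting_ahead = (start - now) / self.interval
        if waiting_ahead >= self.max_queue:
            return None  # queue overflow: drop or reject
        self.next_free = start + self.interval
        return start


# Six requests arrive at the same instant; output is capped at 2 per second.
shaper = LeakyBucketShaper(rate_per_second=2.0, max_queue=5)
times = [shaper.schedule(now=0.0) for _ in range(6)]
print(times)
```

The burst is spread out at half-second intervals, and the sixth request is refused because five are already queued. A token bucket with capacity 5 would instead have let all five through at once.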
Layered defense
No single control is enough. A full defense applies all of the following:
- RPM limit at the gateway. Stops the most trivial hammering and protects your infrastructure from connection exhaustion. Keyed per API key.
- TPM / token budget at the gateway. The core anti-DoW control. Sum input + output tokens per request as they flow through; reject when the per-minute token budget is exceeded.
- max_tokens cap per request. Pass a hard max_tokens parameter to the model on every call. This is your per-request firewall: it prevents any single request from generating a runaway multi-thousand-token response, even if the token budget hasn’t been hit yet.
- Daily dollar cap + alerting. Set hard monthly/daily spend limits in your cloud provider dashboard. Wire cost alerts at 50%, 80%, and 100% of budget. This is your financial circuit-breaker: if everything else fails, the bill stops growing and you get paged.
- Authentication. All of the above is useless if limits aren’t per-identity. Require an API key on every request; enforce limits keyed to that identity. Anonymous endpoints are rate-limited by IP, which is trivially bypassed.
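The layers above compose into a single admission check. The sketch below is one possible ordering, with hypothetical names and default limits throughout. It uses plain per-minute counters to stay short, and it reserves the worst case (input tokens plus the clamped max_tokens) against the TPM budget, which is a design assumption rather than the only option.

```python
from dataclasses import dataclass


@dataclass
class Limits:  # all values are illustrative defaults
    rpm: int = 60
    tpm: int = 100_000
    daily_budget_usd: float = 50.0
    max_tokens_cap: int = 1024


@dataclass
class Usage:  # per-API-key counters, reset elsewhere each minute / day
    requests_this_minute: int = 0
    tokens_this_minute: int = 0
    spend_today_usd: float = 0.0


def admit(api_key: str, input_tokens: int, requested_max_tokens: int,
          est_cost_usd: float, usage: Usage, limits: Limits):
    """Return (http_status, reason, max_tokens_to_forward)."""
    if not api_key:                                           # authentication
        return 401, "missing API key", 0
    if usage.requests_this_minute + 1 > limits.rpm:           # RPM limit
        return 429, "RPM limit", 0
    max_tokens = min(requested_max_tokens, limits.max_tokens_cap)  # per-request cap
    worst_case = input_tokens + max_tokens
    if usage.tokens_this_minute + worst_case > limits.tpm:    # token budget
        return 429, "TPM limit", 0
    if usage.spend_today_usd + est_cost_usd > limits.daily_budget_usd:  # dollar cap
        return 429, "daily budget exhausted", 0
    return 200, "ok", max_tokens


status, reason, max_tokens = admit("key-abc", 200, 4096, 0.01, Usage(), Limits())
print(status, reason, max_tokens)  # the 4096 request is clamped to the cap
```

Note that the caller asked for 4096 output tokens and the gateway forwards 1024: the cap is applied even when every budget has room.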
Where and how to enforce
At the gateway, not the application. Your rate-limiting middleware should sit in front of your application logic, ideally in a dedicated reverse proxy or API gateway layer (nginx, Envoy, Kong, a custom FastAPI middleware). This ensures limits apply regardless of which backend instance handles the request.
Distributed counters in Redis. For multi-instance deployments, per-process in-memory counters won’t work — each instance sees only a slice of traffic. Use Redis with atomic increment (INCR / INCRBY) and a TTL for the expiry window. A sliding log uses ZADD + ZCOUNT on a sorted set of timestamps.
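A shared per-minute token counter built on INCRBY plus a TTL might look like the sketch below. The function only needs a client object with `incrby(name, amount)` and `expire(name, time)` methods, which redis-py's `Redis` client provides. So the example runs without a server, it is exercised here against a hypothetical in-memory stand-in. The key format and function name are assumptions.

```python
class InMemoryCounterStore:
    """Hypothetical stand-in exposing the two client calls used below."""

    def __init__(self):
        self.values = {}
        self.ttls = {}

    def incrby(self, name, amount=1):
        self.values[name] = self.values.get(name, 0) + amount
        return self.values[name]

    def expire(self, name, time):
        self.ttls[name] = time
        return True


def charge_tokens(client, api_key: str, tokens: int, limit: int,
                  now: float, window_seconds: int = 60) -> bool:
    """Add `tokens` to this key's counter for the current window; True if within limit."""
    window = int(now // window_seconds)
    key = f"tpm:{api_key}:{window}"
    total = client.incrby(key, tokens)  # atomic on a real Redis server
    if total == tokens:
        # First write in this window: let the key expire on its own later.
        client.expire(key, window_seconds * 2)
    return total <= limit


store = InMemoryCounterStore()
now = 1_000_000.0
first = charge_tokens(store, "user123", 60_000, limit=100_000, now=now)
second = charge_tokens(store, "user123", 60_000, limit=100_000, now=now)
next_window = charge_tokens(store, "user123", 60_000, limit=100_000, now=now + 60)
print(first, second, next_window)
```

Two caveats. This is a fixed-window counter, so it inherits the boundary-burst problem described earlier. And INCRBY followed by EXPIRE is two commands, not one atomic step; a crash between them leaves a key without a TTL, which a transaction or server-side script would avoid.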
Return 429 with Retry-After. A bare 429 is sufficient for protection, but adding Retry-After: 60 lets well-behaved clients back off gracefully and retry rather than hammering indefinitely or giving up.
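Rather than hard-coding 60, a token-bucket limiter can compute how long the client actually needs to wait: the shortfall divided by the refill rate. A minimal sketch, with hypothetical function names; Retry-After is expressed here in whole seconds.

```python
import math


def retry_after_seconds(level: float, weight: float, refill_rate: float) -> int:
    """Seconds until the bucket holds `weight` units, rounded up, at least 1."""
    deficit = max(0.0, weight - level)
    return max(1, math.ceil(deficit / refill_rate))


def too_many_requests(level: float, weight: float, refill_rate: float):
    """Build a (status, headers) pair for a rejected request."""
    wait = retry_after_seconds(level, weight, refill_rate)
    return 429, {"Retry-After": str(wait)}


# Bucket holds 10 units, the request needs 500, refill is 100 units/s.
status, headers = too_many_requests(level=10, weight=500, refill_rate=100.0)
print(status, headers)
```

A client that honors the header retries once, at the moment the request can succeed, instead of polling.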
Estimate tokens before the call. For TPM limiting, you need to count input tokens before sending the request (use a library like tiktoken for OpenAI models or the provider’s counting endpoint). Output tokens are only known after the fact — account for them by tracking a rolling spend and including the previous request’s output in the next check.
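The estimate-then-reconcile flow can be sketched as a small ledger. The four-characters-per-token heuristic is a rough stand-in assumed for illustration; in practice swap in a real tokenizer or the provider's counting endpoint. The class name is hypothetical, and the per-minute reset is omitted for brevity.

```python
def estimate_input_tokens(text: str) -> int:
    """Crude placeholder estimate: roughly 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


class MinuteTokenLedger:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0  # tokens charged so far in the current minute

    def try_charge_input(self, prompt: str) -> bool:
        """Check and charge the estimated input tokens before calling the model."""
        estimate = estimate_input_tokens(prompt)
        if self.used + estimate > self.limit:
            return False
        self.used += estimate
        return True

    def record_output(self, output_tokens: int) -> None:
        """Charge the actual output tokens once the response is known."""
        self.used += output_tokens


ledger = MinuteTokenLedger(limit=1000)
first_ok = ledger.try_charge_input("x" * 2000)   # ~500 input tokens
ledger.record_output(450)                         # model returned 450 tokens
second_ok = ledger.try_charge_input("x" * 400)   # ~100 more would exceed 1000
print(first_ok, second_ok, ledger.used)
```

The second request is rejected because the first request's output has already been counted, even though its own input estimate is small.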
Summary
Denial-of-wallet is the LLM-specific threat that makes your service unprofitable rather than unavailable. The defense is a set of layers, all enforced per identity at the gateway:
- Token bucket for RPM/TPM enforcement: bursts allowed, sustained rate controlled.
- Sliding window over fixed-window: no boundary-burst artifacts.
- max_tokens on every request: per-call runaway prevention.
- Daily $ cap + alerts: the financial circuit-breaker.
- Auth everywhere: limits are meaningless without identity.
Each layer is cheap to add and cheap to operate. The cost of skipping any one of them is a bill you discover at midnight.