datarekha

Rate limiting & denial-of-wallet

How to stop one client from running up a five-figure token bill on your inference API — the attack that targets your wallet, not your uptime.

9 min read Advanced Generative AI Lesson 21 of 24

What you'll learn

  • What a denial-of-wallet attack is and why it differs from DoS
  • Why you must limit TOKENS and dollars, not just requests
  • The token-bucket algorithm and how its refill math works
  • Fixed-window's boundary-burst flaw vs sliding window
  • Layered defense: rate limit + token budget + max_tokens cap + cost alerts

Before you start

You built a product on a hosted LLM API. Every token your users send — and every token the model generates — appears on your bill. A classic denial-of-service attack tries to take your service down. A denial-of-wallet attack doesn’t bother. It just keeps sending expensive requests until your cloud bill hits five figures and your credit card declines. The service stays up. Your company goes down.

This lesson is the systems-design blueprint for stopping it.


What makes inference APIs uniquely vulnerable

With a traditional API, every request costs roughly the same, so you can reason: “100 requests per minute is safe.” With an LLM API, one request can cost hundreds of times more than another. A 200-token chat message and a 128 000-token document summarization are both “one request,” yet the document carries 640 times as many input tokens, so it costs two to three orders of magnitude more.

This breaks every naive rate-limiting strategy built around request counts.
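
To make the spread concrete, here is a small sketch of the arithmetic. The per-token prices and the output lengths are placeholders chosen for illustration, not any provider's actual rates; substitute your own numbers.

```python
# Hypothetical prices in USD per 1,000 tokens (placeholders, not real rates).
PRICE_IN_PER_1K = 0.003
PRICE_OUT_PER_1K = 0.015


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one request under the placeholder prices."""
    return (input_tokens / 1000) * PRICE_IN_PER_1K + (output_tokens / 1000) * PRICE_OUT_PER_1K


# Assumed output lengths: 200 tokens for the chat reply, 4,000 for the summary.
chat = request_cost(input_tokens=200, output_tokens=200)
summary = request_cost(input_tokens=128_000, output_tokens=4_000)

print(f"chat message:     ${chat:.4f}")
print(f"document summary: ${summary:.4f}")
print(f"ratio:            {summary / chat:.0f}x")
```

Both calls count as one request against an RPM limit, which is why a request count alone says little about spend.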


The three dimensions you must limit

Rate limiting an LLM API requires enforcement on at least three axes, each keyed per API key (or per authenticated user):

| Dimension | What it limits | Why it matters |
|---|---|---|
| RPM (requests per minute) | Request count | Stops trivial hammering; baseline hygiene |
| TPM (tokens per minute) | Input + output tokens combined | Directly proportional to cost; the key limit |
| Daily $ budget | Estimated spend | Hard financial circuit-breaker |

Most cloud LLM providers expose all three. Your own gateway must enforce all three too, because a single provider-side limit is not enough when you’re aggregating across multiple users or models.


The token-bucket algorithm

The most widely used rate-limiting algorithm is the token bucket. Despite the name, the “tokens” here are capacity units — nothing to do with LLM tokens. The intuition is deliberately mechanical:

  • Imagine a bucket that holds up to C units (the capacity or burst limit).
  • The bucket refills at a constant rate of r units per second.
  • Every incoming request consumes some units from the bucket. A request of weight w is allowed only if the bucket currently holds at least w units; after allowing it, deduct w. If the bucket is empty (or below w), reject with HTTP 429 and a Retry-After header.

The refill math at any moment:

bucket = min(C, current + r × t)

where t is the number of seconds since the last check. Because you only ever compute the bucket on a request, you don’t need a background timer — you just compute how much would have accumulated.

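Concrete example. C = 60, r = 1/s, and every request has weight 1. A client that emptied its bucket and then sat idle for 30 seconds returns to 30 units (min(60, 0 + 1×30)), so it can burst 30 requests at once; after 60 idle seconds the bucket is full and the burst ceiling is 60. Either way, once the bucket is drained the client is held to a sustained 1 req/s. A client hammering from the start with a full bucket burns 60 units instantly, then gets exactly 1 req/s after that.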
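
The refill rule can be sketched in a few lines. This is a minimal single-process illustration, not a production limiter: the class name, the `initial` parameter, and the injectable `clock` are assumptions added here so the example is deterministic.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Token bucket with lazy refill: level = min(C, level + r * t)."""

    def __init__(self, capacity: float, refill_rate: float,
                 initial: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity          # C, the burst limit
        self.refill_rate = refill_rate    # r, units per second
        self.level = capacity if initial is None else initial
        self._clock = clock
        self._last = clock()

    def _refill(self) -> None:
        # No background timer: compute what would have accumulated since the last check.
        now = self._clock()
        self.level = min(self.capacity, self.level + self.refill_rate * (now - self._last))
        self._last = now

    def allow(self, weight: float = 1.0) -> bool:
        """Deduct `weight` and return True if the bucket holds enough, else False."""
        self._refill()
        if self.level >= weight:
            self.level -= weight
            return True
        return False


# Fake clock: start with an empty bucket, then idle for 30 seconds.
now = [0.0]
bucket = TokenBucket(capacity=60, refill_rate=1.0, initial=0.0, clock=lambda: now[0])
now[0] = 30.0
allowed = sum(bucket.allow() for _ in range(100))
print(allowed)  # 30
```

A caller would map a `False` result to HTTP 429. For TPM enforcement the same class works with `weight` set to the request's token count.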

Figure: Token bucket. The bucket refills at r units/s up to capacity C. A request of weight w is allowed, and w is deducted, when the bucket holds enough; otherwise it is rejected with HTTP 429 and a Retry-After header.

Fixed-window vs sliding-window counters

When people first implement rate limiting without a token bucket, they often reach for a fixed-window counter: increment a counter keyed to the current minute (e.g. ratelimit:user123:2026-05-29T14:07) and reject if the counter exceeds the limit.

It’s simple and fast. It also has a critical flaw.

The boundary-burst problem

A fixed-window limit of 100 requests/minute does not guarantee at most 100 requests in any 60-second span. A client can:

  1. Send 100 requests at 14:07:58 (end of window 14:07 — all pass).
  2. Immediately send 100 more requests at 14:08:01 (start of window 14:08 — all pass).

Result: 200 requests in about 3 seconds. The attacker can pass up to double your limit across every window boundary.
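
The flaw is easy to reproduce. The sketch below is a simplified in-memory counter for a single client (a real gateway would key it per user, as in the Redis key above); timestamps are seconds since midnight, and the function name is illustrative.

```python
from collections import defaultdict

LIMIT = 100  # requests allowed per fixed one-minute window

counters = defaultdict(int)  # window index -> request count


def allow_fixed_window(ts: float) -> bool:
    """Count the request against the wall-clock minute that contains `ts`."""
    window = int(ts // 60)
    if counters[window] >= LIMIT:
        return False
    counters[window] += 1
    return True


t1 = 14 * 3600 + 7 * 60 + 58  # 14:07:58, end of window 14:07
t2 = 14 * 3600 + 8 * 60 + 1   # 14:08:01, start of window 14:08

passed = sum(allow_fixed_window(t1) for _ in range(100))
passed += sum(allow_fixed_window(t2) for _ in range(100))
print(passed)  # 200
```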

A sliding-window approach removes this artifact by keeping the count continuously up to date. There are two common forms: a log-based window (store a timestamp for each request, count those within the last 60 s), which is exact, and a weighted approximation (blend the current-window count with a fraction of the previous-window count), which trades a small error for constant memory.
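
Both variants can be sketched briefly. The class and function names are illustrative, and the weighted estimate assumes the previous window's requests were spread evenly across that window, which is why it is only an approximation.

```python
from collections import deque


class SlidingWindowLog:
    """Exact sliding window: keep one timestamp per admitted request."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._stamps = deque()

    def allow(self, ts: float) -> bool:
        cutoff = ts - self.window_s
        # Drop timestamps that have fallen out of the rolling window.
        while self._stamps and self._stamps[0] <= cutoff:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False
        self._stamps.append(ts)
        return True


def weighted_estimate(prev_count: int, curr_count: int,
                      elapsed_in_window: float, window_s: float = 60.0) -> float:
    """Approximate rolling count: current window plus the overlapping share of the previous one."""
    overlap = (window_s - elapsed_in_window) / window_s
    return curr_count + prev_count * overlap


t1 = 14 * 3600 + 7 * 60 + 58  # 14:07:58
t2 = 14 * 3600 + 8 * 60 + 1   # 14:08:01

log = SlidingWindowLog(limit=100)
passed = sum(log.allow(t1) for _ in range(100))
passed += sum(log.allow(t2) for _ in range(100))
print(passed)  # 100

# One second into window 14:08, with 100 requests counted in window 14:07.
estimate = weighted_estimate(prev_count=100, curr_count=0, elapsed_in_window=1.0)
print(round(estimate, 1))  # 98.3
```

The log costs memory proportional to the limit per client; the weighted form needs only two counters per client.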

Figure: Fixed window vs sliding window at a limit of 100/min. Fixed-window counters reset at wall-clock boundaries, so a burst at the end of window 14:07 and another at the start of window 14:08 both pass: 200 requests in about 3 s, 2× the intended limit. A sliding window tracks the rolling 60 s span, so any 60 s span sees at most 100 requests.

Leaky bucket — the third variant

The leaky bucket is the inverse of the token bucket: instead of accumulating capacity that requests drain, incoming requests fill a queue and are processed at a constant output rate. This smooths bursty input into a steady stream. It’s useful for outbound call shaping (e.g. your service calling a downstream API at a capped rate) but less common for admission control, where the token bucket’s burst allowance is usually preferable.
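
As a rough sketch of the shaping behaviour, the class below assigns each arriving item a release time at a constant output rate. Bounding the queue by a maximum wait (`max_wait_s`) is an assumption made here for simplicity; a real shaper might bound queue length instead.

```python
from typing import Optional


class LeakyBucketShaper:
    """Queue-style leaky bucket: releases at most one item every 1/rate seconds."""

    def __init__(self, rate_per_s: float, max_wait_s: float) -> None:
        self.interval = 1.0 / rate_per_s
        self.max_wait_s = max_wait_s
        self._next_free = 0.0  # earliest time the next item may be released

    def schedule(self, arrival: float) -> Optional[float]:
        """Return the release time for an item arriving at `arrival`, or None if it would wait too long."""
        release = max(arrival, self._next_free)
        if release - arrival > self.max_wait_s:
            return None  # queue is effectively full: drop or reject
        self._next_free = release + self.interval
        return release


# Five calls arrive at once; output is capped at 2 per second, max wait 1 s.
shaper = LeakyBucketShaper(rate_per_s=2.0, max_wait_s=1.0)
releases = [shaper.schedule(0.0) for _ in range(5)]
print(releases)  # [0.0, 0.5, 1.0, None, None]
```

Unlike the token bucket, nothing here is ever released in a burst: the three admitted calls leave half a second apart.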


Layered defense

No single control is enough. A full defense applies all of the following:

  1. RPM limit at the gateway. Stops the most trivial hammering and protects your infrastructure from connection exhaustion. Keyed per API key.
  2. TPM / token budget at the gateway. The core anti-DoW control. Sum input + output tokens per request as they flow through; reject when the per-minute token budget is exceeded.
  3. max_tokens cap per request. Pass a hard max_tokens parameter to the model on every call. This is your per-request firewall — it prevents any single request from generating a runaway multi-thousand-token response, even if the token budget hasn’t been hit yet.
  4. Daily dollar cap + alerting. Set hard monthly/daily spend limits in your cloud provider dashboard. Wire cost alerts at 50%, 80%, and 100% of budget. This is your financial circuit-breaker — if everything else fails, the bill stops growing and you get paged.
  5. Authentication. All of the above is useless if limits aren’t per-identity. Require an API key on every request; enforce limits keyed to that identity. Anonymous endpoints are rate-limited by IP, which is trivially bypassed.
Figure: Layered defense. A request must pass the RPM limit, the TPM/token budget, the per-request max_tokens cap, and the daily spend cap before reaching the model. Each gate is keyed per API key. A denial-of-wallet attacker is stopped at the token-budget gate, before tokens (= dollars) are consumed.
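
The gates can be sketched as one admission function. Everything here is illustrative: the dataclass names, the default limits, and the assumption that the caller has already authenticated the API key and looked up its current usage. The sketch clamps max_tokens before the TPM check so the worst-case output is counted against the budget.

```python
from dataclasses import dataclass


@dataclass
class Limits:            # hypothetical per-key limits
    rpm: int = 60
    tpm: int = 100_000
    max_tokens_cap: int = 1_024
    daily_usd: float = 50.0


@dataclass
class Usage:             # current usage for one authenticated API key
    requests_last_min: int = 0
    tokens_last_min: int = 0
    usd_today: float = 0.0


@dataclass
class Decision:
    allowed: bool
    reason: str
    max_tokens: int = 0  # clamped value to pass to the model when allowed


def admit(limits: Limits, usage: Usage, input_tokens: int,
          requested_max_tokens: int, est_cost_usd: float) -> Decision:
    """Run the gates in order; the first failing gate rejects the request."""
    if usage.requests_last_min + 1 > limits.rpm:
        return Decision(False, "rpm")
    max_tokens = min(requested_max_tokens, limits.max_tokens_cap)
    if usage.tokens_last_min + input_tokens + max_tokens > limits.tpm:
        return Decision(False, "tpm")
    if usage.usd_today + est_cost_usd > limits.daily_usd:
        return Decision(False, "daily_budget")
    return Decision(True, "ok", max_tokens)


# A 100 000-token input with a 4 000-token output request fails the token gate.
big = admit(Limits(), Usage(), input_tokens=100_000, requested_max_tokens=4_000, est_cost_usd=0.5)
small = admit(Limits(), Usage(), input_tokens=200, requested_max_tokens=4_000, est_cost_usd=0.01)
print(big.reason, small.reason, small.max_tokens)  # tpm ok 1024
```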

Where and how to enforce

At the gateway, not the application. Your rate-limiting middleware should sit in front of your application logic, ideally in a dedicated reverse proxy or API gateway layer (nginx, Envoy, Kong, a custom FastAPI middleware). This ensures limits apply regardless of which backend instance handles the request.

Distributed counters in Redis. For multi-instance deployments, per-process in-memory counters won’t work — each instance sees only a slice of traffic. Use Redis with atomic increment (INCR / INCRBY) and a TTL for the expiry window. A sliding log uses ZADD + ZCOUNT on a sorted set of timestamps.
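
A sliding log along those lines might look like the sketch below. It assumes `r` is a redis-py style client (for example `redis.Redis()`), and the function name and key layout are illustrative. The steps are not atomic as written; a production version would likely wrap them in a Lua script or a MULTI/EXEC transaction so concurrent requests cannot both slip past the check.

```python
import time
import uuid
from typing import Optional


def allow_sliding_log(r, key: str, limit: int, window_s: int = 60,
                      now: Optional[float] = None) -> bool:
    """Sliding-log limiter over a sorted set; `r` is assumed to be a redis-py style client."""
    now = time.time() if now is None else now
    cutoff = now - window_s
    r.zremrangebyscore(key, "-inf", cutoff)        # drop entries older than the window
    if r.zcount(key, cutoff, "+inf") >= limit:     # ZCOUNT: requests still inside the window
        return False
    member = f"{now}:{uuid.uuid4().hex}"           # unique member so equal timestamps do not collide
    r.zadd(key, {member: now})                     # ZADD: score is the request timestamp
    r.expire(key, window_s)                        # let idle keys expire on their own
    return True
```

A call such as `allow_sliding_log(client, "ratelimit:user123", limit=100)` would then gate each request for that user across all gateway instances.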

Return 429 with Retry-After. A bare 429 is sufficient for protection, but adding Retry-After: 60 lets well-behaved clients back off gracefully and retry rather than hammering indefinitely or giving up.
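
Rather than hard-coding 60, a token-bucket gateway could derive the wait from how far the bucket is below the request's weight. The helper below is a hypothetical sketch of that calculation; it returns a status code and header dict that a web framework would turn into a response.

```python
import math
from typing import Dict, Tuple


def too_many_requests(level: float, weight: float, refill_rate: float) -> Tuple[int, Dict[str, str]]:
    """Build a 429 status and a Retry-After header (whole seconds) from a token-bucket deficit."""
    deficit = max(0.0, weight - level)
    # Seconds until the bucket has refilled enough; never advertise less than 1 s.
    retry_after = max(1, math.ceil(deficit / refill_rate))
    return 429, {"Retry-After": str(retry_after)}


# Empty bucket, request of weight 30, refill at 1 unit/s.
status, headers = too_many_requests(level=0.0, weight=30.0, refill_rate=1.0)
print(status, headers)  # 429 {'Retry-After': '30'}
```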

Estimate tokens before the call. For TPM limiting, you need to count input tokens before sending the request (use a library like tiktoken for OpenAI models or the provider’s counting endpoint). Output tokens are only known after the fact — account for them by tracking a rolling spend and including the previous request’s output in the next check.
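
One way to sketch that bookkeeping: charge the input estimate at admission time, then record the actual output once the response arrives so it counts against later checks. The class and method names are illustrative, and the token counts are assumed to come from a tokenizer or from the provider's usage report.

```python
from collections import deque


class TokenBudget:
    """Rolling per-minute token budget: input charged up front, output charged after the call."""

    def __init__(self, tpm_limit: int, window_s: float = 60.0) -> None:
        self.tpm_limit = tpm_limit
        self.window_s = window_s
        self._events = deque()  # (timestamp, tokens) pairs inside the window
        self._used = 0

    def _expire(self, now: float) -> None:
        # Forget charges that have aged out of the rolling window.
        while self._events and self._events[0][0] <= now - self.window_s:
            _, tokens = self._events.popleft()
            self._used -= tokens

    def _charge(self, now: float, tokens: int) -> None:
        self._events.append((now, tokens))
        self._used += tokens

    def try_admit(self, now: float, input_tokens: int) -> bool:
        """Admit the request only if its input tokens fit in the remaining budget."""
        self._expire(now)
        if self._used + input_tokens > self.tpm_limit:
            return False
        self._charge(now, input_tokens)
        return True

    def record_output(self, now: float, output_tokens: int) -> None:
        """Charge output tokens once they are known, so the next check sees them."""
        self._expire(now)
        self._charge(now, output_tokens)


budget = TokenBudget(tpm_limit=10_000)
first = budget.try_admit(now=0.0, input_tokens=6_000)    # admitted
budget.record_output(now=2.0, output_tokens=3_000)       # output now counts too
second = budget.try_admit(now=3.0, input_tokens=2_000)   # 9 000 + 2 000 exceeds the budget
third = budget.try_admit(now=61.0, input_tokens=2_000)   # the first charge has expired
print(first, second, third)  # True False True
```

Because output is charged after the fact, a single request can overshoot the budget by its own output; the per-request max_tokens cap bounds how large that overshoot can be.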


Summary

Denial-of-wallet is the LLM-specific threat that makes your service unprofitable rather than unavailable. The defense is a set of layers, all enforced per identity at the gateway:

  • Token bucket for RPM/TPM enforcement — bursts allowed, sustained rate controlled.
  • Sliding window over fixed-window — no boundary-burst artifacts.
  • max_tokens on every request — per-call runaway prevention.
  • Daily $ cap + alerts — the financial circuit-breaker.
  • Auth everywhere — limits are meaningless without identity.

Each layer is cheap to add and cheap to operate. The cost of skipping any one of them is a bill you discover at midnight.


Quick check

Q1. A token bucket is configured with C = 120 and r = 2 tokens/s. A client empties the bucket, stays idle for 30 seconds, then sends 80 requests simultaneously. How many are allowed?
Q2. Your inference gateway enforces a fixed-window limit of 100 requests per minute. An attacker sends 100 requests at 14:07:59 and another 100 at 14:08:01. How many requests pass in total?
Q3. You enforce an RPM limit of 60 requests/minute but no token limit. An attacker sends 60 requests per minute, each with a 100 000-token input and requesting 4 000-token outputs. Why is this still a problem?

Practice this in an interview

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.

What are tokens in an LLM and why is API pricing per token rather than per word or character?

A token is the smallest unit a language model processes — typically a word, sub-word fragment, or punctuation mark produced by a byte-pair encoding (BPE) or similar algorithm. Pricing is per token because each token requires one forward-pass position in the attention matrix, directly driving compute and memory cost regardless of whether it maps to a full word or a single letter.

What is prompt injection and how do you defend against it?

Prompt injection is an attack in which malicious text in retrieved documents or user input overrides the application's system instructions, redirecting the model to perform unintended actions. Defenses layer input/output validation, privilege separation, and tool-call confirmation — no single fix is sufficient.

How does autoscaling work for ML inference services, and what metrics should drive it?

ML inference services should scale on request queue depth or GPU utilization rather than CPU utilization alone, because GPU-heavy workloads keep CPU near-idle even under full load. Horizontal Pod Autoscaler in Kubernetes can be configured with custom metrics, and scale-to-zero with a warm-up buffer prevents cold-start latency spikes.
