Denial of wallet: the attack that bankrupts your AI app

The alert arrives at 2:07 a.m. Not a pager screaming about downtime. Not an error rate spiking on your dashboard. Just a quiet email from your cloud billing system: you have reached 80% of your monthly budget. It is the third day of the month. You open your laptop and pull up the cost graph. The line is nearly vertical — flat for weeks, then a cliff face starting at 11 p.m. By the time you read the alert, you have already spent $14,000. By the time you find the kill switch, you will have spent $38,000. The app never went down. Users never noticed. Your servers never flinched.

That is the point.

DoS versus DoW

A denial-of-service attack has one goal: break your availability. Flood the network, exhaust your memory, crash a process. The attacker wins when your service goes offline. Defenders have spent two decades building shields for exactly this — CDNs, rate limiters, anycast routing, TCP SYN cookies. The defenses are mature. The playbook is written.

Denial-of-wallet is a different game entirely. The attacker does not want your service offline. An offline service costs them nothing. They want your invoice to explode. They win when you pay, not when you crash.

The attack surface opened up the moment the tech industry wired expensive, metered APIs into the critical path of web applications. LLM inference is the widest part of that surface right now. Every token you generate on behalf of a user goes on your tab. Claude Sonnet charges per million input tokens and per million output tokens. GPT-4o, Gemini 1.5, Llama via any managed inference endpoint — same structure. The model vendor doesn’t care who prompted the request. If the request hits your API key, the cost is yours.

An attacker who discovers your completion endpoint — or, more commonly, a legitimate user who figures out the economics — can craft the most expensive request your schema allows and fire it in a loop. No exploit required. No vulnerability to patch. Just a POST request and patience.

The trap: counting requests

The instinctive defense is a rate limit. “We allow 100 requests per minute per API key. We’re fine.”

You are not fine. Here is the math that breaks that intuition.

A short conversational exchange (classify this sentence, translate this phrase, answer this yes/no question) might cost 40 to 80 tokens total. A long-context completion on a current frontier model can easily run to 4,000 input tokens plus 4,000 output tokens or more. That is at least a 100x difference in tokens, and therefore in cost, between the cheapest request your schema allows and the most expensive one. Both count as exactly one request against your rate limit.

At 100 requests per minute, a user sending 80-token pings consumes roughly $0.02 of budget per minute, assuming a blended price of about $2.50 per million tokens (an illustrative figure; actual rates vary by model and by input versus output). A user sending 8,000-token completions at the same request rate consumes roughly $2.00 per minute. Same rate limit, 100x the damage. The RPM counter never trips, and the second user burns through $120 an hour.
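The arithmetic can be checked in a few lines. This is a sketch under a loud assumption: BLENDED_PRICE_PER_MTOK is an illustrative blended figure, not a quoted vendor price, and real pricing charges input and output tokens at different rates. The function name is hypothetical.

```python
# Illustrative arithmetic only. BLENDED_PRICE_PER_MTOK is an assumed figure,
# not a quoted vendor price; real pricing splits input and output tokens.
BLENDED_PRICE_PER_MTOK = 2.50  # dollars per million tokens (assumption)


def dollars_per_minute(requests_per_minute: int, tokens_per_request: int) -> float:
    """Spend rate for a client sending identical requests at a fixed rate."""
    tokens = requests_per_minute * tokens_per_request
    return tokens * BLENDED_PRICE_PER_MTOK / 1_000_000


ping = dollars_per_minute(100, 80)       # short classification-style calls
heavy = dollars_per_minute(100, 8_000)   # 4,000 in + 4,000 out per call

print(f"ping:  ${ping:.2f}/min")
print(f"heavy: ${heavy:.2f}/min, ${heavy * 60:.2f}/hour")
print(f"ratio: {heavy / ping:.0f}x at the same request rate")
```

Swap in your own vendor's prices and request sizes; the ratio between the cheapest and most expensive request is the number to watch, not the absolute dollars.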

Requests are the wrong unit. Tokens are what bill you. Dollars are what hurt you. Your defense has to be denominated in the units that actually matter.

Building the gauntlet

Hardening against denial-of-wallet is not a single control. It is a layered gauntlet that the request must pass through before it ever reaches the model. Each layer closes a gap the previous one leaves open.

Figure: The five-layer gauntlet. Most DoW attacks die at the TPM/$ gate, the first layer denominated in the unit that actually bills you.

Layer 1: authentication that means something. An API endpoint with no auth is a public tap on your billing account. Even a simple API key (one per registered user, or one per tenant on a SaaS product) is the minimum necessary before any other limit is meaningful. Without identity, a per-identity rate limit has nothing to key on. This sounds obvious. The number of LLM-backed products that went to production with a global unauthenticated endpoint suggests it is not obvious enough.

Layer 2: requests per minute, per identity. This is the classic rate limit and it still earns its place. It stops naive flooding, prevents accidental loops from hammering the endpoint, and keeps your request logs readable. It just doesn’t stop the attacker who has read this far.
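A minimal sketch of the Layer 2 control, assuming an in-memory, single-process gateway. The class name and its parameters are hypothetical; a production setup would usually keep these counters in a shared store so that every gateway instance sees the same counts.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class RequestRateLimiter:
    """Sliding-window limit on requests per identity (hypothetical helper)."""

    def __init__(self, max_requests: int, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # api_key -> timestamps of requests accepted inside the window
        self._hits = defaultdict(deque)

    def allow(self, api_key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[api_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # caller should answer with HTTP 429
        hits.append(now)
        return True
```

Note what this counts: one entry per request, whatever its size. An 80-token ping and an 8,000-token completion are indistinguishable here, which is exactly the gap the next layer closes.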

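Layer 3: tokens per minute and dollars per minute, per identity. This is the layer that actually does the work. Keep a token bucket per API key that refills at the tier's per-minute allowance, and draw both prompt tokens and completion tokens from it. Completion length is unknown until the model responds, so reserve the worst case (prompt tokens plus max_tokens) before the call and refund the difference afterward. When the bucket cannot cover a request, return a 429. Set the bucket size to whatever your pricing model supports for that tier: a free user might get 20,000 TPM; a paid user might get 200,000. To enforce dollars per minute, run the same bucket denominated in dollars, converting tokens at your vendor's input and output prices. This is how you contain the 8,000-token-per-request attacker that your RPM limit lets straight through.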
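One way to sketch the Layer 3 bucket, under stated assumptions: one bucket object per API key, held in memory, with the worst case (prompt tokens plus the output cap) reserved before the call and the unused part returned once real usage is known. All names here are hypothetical, and the reserve-then-settle step is a design choice of this sketch rather than something the layer requires.

```python
import time
from typing import Optional


class TokenBudgetBucket:
    """Token bucket denominated in model tokens for one API key (sketch)."""

    def __init__(self, tokens_per_minute: int, now: Optional[float] = None):
        self.capacity = float(tokens_per_minute)
        self.refill_per_second = tokens_per_minute / 60.0
        self.available = self.capacity
        self.updated_at = time.monotonic() if now is None else now

    def _refill(self, now: float) -> None:
        # Add back tokens for the time elapsed, never above capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.available = min(self.capacity,
                             self.available + elapsed * self.refill_per_second)
        self.updated_at = now

    def try_reserve(self, prompt_tokens: int, max_output_tokens: int,
                    now: Optional[float] = None) -> bool:
        """Reserve the worst-case size of a call; False means respond 429."""
        self._refill(time.monotonic() if now is None else now)
        worst_case = prompt_tokens + max_output_tokens
        if worst_case > self.available:
            return False
        self.available -= worst_case
        return True

    def settle(self, reserved_tokens: int, actual_tokens: int) -> None:
        """Return the unused part of a reservation once usage is known."""
        refund = max(0, reserved_tokens - actual_tokens)
        self.available = min(self.capacity, self.available + refund)


# A free-tier key at 20,000 TPM: two 8,000-token calls fit, the third does not.
bucket = TokenBudgetBucket(20_000, now=0.0)
print(bucket.try_reserve(4_000, 4_000, now=0.0))  # True
print(bucket.try_reserve(4_000, 4_000, now=0.0))  # True
print(bucket.try_reserve(4_000, 4_000, now=0.0))  # False
```

The same structure can be denominated in dollars instead of tokens by converting each reservation at the relevant input and output prices before drawing it from the bucket.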

Layer 4: cap max_tokens on every completion call. This one is inside your application code, not your gateway. When you call the model, you set max_tokens in the request body. If you omit it where the API allows that, or set it to the model’s maximum, every completion is allowed to run to the model’s full output limit. Set it to the longest output your use case legitimately needs: chat responses rarely need more than 1,000 tokens; structured extraction rarely needs more than 500. Combined with a limit on prompt length, a hard max_tokens ceiling means a single call has a known worst-case cost before it hits the model. This is also good for your latency and your response quality; it is free defense.
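A small sketch of the Layer 4 clamp. The ceilings mirror the figures suggested above; the use-case names, the fallback value, and both function names are assumptions for illustration. Prices are passed in as arguments rather than hard-coded, since they vary by vendor and model.

```python
from typing import Optional

# Hypothetical per-use-case ceilings; the numbers echo the guidance above.
OUTPUT_CEILINGS = {"chat": 1_000, "extraction": 500}
DEFAULT_CEILING = 500  # conservative fallback for unlisted use cases (assumption)


def capped_max_tokens(use_case: str, requested: Optional[int] = None) -> int:
    """Clamp a caller-supplied max_tokens to the ceiling for this use case."""
    ceiling = OUTPUT_CEILINGS.get(use_case, DEFAULT_CEILING)
    if requested is None or requested <= 0:
        return ceiling
    return min(requested, ceiling)


def worst_case_cost(prompt_tokens: int, max_tokens: int,
                    input_price_per_mtok: float,
                    output_price_per_mtok: float) -> float:
    """Upper bound in dollars for one call, known before it reaches the model."""
    return (prompt_tokens * input_price_per_mtok
            + max_tokens * output_price_per_mtok) / 1_000_000


# A client asks for 50,000 output tokens on a chat endpoint; it gets 1,000.
limit = capped_max_tokens("chat", requested=50_000)
print(limit)
```

The value returned by capped_max_tokens is what goes into the max_tokens field of the request body. The worst-case figure is also the natural amount to reserve against a per-identity token or dollar budget before making the call.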

Layer 5: a daily dollar ceiling and a global kill-switch. The previous layers are per-identity and per-request. This layer is aggregate. Set a hard daily spend ceiling per tenant (a free-tier tenant might cap at $0.50/day, a paid tenant at $5.00/day) and wire it to a circuit-breaker that rejects all completions for that tenant until the daily window resets. Separately, set a global ceiling for your entire application and alert at 50%, 80%, and 100% thresholds. The global kill-switch is the last resort: when the global ceiling trips, the app returns degraded responses (or a polite error) rather than continuing to spend. You will never need it if the per-identity layers hold. You will be glad it exists the one time they don’t.
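The Layer 5 breaker can be sketched as a pair of daily counters. Everything here is an assumption made for illustration: the class name, the in-memory state, the use of the calendar date as the reset boundary, and the caller passing in today's date so the logic is easy to test.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Optional


class SpendCircuitBreaker:
    """Daily dollar ceilings per tenant plus one global kill-switch (sketch)."""

    ALERT_FRACTIONS = (0.5, 0.8, 1.0)

    def __init__(self, tenant_limits: Dict[str, float],
                 default_tenant_limit: float, global_limit: float):
        self.tenant_limits = tenant_limits
        self.default_tenant_limit = default_tenant_limit
        self.global_limit = global_limit
        self._day: Optional[date] = None
        self._tenant_spend: Dict[str, float] = defaultdict(float)
        self._global_spend = 0.0

    def _roll_over(self, today: date) -> None:
        # Reset every counter when the calendar day changes.
        if today != self._day:
            self._day = today
            self._tenant_spend.clear()
            self._global_spend = 0.0

    def record(self, tenant: str, dollars: float, today: date) -> None:
        """Add the cost of a finished call to the tenant and global totals."""
        self._roll_over(today)
        self._tenant_spend[tenant] += dollars
        self._global_spend += dollars

    def allow(self, tenant: str, today: date) -> bool:
        self._roll_over(today)
        if self._global_spend >= self.global_limit:
            return False  # global kill-switch: degrade or return a polite error
        limit = self.tenant_limits.get(tenant, self.default_tenant_limit)
        return self._tenant_spend.get(tenant, 0.0) < limit

    def alerts_reached(self) -> List[float]:
        """Fractions of the global ceiling that today's spend has reached."""
        return [f for f in self.ALERT_FRACTIONS
                if self._global_spend >= f * self.global_limit]
```

Call allow before each completion and record after it, using the cost computed from the usage the vendor reports. Because spend is recorded after the fact, a tenant can overshoot its ceiling by at most one call, which is one more reason to keep the per-call output cap tight.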

Patterns worth watching

Spending doesn’t spike randomly. There are signatures. Very long prompts that arrive within seconds of authentication are worth flagging: a user who just got an API key and immediately sent a 16,000-token document, for example. So are retried failures at the same token size, which suggest an automated loop that doesn’t back off on errors, and fan-out patterns where one user triggers a dozen parallel requests in a tight window. None of these is proof of malice; all of them warrant a second look. A simple anomaly alert (“this key consumed more tokens in the last five minutes than its entire previous history”) costs almost nothing to build and catches both attacks and bugs.
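The anomaly alert quoted above can be sketched as two running totals per key: tokens inside the last five minutes and tokens from everything older. The class name is hypothetical and the state is in memory. One caveat of this sketch: a brand-new key has no history to compare against, so it never fires here, and the long-prompt-right-after-signup signature needs its own rule.

```python
from collections import defaultdict, deque


class UsageAnomalyDetector:
    """Flags a key whose recent usage exceeds all of its earlier usage (sketch)."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)      # key -> (timestamp, tokens) in window
        self._recent_total = defaultdict(int)  # key -> tokens inside the window
        self._older_total = defaultdict(int)   # key -> tokens from before the window

    def record(self, api_key: str, tokens: int, now: float) -> bool:
        """Record usage; return True if this key now looks anomalous."""
        recent = self._recent[api_key]
        # Move entries that have aged out of the window into the history total.
        while recent and now - recent[0][0] >= self.window_seconds:
            _, aged = recent.popleft()
            self._recent_total[api_key] -= aged
            self._older_total[api_key] += aged
        recent.append((now, tokens))
        self._recent_total[api_key] += tokens
        older = self._older_total[api_key]
        # No history means nothing to compare against; stay quiet.
        return older > 0 and self._recent_total[api_key] > older
```

A True result is a prompt for a second look rather than grounds for an automatic block, since the same signal fires on a customer's runaway retry loop as readily as on an attack.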

The agentic multiplier

Everything above assumes a single request-response exchange. Agents break that assumption in the worst way.

A tool-using agent in an agentic framework can loop. Each iteration of the think-act-observe cycle is another completion call. An agent instructed to “research this topic thoroughly” on an unrestricted task can spin for dozens of turns, calling search tools, summarizing results, expanding scope, each turn consuming tokens. If the agent is also permitted to spawn sub-agents, the spend multiplies again.

The defense is a budget envelope per session or per agent run, not per request. Decide in advance: this task gets a maximum of 50,000 tokens or $0.10, whichever comes first. Track cumulative spend across the run and halt the agent when the envelope is exhausted, or earlier if the task doesn’t seem to be converging. This is not just a cost control; it is also a correctness control. An agent that has consumed ten times the expected budget for a task has almost certainly gone wrong somewhere.
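A budget envelope for one agent run might look like the following sketch. The step callable is a stand-in for one think-act-observe iteration and is assumed to report its own token and dollar usage; that interface, the class, and the turn limit are all hypothetical rather than taken from any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class BudgetEnvelope:
    """Cumulative token and dollar budget for a single agent run (sketch)."""
    max_tokens: int = 50_000
    max_dollars: float = 0.10
    tokens_used: int = 0
    dollars_used: float = 0.0

    def charge(self, tokens: int, dollars: float) -> None:
        self.tokens_used += tokens
        self.dollars_used += dollars

    @property
    def exhausted(self) -> bool:
        # Whichever limit is reached first ends the run.
        return (self.tokens_used >= self.max_tokens
                or self.dollars_used >= self.max_dollars)


# step(turn) -> (done, tokens_used, dollars_spent); a hypothetical interface.
Step = Callable[[int], Tuple[bool, int, float]]


def run_agent(step: Step, envelope: BudgetEnvelope, max_turns: int = 25) -> str:
    for turn in range(max_turns):
        if envelope.exhausted:
            return "halted: budget exhausted"
        done, tokens, dollars = step(turn)
        envelope.charge(tokens, dollars)
        if done:
            return "finished"
    return "halted: turn limit reached"


# A runaway step that never converges and spends 10,000 tokens per turn.
envelope = BudgetEnvelope()
print(run_agent(lambda turn: (False, 10_000, 0.03), envelope))
print(envelope.tokens_used)
```

Because usage is charged after each turn, the final turn can overshoot the envelope by one call. Passing the remaining budget into each turn as its output cap would tighten that bound. If sub-agents are allowed, having them share the parent's envelope keeps the total bounded.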

Insuring the wallet

The security industry built an entire discipline around availability: DDoS mitigation, CDN edge defense, SYN flood protection, load shedding. The reasoning was simple — downtime costs money, downtime damages trust, downtime is visible. Every engineering team understood the threat model.

The metered-AI era has added a second axis of harm that is just as real and far less understood. Your service stays up. Your users notice nothing unusual. The damage is entirely financial, entirely invisible until the billing alert arrives, and entirely avoidable if the controls were in place before the incident. You would not put a server on the public internet without a DDoS policy. Don’t put an LLM endpoint there without a spend policy.

The token-bucket math and the full defense playbook — including the gateway implementation patterns and the circuit-breaker wiring — are worked through in depth in our Generative AI → Systems Design at Scale → Rate Limiting and Cost Control lesson. The billing alert at 2 a.m. is optional.