How would you prevent an AI agent from leaking or misusing API credentials?

Keep raw credentials outside model context and traces. Let the model propose typed intent, authorize the final action and arguments deterministically, then have a trusted executor inject a short-lived, narrowly scoped, audience-restricted credential for one call. Re-authorize downstream and gate high-impact writes with explicit approval.

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing easy requests to cheaper models, since model choices drive every downstream cost. Then optimize the runtime with batching, response caching, and (for LLMs) prompt caching, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

How would you defend an LLM application against prompt injection?

No single fix is complete, so defenses are layered: separate trusted instructions from untrusted data, restrict the tools and actions the model can take to least privilege, validate and sanitize inputs and tool outputs, add output guardrails and injection classifiers, and keep a human in the loop for sensitive actions. Treat all external or retrieved content as untrusted.

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend substantially (60–90% on favorable workloads), but confirm with evals that quality has not regressed rather than assuming it.

Rate limiting & denial-of-wallet — Generative AI

You built a product on a hosted LLM API. Every token your users send — and every token the model generates — appears on your bill. A classic denial-of-service attack tries to take your service down. A denial-of-wallet attack doesn’t bother. It just keeps sending expensive requests until your cloud bill hits five figures and your credit card declines. The service stays up. Your company goes down.

This lesson is the systems-design blueprint for stopping it.

What makes inference APIs uniquely vulnerable

With a traditional API, every request costs roughly the same. You can reason: “100 requests per minute is safe.” With an LLM API, one request can cost hundreds of times more than another. A 200-token chat message and a 128 000-token document summarization are both “one request,” yet the document carries 640 times as many input tokens, which puts its cost between two and three orders of magnitude higher.
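To make the gap concrete, here is a small arithmetic sketch. The per-token prices are hypothetical placeholders chosen only for illustration (real prices vary by provider and model), and request_cost_usd is a helper name invented for this example.

```python
# Hypothetical prices for illustration only; check your provider's price list.
PRICE_PER_1K_INPUT_USD = 0.003
PRICE_PER_1K_OUTPUT_USD = 0.015


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one call under the assumed per-1K-token prices."""
    input_cost = (input_tokens / 1000) * PRICE_PER_1K_INPUT_USD
    output_cost = (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_USD
    return input_cost + output_cost


# Compare input cost only: a short chat message vs. a long document.
chat = request_cost_usd(input_tokens=200, output_tokens=0)
document = request_cost_usd(input_tokens=128_000, output_tokens=0)
ratio = document / chat
print(f"input-cost ratio: {ratio:.0f}x")  # 128 000 / 200 = 640
```

Both calls count as one request against an RPM limit, which is why a request count alone says very little about spend.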

This breaks every naive rate-limiting strategy built around request counts.

The three dimensions you must limit

Rate limiting an LLM API requires enforcement on at least three axes, each keyed per API key (or per authenticated user):

Dimension	What it limits	Why it matters
RPM — requests per minute	Request count	Stops trivial hammering; baseline hygiene
TPM — tokens per minute	Input + output tokens combined	Directly proportional to cost; the key limit
Daily $ budget	Estimated spend	Hard financial circuit-breaker

Many hosted LLM providers expose limits along these lines, though the exact set varies by provider. Your own gateway must enforce all three too, because a provider-side limit applies to your account as a whole and cannot tell your individual users or models apart.

The token-bucket algorithm

The most widely used rate-limiting algorithm is the token bucket. Despite the name, the “tokens” here are capacity units — nothing to do with LLM tokens. The intuition is deliberately mechanical:

Imagine a bucket that holds up to C units (the capacity or burst limit).
The bucket refills at a constant rate of r units per second.
Every incoming request consumes some units from the bucket. A request of weight w is allowed only if the bucket currently holds at least w units; after allowing it, deduct w. If the bucket is empty (or below w), reject with HTTP 429 and a Retry-After header.

The refill math at any moment:

bucket = min(C, current + r × t)

where t is the number of seconds since the last check. Because you only ever compute the bucket on a request, you don’t need a background timer — you just compute how much would have accumulated.

Concrete example. C = 60, r = 1/s. A client that emptied its bucket and then sat idle for 30 seconds comes back to 30 units in the bucket (min(60, 0 + 1×30)), so it can burst 30 requests at once; after 60 seconds idle the bucket is full and it can burst 60. Either way, once the bucket is drained the client is held to a sustained 1 req/s. A client hammering from the start with a full bucket burns 60 units instantly, then gets at most 1 req/s after that.
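The refill rule can be sketched in a few lines of Python. This is a minimal single-process illustration, not a production limiter: the class and field names are invented for this example, the bucket is assumed to start empty, and the now parameter exists only so the example can use fixed timestamps instead of the real clock.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TokenBucket:
    capacity: float      # C: maximum units the bucket can hold
    refill_rate: float   # r: units added per second
    level: float = 0.0   # current units; starting empty is an assumption
    last_check: float = field(default_factory=time.monotonic)

    def allow(self, weight: float = 1.0, now: Optional[float] = None) -> bool:
        """Lazily refill, then admit the request if `weight` units are available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last_check)
        # bucket = min(C, current + r * t), computed only when a request arrives
        self.level = min(self.capacity, self.level + self.refill_rate * elapsed)
        self.last_check = now
        if self.level >= weight:
            self.level -= weight
            return True
        return False  # caller should answer with HTTP 429


# An empty bucket (C = 60, r = 1/s) left idle for 30 seconds, then hit 60 times.
bucket = TokenBucket(capacity=60, refill_rate=1.0, level=0.0, last_check=0.0)
allowed = sum(bucket.allow(now=30.0) for _ in range(60))
print(allowed)  # 30: only the units refilled during the idle period
```

For TPM limiting the same class works with weight set to the request's token count instead of 1.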

Token bucket: refills at r tokens/s up to capacity C. Requests drain the bucket; an empty bucket triggers HTTP 429 with a Retry-After header.

Fixed-window vs sliding-window counters

When people first implement rate limiting without a token bucket, they often reach for a fixed-window counter: increment a counter keyed to the current minute (e.g. ratelimit:user123:2026-05-29T14:07) and reject if the counter exceeds the limit.

It’s simple and fast. It also has a critical flaw.

The boundary-burst problem

A fixed-window limit of 100 requests/minute does not guarantee fewer than 100 requests in any 60-second span. A client can:

Send 100 requests at 14:07:58 (end of window 14:07 — all pass).
Immediately send 100 more requests at 14:08:01 (start of window 14:08 — all pass).

Result: 200 requests in 3 seconds. The attacker effectively doubles your limit at every window boundary.
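A short simulation reproduces the problem. It is a sketch with an in-memory counter standing in for whatever store you use; the function name and the seconds-since-midnight timestamps are choices made for this example.

```python
from collections import defaultdict

LIMIT = 100           # requests allowed per window
WINDOW_SECONDS = 60

counters = defaultdict(int)  # window index -> requests seen in that window


def allow_fixed_window(timestamp: float) -> bool:
    """Fixed-window check: the counter is keyed to the wall-clock minute."""
    window = int(timestamp // WINDOW_SECONDS)
    if counters[window] >= LIMIT:
        return False
    counters[window] += 1
    return True


# Seconds since midnight for 14:07:58 and 14:08:01.
end_of_window = 14 * 3600 + 7 * 60 + 58
start_of_next = 14 * 3600 + 8 * 60 + 1

passed = sum(allow_fixed_window(end_of_window) for _ in range(100))
passed += sum(allow_fixed_window(start_of_next) for _ in range(100))
print(passed)  # 200: every request passes despite the 100/minute limit
```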

A sliding-window approach — either a log-based window (store a timestamp for each request, count those within the last 60 s) or a weighted approximation (blend the current-window count with a fraction of the previous-window count) — eliminates this artifact by keeping the count continuously up to date.
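Both sliding-window variants fit in a few lines. The sketch below is single-process and in-memory, with class and function names invented for this example; it is meant to show the logic, not to be dropped into a gateway as-is.

```python
from collections import deque


class SlidingWindowLog:
    """Log-based window: keep one timestamp per admitted request."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the rolling window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


def weighted_estimate(prev_count: int, curr_count: int,
                      elapsed_in_window: float,
                      window_seconds: float = 60.0) -> float:
    """Approximate the rolling count from two fixed-window counters."""
    # Share of the previous window that still overlaps the rolling span.
    overlap = 1.0 - elapsed_in_window / window_seconds
    return curr_count + prev_count * overlap


# The boundary burst again: 100 requests at second 58, 100 more at second 61.
limiter = SlidingWindowLog(limit=100)
first = sum(limiter.allow(58.0) for _ in range(100))
second = sum(limiter.allow(61.0) for _ in range(100))
print(first, second)  # 100 0: the second burst is rejected

# One second into the new window, the estimate is still close to the limit.
estimate = weighted_estimate(prev_count=100, curr_count=0, elapsed_in_window=1.0)
```

The log is exact but stores one entry per request; the weighted estimate needs only two counters per key and assumes the previous window's traffic was spread evenly.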

Fixed-window counters reset at wall-clock boundaries, letting a burst at t=:59 and another at t=:00 pass 2× the intended limit. Sliding windows track the rolling 60 s span and eliminate the artifact.

Leaky bucket — the third variant

The leaky bucket is the inverse of the token bucket: instead of accumulating capacity that requests drain, incoming requests fill a queue and are processed at a constant output rate. This smooths bursty input into a steady stream. It’s useful for outbound call shaping (e.g. your service calling a downstream API at a capped rate) but less common for admission control, where the token bucket’s burst allowance is usually preferable.
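For outbound shaping, one way to sketch a leaky bucket is to skip the explicit queue and give each call a send time instead, spaced at the leak interval. The class below is an illustration under that assumption; the names and the max_wait_slots parameter (the longest wait tolerated, measured in send intervals) are invented for this example.

```python
from typing import Optional


class LeakyBucketShaper:
    """Leaky-bucket-style shaper: calls leave at a constant rate."""

    def __init__(self, leak_rate: float, max_wait_slots: int):
        self.interval = 1.0 / leak_rate       # seconds between outgoing calls
        self.max_wait_slots = max_wait_slots  # backlog bound, in send intervals
        self.next_free = 0.0                  # earliest time the next call may leave

    def schedule(self, now: float) -> Optional[float]:
        """Return the time this call may be sent, or None if the backlog is full."""
        send_at = max(now, self.next_free)
        slots_ahead = (send_at - now) / self.interval
        if slots_ahead >= self.max_wait_slots:
            return None  # bucket overflow: drop or reject the call
        self.next_free = send_at + self.interval
        return send_at


# Seven calls arrive at once; the downstream API is capped at 2 calls/s.
shaper = LeakyBucketShaper(leak_rate=2.0, max_wait_slots=5)
times = [shaper.schedule(now=0.0) for _ in range(7)]
print(times)  # [0.0, 0.5, 1.0, 1.5, 2.0, None, None]
```

Unlike the token bucket, an idle period buys no burst here: output is always spaced at least one interval apart.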

Layered defense

No single control is enough. A full defense applies all of the following:

RPM limit at the gateway. Stops the most trivial hammering and protects your infrastructure from connection exhaustion. Keyed per API key.
TPM / token budget at the gateway. The core anti-DoW control. Sum input + output tokens per request as they flow through; reject when the per-minute token budget is exceeded.
max_tokens cap per request. Pass a hard max_tokens parameter to the model on every call. This is your per-request firewall — it prevents any single request from generating a runaway multi-thousand-token response, even if the token budget hasn’t been hit yet.
Daily dollar cap + alerting. Set hard monthly/daily spend limits in your cloud provider dashboard. Wire cost alerts at 50%, 80%, and 100% of budget. This is your financial circuit-breaker — if everything else fails, the bill stops growing and you get paged.
Authentication. All of the above is useless if limits aren’t per-identity. Require an API key on every request; enforce limits keyed to that identity. Anonymous endpoints are rate-limited by IP, which is trivially bypassed.

Layered defense: a request must pass RPM, token budget, per-request max_tokens, and the daily spend cap before reaching the model. A denial-of-wallet attacker is neutralized at the token-budget gate.
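The order of the gates can be sketched as a single admission function. Everything here is illustrative: the limit values, the Limits and Usage types, and the admit function are assumptions made for this example, and the per-minute and per-day counters are presumed to be maintained elsewhere (for example by the limiters described above).

```python
from dataclasses import dataclass
from typing import NamedTuple


@dataclass
class Limits:            # hypothetical per-key limits
    rpm: int = 60
    tpm: int = 100_000
    max_output_tokens: int = 1_024
    daily_budget_usd: float = 50.0


@dataclass
class Usage:             # current counters for one API key
    requests_this_minute: int = 0
    tokens_this_minute: int = 0
    spend_today_usd: float = 0.0


class Decision(NamedTuple):
    status: int          # HTTP status to return
    reason: str
    max_tokens: int      # value to pass to the model if admitted


def admit(api_key: str, input_tokens: int, requested_max_tokens: int,
          est_cost_usd: float, usage: Usage, limits: Limits) -> Decision:
    if not api_key:                                   # authentication first
        return Decision(401, "missing API key", 0)
    if usage.requests_this_minute + 1 > limits.rpm:   # RPM gate
        return Decision(429, "request rate exceeded", 0)
    # Per-request cap: never let the caller ask for more than the ceiling.
    max_tokens = min(requested_max_tokens, limits.max_output_tokens)
    projected = usage.tokens_this_minute + input_tokens + max_tokens
    if projected > limits.tpm:                        # token-budget gate
        return Decision(429, "token budget exceeded", 0)
    if usage.spend_today_usd + est_cost_usd > limits.daily_budget_usd:
        return Decision(429, "daily budget exhausted", 0)
    return Decision(200, "ok", max_tokens)


# A denial-of-wallet style request: huge input, large output requested.
attack = admit("key-1", 100_000, 4_000, 0.40, Usage(), Limits())
normal = admit("key-1", 200, 500, 0.01, Usage(), Limits())
print(attack.reason, "/", normal.reason)  # token budget exceeded / ok
```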

Where and how to enforce

At the gateway, not the application. Your rate-limiting middleware should sit in front of your application logic, ideally in a dedicated reverse proxy or API gateway layer (nginx, Envoy, Kong, a custom FastAPI middleware). This ensures limits apply regardless of which backend instance handles the request.

Distributed counters in Redis. For multi-instance deployments, per-process in-memory counters won’t work, because each instance sees only a slice of traffic. Use Redis with atomic increment (INCR / INCRBY) and a TTL so the counter expires with its window. A sliding log uses ZADD + ZCOUNT on a sorted set of timestamps, plus ZREMRANGEBYSCORE to trim entries older than the window.
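Here is a hedged sketch of both Redis patterns. The functions take a client object that is assumed to behave like redis-py's redis.Redis (they call incrby, expire, zadd, zcount, and zremrangebyscore); the key names and function names are invented for this example. The read-then-write steps are not atomic as written, so a real deployment would likely wrap them in a transaction or a Lua script.

```python
import time
import uuid


def allow_fixed_window(client, api_key: str, weight: int, limit: int,
                       window_seconds: int = 60, now=None) -> bool:
    """Counter per (key, window). `weight` is 1 for RPM or a token count for TPM."""
    now = time.time() if now is None else now
    window = int(now // window_seconds)
    key = f"ratelimit:{api_key}:{window}"
    used = client.incrby(key, weight)           # atomic increment
    if used == weight:                          # first write in this window
        client.expire(key, window_seconds * 2)  # let old windows expire
    # Note: rejected requests still count against the window in this sketch.
    return used <= limit


def allow_sliding_log(client, api_key: str, limit: int,
                      window_seconds: int = 60, now=None) -> bool:
    """Sorted set of request timestamps, trimmed to the rolling window."""
    now = time.time() if now is None else now
    key = f"ratelog:{api_key}"
    client.zremrangebyscore(key, 0, now - window_seconds)  # drop aged entries
    if client.zcount(key, now - window_seconds, "+inf") >= limit:
        return False
    # The member must be unique per request; the score is the timestamp.
    client.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    client.expire(key, window_seconds)
    return True
```

With redis-py installed, the client would be something like redis.Redis(host=..., port=...); any instance of your gateway can then call these functions and see the same counters.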

Return 429 with Retry-After. A bare 429 is sufficient for protection, but adding Retry-After: 60 lets well-behaved clients back off gracefully and retry rather than hammering indefinitely or giving up.
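Rather than hard-coding 60, the header value can be derived from the limiter's state. The sketch below assumes a token bucket with a known fill level and refill rate; the function names and the (status, headers, body) tuple are conventions invented for this example, not any framework's API.

```python
import math


def retry_after_seconds(level: float, weight: float, refill_rate: float) -> int:
    """Whole seconds until a bucket at `level` holds `weight` units."""
    deficit = max(0.0, weight - level)
    return math.ceil(deficit / refill_rate)


def too_many_requests(level: float, weight: float, refill_rate: float):
    """Build a framework-agnostic 429 response as (status, headers, body)."""
    wait = max(1, retry_after_seconds(level, weight, refill_rate))
    return 429, {"Retry-After": str(wait)}, "rate limit exceeded"


# An empty bucket refilling at 2 units/s, and a request that needs 30 units.
status, headers, body = too_many_requests(level=0.0, weight=30.0, refill_rate=2.0)
print(status, headers["Retry-After"])  # 429 15
```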

Estimate tokens before the call. For TPM limiting, you need to count input tokens before sending the request (use a library like tiktoken for OpenAI models or the provider’s counting endpoint). Output tokens are only known after the fact, so read the actual usage from each response and add it to the rolling per-minute total; the next request’s check then includes the previous request’s output.
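One variant of this accounting, sketched below, reserves the input estimate plus max_tokens up front and then corrects the total once the real usage is known. This reserve-then-reconcile scheme is a design choice for the example rather than the only option. The TokenBudget class is hypothetical, the per-minute reset is omitted, and rough_token_count is a crude stand-in (about four characters per token) for a real tokenizer such as tiktoken or a provider's counting endpoint.

```python
from typing import Callable, Optional


def rough_token_count(text: str) -> int:
    """Crude placeholder: roughly 4 characters per token. Use a real tokenizer."""
    return max(1, len(text) // 4)


class TokenBudget:
    """Per-minute token budget for one API key (window reset not shown)."""

    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.used = 0

    def reserve(self, prompt: str, max_tokens: int,
                count_tokens: Callable[[str], int] = rough_token_count
                ) -> Optional[int]:
        """Charge the worst case before the call; None means reject with 429."""
        estimate = count_tokens(prompt) + max_tokens
        if self.used + estimate > self.limit:
            return None
        self.used += estimate
        return estimate

    def reconcile(self, reserved: int, actual_input: int, actual_output: int) -> None:
        """After the response, replace the estimate with the reported usage."""
        self.used += (actual_input + actual_output) - reserved


budget = TokenBudget(tokens_per_minute=1_000)
prompt = "x" * 400                                   # about 100 tokens by the heuristic
reserved = budget.reserve(prompt, max_tokens=500)    # charges 600
blocked = budget.reserve(prompt, max_tokens=500)     # would reach 1 200: rejected
budget.reconcile(reserved, actual_input=100, actual_output=50)
print(reserved, blocked, budget.used)  # 600 None 150
```

Reserving the worst case is conservative: it can reject requests that would have fit, but it never lets concurrent calls overshoot the budget while their outputs are still being generated.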

Summary

Denial-of-wallet is the LLM-specific threat that makes your service unprofitable rather than unavailable. The defense is a set of layers, all enforced per identity at the gateway:

Token bucket for RPM/TPM enforcement — bursts allowed, sustained rate controlled.
Sliding window over fixed-window — no boundary-burst artifacts.
max_tokens on every request — per-call runaway prevention.
Daily $ cap + alerts — the financial circuit-breaker.
Auth everywhere — limits are meaningless without identity.

Each layer is cheap to add and cheap to operate. The cost of skipping any one of them is a bill you discover at midnight.

Quick check


Q1. A token bucket is configured with C = 120 and r = 2 tokens/s. A client empties the bucket, stays idle for 30 seconds, then sends 80 requests simultaneously, each costing 1 unit. How many are allowed?

Q2. Your inference gateway enforces a fixed-window limit of 100 requests per minute. An attacker sends 100 requests at 14:07:59 and another 100 at 14:08:01. How many requests pass in total?

Q3. You enforce an RPM limit of 60 requests/minute but no token limit. An attacker sends 60 requests per minute, each with a 100 000-token input and requesting 4 000-token outputs. Why is this still a problem?
