Circuit breakers & resilience
Your database (or model provider) goes down mid-query. How timeouts, retries, circuit breakers, and fallbacks stop one failure from taking down everything.
What you'll learn
- How one slow dependency cascades into total failure
- Timeouts and retries with exponential backoff + jitter (and a retry budget)
- The circuit-breaker state machine: closed, open, half-open
- Fallbacks and graceful degradation instead of hanging
- What actually happens when a database dies mid-query
Before you start
Picture the circuit breaker on your home’s fuse board. When a fault draws too much current, a small switch trips and cuts the circuit. The rest of the house keeps its lights on. Nothing explodes. When the fault is cleared, you flip it back and carry on.
Software circuit breakers do the same thing for distributed dependencies: databases, model-provider APIs, vector stores, downstream microservices. Without one, a single failing dependency can cascade silently until your entire service is down — even the parts that never touched that dependency.
Why one failure takes out everything
Imagine an LLM-powered search endpoint. Every request does three things: fetch context from a Postgres database, call the model provider, then return. Now the database goes slow — not down, just slow. Each request sits there waiting. Your web framework hands the request a worker thread (or a connection from a pool). That thread is now occupied, waiting.
New requests keep arriving. Each gets a thread. Each thread parks, waiting on Postgres. Within seconds, the connection pool is exhausted. Now requests that would have been fine — maybe an unrelated health-check endpoint, a cache-only read, a static asset — queue behind the flood of Postgres-waiting threads. The queue grows. Memory climbs. Timeouts fire everywhere.
One slow downstream dependency has consumed the entire resource pool and made the whole service unresponsive. This is a cascading failure: a localised fault that propagates outward to consume healthy parts of the system.
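A back-of-the-envelope way to see why a merely slow dependency exhausts the pool is Little's law: the average number of busy workers equals the arrival rate multiplied by the time each request holds a worker. The sketch below uses made-up numbers (100 requests per second, a pool of 50 workers, 50 ms versus 2 s per query); they are illustrative assumptions, not measurements.

```python
def workers_needed(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average in-flight requests = arrival rate * latency."""
    return arrival_rate_per_s * latency_s


POOL_SIZE = 50        # hypothetical worker pool size
ARRIVAL_RATE = 100.0  # hypothetical requests per second

healthy = workers_needed(ARRIVAL_RATE, 0.05)  # Postgres answers in 50 ms
slow = workers_needed(ARRIVAL_RATE, 2.0)      # Postgres answers in 2 s

print(f"healthy: about {healthy:.0f} of {POOL_SIZE} workers busy")
print(f"slow: about {slow:.0f} workers needed, pool exhausted: {slow > POOL_SIZE}")
```

Under these assumptions the same traffic that kept a handful of workers busy now needs several times more workers than the pool has, so every other endpoint queues behind it.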
Step 1: Timeouts — never wait forever
The simplest protection is a timeout on every external call. A connect timeout caps how long you’ll wait to establish the connection. A read timeout caps how long you’ll wait for a response after connecting. Together they put a hard upper bound on how long a thread can be held hostage by a slow dependency.
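import httpx

# Connect and read timeouts both set; never omit the read timeout.
# httpx.Timeout needs a default (here 10 s, also used for write and pool)
# unless all four timeouts are passed explicitly.
resp = httpx.post(
    "https://api.example.com/query",
    json={"prompt": "..."},
    timeout=httpx.Timeout(10.0, connect=2.0, read=10.0),
)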
Without a read timeout, a single hung upstream call can park a thread indefinitely. In a thread-pool server, N such calls, where N is your pool size, halt all request processing.
Step 2: Retries with exponential backoff and jitter
Some failures are transient: a brief network hiccup, a 429 rate-limit, a momentary blip. Retrying makes sense. But naive retries — immediately retrying every failure — make outages catastrophic.
If ten thousand clients hit a temporary error at the same time and all retry immediately, the dependency faces a synchronized burst of ten thousand extra requests on top of normal traffic, at the exact moment it is least able to handle it. This is the thundering herd (or retry storm). The dependency, which might have recovered in two seconds, is now buried under an amplified load and stays down far longer.
The solution is exponential backoff with jitter:
- Exponential backoff: after attempt N, wait base * 2^N seconds before the next attempt (e.g., 1 s, 2 s, 4 s, 8 s, …). This gives the dependency time to recover.
- Jitter: add a random offset to each wait (e.g., multiply by a uniform random number in [0.5, 1.5]). This de-synchronises clients: instead of 10,000 clients all retrying at exactly t=2 s, they spread across a window. The retry storm becomes a gentle drizzle.
- Cap attempts: never retry more than 3-5 times on the same request. Beyond that the user is better served by a fast error.
- Retry budget: cap retries as a fraction of traffic (e.g., retry traffic must not exceed 10% of normal traffic). This prevents a mass-failure event from tripling your load on an already-struggling service.
- Retry only failures that can succeed on a later attempt: retry on 429, 503, and transient network errors. Do NOT retry on 400, 401, or 404 (those will never succeed), and be careful with 500 (retrying a non-idempotent mutation might apply it twice; use idempotency keys to guard against that).
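The rules above can be sketched in a few lines. This is a minimal illustration, not a production retry library: `RetryableError`, `RetryBudget`, `backoff_delay`, and `call_with_retries` are hypothetical names, the budget is a simple lifetime ratio rather than a sliding window, and none of it is thread-safe.

```python
import random
import time

RETRYABLE_STATUS = {429, 503}  # statuses worth retrying, per the list above


class RetryableError(Exception):
    """Hypothetical exception the caller raises for a transient failure."""


def is_retryable_status(status):
    return status in RETRYABLE_STATUS


def backoff_delay(attempt, base=1.0, cap=30.0, rng=None):
    """Delay after 0-based attempt `attempt`: capped base * 2**attempt,
    multiplied by a uniform jitter factor in [0.5, 1.5]."""
    rng = rng if rng is not None else random
    exponential = min(cap, base * (2 ** attempt))
    return exponential * rng.uniform(0.5, 1.5)


class RetryBudget:
    """Allow retries only while they stay within `ratio` of first attempts."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        if self.retries + 1 > self.ratio * self.requests:
            return False  # budget exhausted: do not add load
        self.retries += 1
        return True


def call_with_retries(fn, max_attempts=4, budget=None, sleep=time.sleep, rng=None):
    """Call fn(), retrying RetryableError up to max_attempts total attempts."""
    if budget is not None:
        budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            out_of_attempts = attempt == max_attempts - 1
            over_budget = budget is not None and not budget.try_spend()
            if out_of_attempts or over_budget:
                raise  # give the caller a fast error instead of more retries
            sleep(backoff_delay(attempt, rng=rng))
```

Passing `sleep` and `rng` as parameters is a design choice that keeps the function testable without real waiting; a caller would normally leave both at their defaults.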
Step 3: The circuit breaker — the hero pattern
Timeouts and retries are good. But if a dependency is down for 30 seconds, every request during that window still waits for the timeout to fire before it fails. With a 10 s timeout, a 30 s outage means three full waves of request timeouts, with threads held for the full 10 s each time.
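To put rough numbers on that cost, the sketch below estimates how many thread-seconds are spent waiting on a dead dependency. The traffic rate of 100 requests per second is an assumption, and the model is deliberately crude: every call runs to its full timeout, and in the breaker case the breaker opens when the first timeouts fire and later calls cost nothing (half-open trial calls are ignored).

```python
def thread_seconds_held(rate_per_s, outage_s, timeout_s, breaker=False):
    """Rough thread-seconds spent waiting during an outage.

    Without a breaker, every request arriving during the outage holds a
    thread for the full timeout. With a breaker, only requests arriving
    before the first timeouts fire (the first timeout_s seconds) do.
    """
    exposed_s = min(outage_s, timeout_s) if breaker else outage_s
    return rate_per_s * exposed_s * timeout_s


without_breaker = thread_seconds_held(100, outage_s=30, timeout_s=10)
with_breaker = thread_seconds_held(100, outage_s=30, timeout_s=10, breaker=True)

print(without_breaker, with_breaker)
```

Under these assumptions the breaker cuts the waiting to a third, and the saving grows with the length of the outage. It also shows that a breaker does not replace short timeouts: the time to the first failures still depends on them.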
A circuit breaker makes failure instantaneous once you know the dependency is down. It wraps a dependency call with a state machine that has exactly three states.
CLOSED is the default. Calls pass through to the dependency. The breaker counts failures within a rolling time window (e.g., “5 failures in the last 10 seconds”). As long as failures stay below the threshold, it stays closed.
When failures hit the threshold, the breaker trips to OPEN. Now every call immediately returns an error (or the fallback) without touching the dependency at all. This is fail fast: the caller gets an answer in microseconds rather than waiting for a timeout. The dependency gets silence — no more hammering from a client it can’t serve — which gives it room to recover.
After a cooldown period (e.g., 30 seconds), the breaker moves to HALF-OPEN and allows a small number of trial requests through. If they succeed, the dependency has recovered: the breaker resets to CLOSED. If they fail, the dependency is still struggling: the breaker jumps back to OPEN and restarts the cooldown.
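The three states and their transitions can be written down as a small lookup table. This is only a sketch of the state machine described above; the `Event` names are made up for illustration, and counting failures or measuring the cooldown is left out.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Event(Enum):  # hypothetical event names
    THRESHOLD_REACHED = "threshold_reached"  # failures hit the threshold
    COOLDOWN_ELAPSED = "cooldown_elapsed"
    TRIAL_SUCCEEDED = "trial_succeeded"
    TRIAL_FAILED = "trial_failed"


TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_REACHED): State.OPEN,
    (State.OPEN, Event.COOLDOWN_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.TRIAL_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.TRIAL_FAILED): State.OPEN,
}


def next_state(state, event):
    """Return the state after `event`; unlisted pairs leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


# Walk one full trip-and-recover cycle.
state = State.CLOSED
for event in (Event.THRESHOLD_REACHED, Event.COOLDOWN_ELAPSED, Event.TRIAL_SUCCEEDED):
    state = next_state(state, event)
    print(event.value, "->", state.value)
```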
The two key benefits:
- Fast failure is better UX than a 30-second hang. A 503 with a Retry-After header lets the client back off gracefully. An indefinite hang destroys perceived quality.
- Silence protects recovery. A database that is struggling to come back up doesn’t need 10,000 retries per second on top. The breaker’s OPEN state is that silence.
A minimal circuit breaker in Python
import threading
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=30.0, trial_limit=2):
        self.threshold = failure_threshold
        self.cooldown = cooldown
        self.trial_limit = trial_limit
        self._state = State.CLOSED
        self._failures = 0      # failures since the last success
        self._opened_at = None  # monotonic time of the last trip
        self._trials = 0        # trial calls admitted while HALF_OPEN
        self._lock = threading.Lock()

    def call(self, fn, *args, fallback=None, **kwargs):
        with self._lock:
            if self._state == State.OPEN:
                elapsed = time.monotonic() - self._opened_at
                if elapsed >= self.cooldown:
                    # Cooldown over: start probing the dependency.
                    self._state = State.HALF_OPEN
                    self._trials = 0
                else:
                    # Fail fast without touching the dependency.
                    return fallback() if callable(fallback) else fallback
            if self._state == State.HALF_OPEN:
                if self._trials >= self.trial_limit:
                    # Trial slots are used up; refuse extra calls.
                    return fallback() if callable(fallback) else fallback
                self._trials += 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            with self._lock:
                self._failures += 1
                if self._state == State.HALF_OPEN or self._failures >= self.threshold:
                    # A failed trial or too many failures: (re)open the breaker.
                    self._state = State.OPEN
                    self._opened_at = time.monotonic()
            raise
        with self._lock:
            if self._state in (State.HALF_OPEN, State.CLOSED):
                self._failures = 0
                self._state = State.CLOSED
        return result
Production use: prefer a battle-tested library (tenacity + a custom breaker, or pybreaker, or the circuit-breaker built into service meshes like Istio/Envoy).
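The CLOSED state was described earlier as counting failures in a rolling time window ("5 failures in the last 10 seconds"), whereas the minimal breaker keeps a plain counter that resets on success. One way to track a rolling window is a queue of failure timestamps, sketched below. `RollingFailureWindow` is a hypothetical helper, not part of any library, and it is not thread-safe on its own.

```python
import time
from collections import deque


class RollingFailureWindow:
    """Counts failures recorded within the last `window_s` seconds."""

    def __init__(self, window_s=10.0, clock=time.monotonic):
        self.window_s = window_s
        self._clock = clock        # injectable so tests can fake time
        self._failures = deque()   # failure timestamps, oldest first

    def record_failure(self):
        self._failures.append(self._clock())

    def count(self):
        # Drop timestamps that have aged out of the window, then count.
        cutoff = self._clock() - self.window_s
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures)
```

A breaker would call `record_failure()` in its exception path and trip when `count()` reaches the threshold, so that a few old failures spread over an hour never open the circuit.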
Step 4: Fallbacks and graceful degradation
When the breaker is open, don’t just return a bare 500. Degrade gracefully:
- Cached/stale response: if you queried the model five minutes ago and have the result cached, return it with a staleness warning.
- Cheaper/smaller model: if claude-opus-4 is down, fall back to claude-haiku-4 for non-critical paths.
- Static default: a search endpoint can return “here are our top 10 results” while personalization is down.
- 503 + Retry-After: if you genuinely have nothing to serve, tell the client exactly when to come back. This is infinitely more useful than a cryptic 500.
The goal is that a dependency failure should degrade quality, not cause complete unavailability.
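As a rough sketch of that ordering, the function below tries a cached answer first, then a static default, and only then returns a 503 with Retry-After. The `Response` type, the `degraded_response` name, and the `X-Degraded` header are assumptions made for illustration; the cache is any object with a dict-style `get`.

```python
from dataclasses import dataclass, field


@dataclass
class Response:  # hypothetical minimal response type
    status: int
    body: str
    headers: dict = field(default_factory=dict)


def degraded_response(query, cache, static_default=None, retry_after_s=30):
    """Pick the best available fallback while the breaker is open."""
    cached = cache.get(query)
    if cached is not None:
        # Stale but useful: flag it so the client can show a warning.
        return Response(200, cached, {"X-Degraded": "stale-cache"})
    if static_default is not None:
        return Response(200, static_default, {"X-Degraded": "static-default"})
    # Nothing to serve: tell the client when to come back.
    return Response(
        503,
        "service temporarily unavailable",
        {"Retry-After": str(retry_after_s)},
    )
```

A function like this could be passed as the fallback to a circuit breaker, for example wrapped in a lambda that closes over the query and the cache.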
Step 5: Bulkheads — isolate resource pools
Even with circuit breakers, a single shared connection pool is a risk. A bulkhead assigns separate resource pools to different dependencies (or different traffic classes). Named after the watertight compartments in a ship’s hull — one compartment floods, the others don’t.
In practice this means: Postgres gets its own connection pool (max 20 connections), the model-provider HTTP client has its own thread pool (max 10 workers), and the vector-store client has a third. An outage at the model provider saturates only the model pool, and further model calls wait or are refused there. It does not consume Postgres connections, and vice versa.
from concurrent.futures import ThreadPoolExecutor
# Each dependency gets its own executor — its own bulkhead
_db_pool = ThreadPoolExecutor(max_workers=20)
_model_pool = ThreadPoolExecutor(max_workers=10)
_vector_pool = ThreadPoolExecutor(max_workers=8)
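Separate executors isolate the workers, but a ThreadPoolExecutor queues submitted work without bound once its workers are busy, so callers can still pile up behind a dead dependency. One way to refuse work instead of queueing it is a semaphore-based bulkhead. The sketch below is an illustration under that assumption; `Bulkhead` and `BulkheadFull` are hypothetical names.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a bulkhead has no free slot (hypothetical name)."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately rather than queue.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name} bulkhead is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per dependency, mirroring the pool sizes above.
db_bulkhead = Bulkhead("postgres", max_concurrent=20)
model_bulkhead = Bulkhead("model-provider", max_concurrent=10)
```

When the model provider hangs, at most 10 callers are stuck in `model_bulkhead`; the eleventh gets `BulkheadFull` right away and can take a fallback path.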
What actually happens when the database dies mid-query
This is the concrete scenario worth walking through step by step.
- The open transaction rolls back. Postgres implements ACID atomicity: if the connection is lost before COMMIT, the transaction never commits. No partial writes. The data is safe.
- The connection pool detects dead connections. Most pools (SQLAlchemy, psycopg3, pgbouncer) have a liveness check. Dead connections are removed from the pool; the pool tries to re-establish up to its configured minimum.
- The circuit breaker trips. Connection failures (or a flood of OperationalError exceptions) hit the failure threshold. The breaker opens. New queries fail fast with an error rather than waiting for timeouts.
- New requests get a fast 503 + Retry-After. Instead of hanging for 30 seconds watching a timeout, users see a clear error in milliseconds. The Retry-After header tells clients when to retry.
- Degraded reads continue where possible. If you have a read replica, a caching layer (Redis), or static fallback data, serve them. Writes queue (with a job queue) or are rejected with a clear error.
- Idempotency keys make client retries safe. When the database comes back and the client retries the failed mutation, an idempotency key (a UUID the client sends and the server records in a deduplicated-operations table) ensures the operation is applied exactly once, even if the first attempt’s outcome was uncertain.
- The breaker moves to HALF-OPEN. After the cooldown, trial queries go through. The pool establishes fresh connections. Trials succeed. The breaker closes. Service resumes.
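The idempotency-key step can be sketched with an in-memory stand-in for the deduplicated-operations table. `IdempotentStore` and `apply_credit` are hypothetical names; in a real database the key would sit under a unique constraint and be written in the same transaction as the mutation, which this sketch does not model.

```python
import uuid


class IdempotentStore:
    """In-memory stand-in for a deduplicated-operations table."""

    def __init__(self):
        self._results = {}  # idempotency key -> recorded result
        self.balance = 0

    def apply_credit(self, key, amount):
        if key in self._results:
            # Replay of an already applied operation: return the recorded
            # outcome instead of applying the credit a second time.
            return self._results[key]
        self.balance += amount
        self._results[key] = self.balance
        return self.balance


store = IdempotentStore()
key = str(uuid.uuid4())                # generated once by the client

first = store.apply_credit(key, 50)    # original attempt
retry = store.apply_credit(key, 50)    # client retry after the outage
print(first, retry, store.balance)
```

Because the retry carries the same key, the second call is recognised as a replay and the credit is applied once.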
Putting it together: the resilience stack
The patterns compose in layers. Innermost to outermost:
- Timeout — bound every call. No infinite waits.
- Retry with backoff + jitter + budget — recover from transient failures without amplifying load.
- Circuit breaker — after sustained failure, stop calling the dependency and fail fast.
- Fallback / graceful degradation — when the breaker is open, serve something useful.
- Bulkhead — isolate pools so one failure domain can’t consume resources from another.
Each layer catches a different failure mode. Together they mean that a dependency going down for 2 minutes causes a 2-minute degradation in that feature — not a 2-minute total outage.