datarekha

Circuit breakers & resilience

Your database (or model provider) goes down mid-query. How timeouts, retries, circuit breakers, and fallbacks stop one failure from taking down everything.

9 min read Advanced Generative AI Lesson 24 of 24

What you'll learn

  • How one slow dependency cascades into total failure
  • Timeouts and retries with exponential backoff + jitter (and a retry budget)
  • The circuit-breaker state machine: closed, open, half-open
  • Fallbacks and graceful degradation instead of hanging
  • What actually happens when a database dies mid-query

Before you start

Picture the circuit breaker on your home’s fuse board. When a fault draws too much current, a small switch trips and cuts the circuit. The rest of the house keeps its lights on. Nothing explodes. When the fault is cleared, you flip it back and carry on.

Software circuit breakers do the same thing for distributed dependencies: databases, model-provider APIs, vector stores, downstream microservices. Without one, a single failing dependency can cascade silently until your entire service is down — even the parts that never touched that dependency.

Why one failure takes out everything

Imagine an LLM-powered search endpoint. Every request does three things: fetch context from a Postgres database, call the model provider, then return. Now the database goes slow — not down, just slow. Each request sits there waiting. Your web framework hands the request a worker thread (or a connection from a pool). That thread is now occupied, waiting.

New requests keep arriving. Each gets a thread. Each thread parks, waiting on Postgres. Within seconds, the connection pool is exhausted. Now requests that would have been fine — maybe an unrelated health-check endpoint, a cache-only read, a static asset — queue behind the flood of Postgres-waiting threads. The queue grows. Memory climbs. Timeouts fire everywhere.

One slow downstream dependency has consumed the entire resource pool and made the whole service unresponsive. This is a cascading failure: a localised fault that propagates outward to consume healthy parts of the system.

[Figure: two panels. Without a circuit breaker: app threads wait on a down or slow database until the pool is exhausted, unrelated endpoints fail too, and the result is a total outage. With a circuit breaker: the breaker is OPEN, the database is not called, and the app fails fast with 503 + Retry-After (or a cached fallback) and stays up.]
Without a breaker, a slow dependency exhausts the whole pool and causes total failure. With the breaker open, calls fail immediately and the app stays healthy.
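A rough way to see how fast the pool fills is Little's law: the average number of busy threads equals the arrival rate times how long each request holds a thread. The sketch below applies it; the pool size, traffic, and latency figures are illustrative assumptions, not measurements.

```python
def threads_in_use(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate x time each request holds a thread."""
    return arrival_rate_per_s * latency_s


POOL_SIZE = 50        # assumed worker-thread pool size
ARRIVAL_RATE = 200.0  # assumed requests per second

healthy = threads_in_use(ARRIVAL_RATE, 0.05)  # database answers in 50 ms
degraded = threads_in_use(ARRIVAL_RATE, 5.0)  # database now takes 5 s

print(f"healthy:  {healthy:.0f} threads busy, pool has {POOL_SIZE}")
print(f"degraded: {degraded:.0f} threads needed, pool has {POOL_SIZE}")
```

Under these assumed numbers the same traffic needs about 10 threads when the database is healthy and about 1,000 when it is slow, which is why a 50-thread pool is gone within a fraction of a second.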

Step 1: Timeouts — never wait forever

The simplest protection is a timeout on every external call. A connect timeout caps how long you’ll wait to establish the connection. A read timeout caps how long you’ll wait for a response after connecting. Together they put a hard upper bound on how long a thread can be held hostage by a slow dependency.

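import httpx

# Connect timeout of 2 s; the 10 s default covers read, write, and pool.
# httpx.Timeout needs either a default or all four values set explicitly,
# so passing only connect= and read= raises a ValueError.
resp = httpx.post(
    "https://api.example.com/query",
    json={"prompt": "..."},
    timeout=httpx.Timeout(10.0, connect=2.0),
)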

Without a read timeout, a single hung upstream call can park a thread indefinitely. In a thread-pool server, N such calls, where N is your pool size, halt all request processing.
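To watch a read timeout fire without any network access, the sketch below swaps httpx for a local socket pair from the standard library. The helper name and the 50 ms timeout are illustrative assumptions; the point is only that a bounded read returns control to the caller instead of blocking forever.

```python
import socket


def recv_with_read_timeout(sock: socket.socket, read_timeout_s: float):
    """Return received bytes, or None if nothing arrives within read_timeout_s."""
    sock.settimeout(read_timeout_s)
    try:
        return sock.recv(1024)
    except socket.timeout:  # alias of TimeoutError on Python 3.10+
        return None


# A connected socket pair stands in for client and server; no network needed.
client, server = socket.socketpair()

silent = recv_with_read_timeout(client, 0.05)    # server never replies: gives up after 50 ms
server.sendall(b"pong")
answered = recv_with_read_timeout(client, 0.05)  # data is waiting: returns immediately

client.close()
server.close()
```

The first call gives up after the timeout and the thread is free again; without `settimeout`, the same `recv` would block until the peer sent data or closed the connection.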

Step 2: Retries with exponential backoff and jitter

Some failures are transient: a brief network hiccup, a 429 rate-limit, a momentary blip. Retrying makes sense. But naive retries — immediately retrying every failure — make outages catastrophic.

If ten thousand clients hit a temporary error at the same time and all retry immediately, the dependency faces a synchronized burst of ten thousand extra requests at the exact moment it is least able to handle it, and every further failure triggers another round. This is the thundering herd (or retry storm). The dependency, which might have recovered in two seconds, is now buried under an amplified load and stays down far longer.

The solution is exponential backoff with jitter:

  • Exponential backoff: after the Nth failed attempt (counting from N = 0), wait base * 2^N seconds before the next attempt (e.g., with base = 1 s: 1 s, 2 s, 4 s, 8 s, …). This gives the dependency time to recover.
  • Jitter: add a random offset to each wait (e.g., multiply by a uniform random number in [0.5, 1.5]). This de-synchronises clients — instead of 10,000 clients all retrying at exactly t=2 s, they spread across a window. The retry storm becomes a gentle drizzle.
  • Cap attempts: never retry more than 3-5 times on the same request. Beyond that the user is better served by a fast error.
  • Retry budget: cap retries as a fraction of traffic (e.g., retry traffic must not exceed 10% of normal traffic). This prevents a mass-failure event from tripling your load on an already-struggling service.
  • Retry only retryable failures: retry on 429, 503, and transient network errors. Do NOT retry on 400, 401, or 404 (those will never succeed), and be careful with 500: retrying a non-idempotent mutation might apply it twice, so use idempotency keys to guard against that.
[Figure: Retry strategies, naive vs backoff + jitter. Two timelines from 0 s to 5 s. Naive: all clients retry at once in a retry storm and the dependency stays overwhelmed. Exponential backoff + jitter: retries from clients A, B, and C are spread across time, giving the dependency breathing room to recover.]
Naive retries concentrate load exactly when the dependency is most vulnerable. Exponential backoff + jitter spreads attempts over time, letting the service recover.
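The rules above can be sketched in a few lines. This is a minimal illustration, not a production retry layer: `send` is a hypothetical zero-argument callable that returns an HTTP status code, `RetryBudget` is a simplified, non-thread-safe counter, and the defaults (4 attempts, 10% budget) are assumptions.

```python
import random
import time

RETRYABLE_STATUS = {429, 503}  # 400, 401, and 404 are returned to the caller as-is


def backoff_delay(attempt: int, base: float = 1.0, rng=random) -> float:
    """Delay after failed attempt `attempt` (0-based): base * 2^attempt, jittered by [0.5, 1.5]."""
    return base * (2 ** attempt) * rng.uniform(0.5, 1.5)


class RetryBudget:
    """Permit retries only while they stay within `ratio` of first-attempt requests."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        if self.retries + 1 > self.ratio * self.requests:
            return False  # budget exhausted: fail fast instead of adding load
        self.retries += 1
        return True


def call_with_retries(send, max_attempts: int = 4, budget=None, sleep=time.sleep, rng=random):
    """Call send() until it returns a non-retryable status or attempts/budget run out."""
    if budget is not None:
        budget.record_request()
    for attempt in range(max_attempts):
        status = send()
        if status not in RETRYABLE_STATUS:
            return status
        if attempt == max_attempts - 1:
            break  # attempts capped
        if budget is not None and not budget.try_spend():
            break  # retry budget spent
        sleep(backoff_delay(attempt, rng=rng))
    return status
```

Passing `sleep` and `rng` as parameters is a testing convenience: a test can record the delays instead of waiting for them. With a 10% budget, a burst of failures across ten requests earns only one retry in total, which is the load cap the budget is meant to enforce.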

Step 3: The circuit breaker — the hero pattern

Timeouts and retries are good. But if a dependency is down for 30 seconds, every request during that window still waits for the timeout to fire before it fails. With a 10 s timeout, a 30 s outage means three full waves of request timeouts, with threads held for the full 10 s each time.

A circuit breaker makes failure instantaneous once you know the dependency is down. It wraps a dependency call with a state machine that has exactly three states.

[Figure: Circuit breaker state machine. CLOSED (calls pass through, failures counted) moves to OPEN when failures ≥ threshold (e.g. 5 failures in 10 s). OPEN (fail fast, dependency not called) moves to HALF-OPEN once the cooldown elapses (e.g. 30 s). HALF-OPEN (trial requests allowed, probing recovery) returns to CLOSED and resets the failure count if a trial succeeds, or returns to OPEN and resets the cooldown if a trial fails.]
The circuit breaker state machine. CLOSED is normal. OPEN means fail fast (dependency not called). HALF-OPEN probes whether the dependency has recovered.

CLOSED is the default. Calls pass through to the dependency. The breaker counts failures within a rolling time window (e.g., “5 failures in the last 10 seconds”). As long as failures stay below the threshold, it stays closed.

When failures hit the threshold, the breaker trips to OPEN. Now every call immediately returns an error (or the fallback) without touching the dependency at all. This is fail fast: the caller gets an answer in microseconds rather than waiting for a timeout. The dependency gets silence — no more hammering from a client it can’t serve — which gives it room to recover.

After a cooldown period (e.g., 30 seconds), the breaker moves to HALF-OPEN and allows a small number of trial requests through. If they succeed, the dependency has recovered: the breaker resets to CLOSED. If they fail, the dependency is still struggling: the breaker jumps back to OPEN and restarts the cooldown.

The two key benefits:

  1. Fast failure is better UX than a 30-second hang. A 503 with a Retry-After header lets the client back off gracefully. An indefinite hang destroys perceived quality.
  2. Silence protects recovery. A database that is struggling to come back up doesn’t need 10,000 retries per second on top. The breaker’s OPEN state is that silence.

A minimal circuit breaker in Python

import threading
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    # Simplification: counts consecutive failures (reset on any success),
    # not failures inside a rolling time window.
    def __init__(self, failure_threshold=5, cooldown=30.0, trial_limit=2):
        self.threshold = failure_threshold  # failures that trip the breaker
        self.cooldown = cooldown            # seconds to stay OPEN before probing
        self.trial_limit = trial_limit      # trial calls admitted while HALF_OPEN
        self._state = State.CLOSED
        self._failures = 0
        self._opened_at = None
        self._trials = 0
        self._lock = threading.Lock()

    def call(self, fn, *args, fallback=None, **kwargs):
        with self._lock:
            if self._state == State.OPEN:
                elapsed = time.monotonic() - self._opened_at
                if elapsed < self.cooldown:
                    # Still cooling down: fail fast, dependency not called
                    return fallback() if callable(fallback) else fallback
                # Cooldown elapsed: start probing
                self._state = State.HALF_OPEN
                self._trials = 0
            if self._state == State.HALF_OPEN:
                if self._trials >= self.trial_limit:
                    # still probing; refuse extra calls
                    return fallback() if callable(fallback) else fallback
                self._trials += 1  # count this call as a trial

        try:
            result = fn(*args, **kwargs)
            with self._lock:
                if self._state in (State.HALF_OPEN, State.CLOSED):
                    # Success: reset the count and close the breaker
                    self._failures = 0
                    self._state = State.CLOSED
            return result
        except Exception:
            with self._lock:
                self._failures += 1
                if self._state == State.HALF_OPEN:
                    # A failed trial reopens the breaker and restarts the cooldown
                    self._state = State.OPEN
                    self._opened_at = time.monotonic()
                elif self._failures >= self.threshold:
                    self._state = State.OPEN
                    self._opened_at = time.monotonic()
            raise

Production use: prefer a battle-tested option such as pybreaker (paired with tenacity if you also need retries, since tenacity itself only handles retrying), or the circuit breaking built into service meshes like Istio/Envoy.
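The minimal class above resets its count on any success, so it effectively tracks consecutive failures. The rolling window described earlier ("5 failures in the last 10 seconds") could be tracked with a small helper like the one below. This is a sketch under stated assumptions: the class name is hypothetical, it is not thread-safe on its own, and the injectable `clock` exists only to make it testable.

```python
import time
from collections import deque


class RollingFailureWindow:
    """Keeps failure timestamps and reports when the threshold is reached within the window."""

    def __init__(self, threshold: int = 5, window_s: float = 10.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self._clock = clock
        self._failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self) -> bool:
        """Record one failure; return True if the breaker should trip."""
        now = self._clock()
        self._failures.append(now)
        # Drop failures that have aged out of the window.
        while now - self._failures[0] > self.window_s:
            self._failures.popleft()
        return len(self._failures) >= self.threshold
```

A breaker would call `record_failure()` in its exception handler and open when it returns True. Old failures age out on their own, so a slow trickle of errors spread over minutes does not trip the breaker.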

Step 4: Fallbacks and graceful degradation

When the breaker is open, don’t just return a bare 500. Degrade gracefully:

  • Cached/stale response: if you queried the model five minutes ago and have the result cached, return it with a staleness warning.
  • Cheaper/smaller model: if claude-opus-4 is down, fall back to claude-haiku-4 for non-critical paths.
  • Static default: a search endpoint can return “here are our top 10 results” while personalization is down.
  • 503 + Retry-After: if you genuinely have nothing to serve, tell the client exactly when to come back. This is infinitely more useful than a cryptic 500.

The goal is that a dependency failure should degrade quality, not cause complete unavailability.
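One way to express that ordering in code is a fallback chain that tries each option in turn. The sketch below is illustrative only: `Response`, `answer_with_fallbacks`, and the `X-Served-From` header are hypothetical names, `primary` stands for whatever call sits behind the breaker, and `cache` is a plain dict standing in for a real cache.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    status: int
    body: str
    headers: dict = field(default_factory=dict)


def answer_with_fallbacks(query, primary, cache, static_default=None, retry_after_s=30):
    """Try the primary dependency, then degrade step by step."""
    try:
        return Response(200, primary(query))
    except Exception:
        pass  # primary failed or its breaker is open; fall through to degraded options

    if query in cache:
        # Stale but useful; the header is a hypothetical staleness marker.
        return Response(200, cache[query], {"X-Served-From": "stale-cache"})
    if static_default is not None:
        return Response(200, static_default, {"X-Served-From": "static-default"})
    # Nothing to serve: tell the client when to come back.
    return Response(503, "temporarily unavailable", {"Retry-After": str(retry_after_s)})
```

The order encodes a preference: a fresh answer, then a stale one, then a generic one, and only then an explicit 503 with a retry hint.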

Step 5: Bulkheads — isolate resource pools

Even with circuit breakers, a single shared connection pool is a risk. A bulkhead assigns separate resource pools to different dependencies (or different traffic classes). Named after the watertight compartments in a ship’s hull — one compartment floods, the others don’t.

In practice this means: Postgres gets its own connection pool (max 20 connections), the model-provider HTTP client has its own thread pool (max 10 workers), and the vector-store client has a third. An outage at the model provider saturates only the model pool, so further model calls wait or are refused there; it does not consume Postgres connections, and vice versa.

from concurrent.futures import ThreadPoolExecutor

# Each dependency gets its own executor — its own bulkhead
_db_pool       = ThreadPoolExecutor(max_workers=20)
_model_pool    = ThreadPoolExecutor(max_workers=10)
_vector_pool   = ThreadPoolExecutor(max_workers=8)
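A plain ThreadPoolExecutor queues extra work without limit rather than refusing it, so a bulkhead that should reject calls when full needs an explicit bound. The sketch below adds one with a semaphore; `Bulkhead` and `BulkheadFull` are hypothetical names, and the limits are the illustrative ones from above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BulkheadFull(Exception):
    """Raised when a bulkhead has no free slot."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)

    def submit(self, fn, *args, **kwargs):
        """Run fn in this bulkhead's pool, or raise BulkheadFull if every slot is busy."""
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name} bulkhead is saturated")
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot once the call finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._slots.release())
        return future


db_bulkhead = Bulkhead("db", max_concurrent=20)
model_bulkhead = Bulkhead("model", max_concurrent=10)
```

If every `model_bulkhead` slot is stuck on a hung provider call, the next model call fails immediately with `BulkheadFull`, while `db_bulkhead.submit(...)` keeps working because it draws on separate slots and threads.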

What actually happens when the database dies mid-query

This is the concrete scenario worth walking through step by step.

  1. The open transaction rolls back. Postgres implements ACID atomicity: if the connection is lost before COMMIT, the transaction never commits. No partial writes. The data is safe.

  2. The connection pool detects dead connections. Most pools (SQLAlchemy, psycopg3, pgbouncer) have a liveness check. Dead connections are removed from the pool; the pool tries to re-establish up to its configured minimum.

  3. The circuit breaker trips. Connection failures (or a flood of OperationalError exceptions) hit the failure threshold. The breaker opens. New queries fail fast with an error rather than waiting for timeouts.

  4. New requests get a fast 503 + Retry-After. Instead of hanging for 30 seconds watching a timeout, users see a clear error in milliseconds. The Retry-After header tells clients when to retry.

  5. Degraded reads continue where possible. If you have a read replica, a caching layer (Redis), or static fallback data, serve them. Writes queue (with a job queue) or are rejected with a clear error.

  6. Idempotency keys make client retries safe. When the database comes back and the client retries the failed mutation, an idempotency key (a UUID the client sends and the server records in a deduplicated-operations table) ensures the operation is applied exactly once — even if the first attempt’s outcome was uncertain.

  7. The breaker moves to HALF-OPEN. After the cooldown, trial queries go through. The pool establishes fresh connections. Trials succeed. The breaker closes. Service resumes.

Putting it together: the resilience stack

The patterns compose in layers. Innermost to outermost:

  1. Timeout — bound every call. No infinite waits.
  2. Retry with backoff + jitter + budget — recover from transient failures without amplifying load.
  3. Circuit breaker — after sustained failure, stop calling the dependency and fail fast.
  4. Fallback / graceful degradation — when the breaker is open, serve something useful.
  5. Bulkhead — isolate pools so one failure domain can’t consume resources from another.

Each layer catches a different failure mode. Together they mean that a dependency going down for 2 minutes causes a 2-minute degradation in that feature — not a 2-minute total outage.

Quick check

Q1. The circuit breaker is OPEN and a new request arrives. What happens?
Q2. A service has a 10-second timeout on DB calls but no circuit breaker. The DB goes down for 60 seconds. What is the likely consequence?
Q3. You add exponential backoff but no jitter to your retry logic. Your service has 5,000 concurrent clients. The dependency fails at t=0. Why is this still dangerous?

Practice this in an interview

Tell me about a time a model or analysis you built failed or underperformed.

Interviewers ask this to test intellectual honesty, ownership, and how you learn from setbacks — not to embarrass you. The strongest answers name a real failure, explain the root cause clearly, describe what you did to fix or contain the damage, and articulate the lasting lesson you carried forward.

How do you safely roll back a model in production and what triggers a rollback?

A rollback reverts serving traffic to a known-good model version when the newly deployed model shows metric regression beyond a tolerance threshold. Safe rollback requires versioned model artifacts, traffic-routing control, and pre-defined automated or manual triggers — not ad hoc decisions under pressure.

How does CI/CD for ML differ from standard software CI/CD, and what stages should an ML pipeline include?

ML CI/CD must validate not just code correctness but also model quality — automated retraining triggers, data validation, model evaluation gates, and canary deployment checks that standard software pipelines have no equivalent for. A regression in model AUC is as much a deployment failure as a 500 error.

When and how should you trigger model retraining — scheduled vs. event-driven?

Scheduled retraining is simple and predictable but wastes compute when nothing has shifted and reacts slowly when drift is sudden. Event-driven retraining ties compute to evidence — a drift alarm, a performance threshold breach, or a data volume trigger — and is more efficient at scale. Most mature systems combine both.
