datarekha

Sampling & Reservoir Sampling

How to draw a perfectly uniform random sample of k items from a stream you cannot load into memory — in one pass, O(n) time, O(k) space.

6 min read Intermediate Data Structures & Algorithms Lesson 31 of 32

What you'll learn

  • Why naive random sampling breaks when you don't know n in advance
  • Algorithm R: keep the first k items, then accept each new item with probability k/i
  • The probability argument that guarantees every item ends with equal chance k/n
  • Real uses: log row sampling, balanced ML training sets, online A/B exposure, Kafka downsampling

Before you start

You are reading a log file that is 500 GB. You want 1,000 random rows — a perfectly uniform sample. You cannot load the file into memory. You do not know how many rows it has. How do you do it?

This is the reservoir sampling problem, and it has an elegant one-pass solution.

The problem, precisely

Given a stream of items arriving one at a time:

  • You want to keep exactly k items at all times.
  • When the stream ends, each item must have had an equal probability of being in the final sample.
  • You have O(k) memory — you cannot buffer the whole stream.
  • You get one pass — you cannot rewind.

Naive approaches fail. If you skip items with a fixed probability p, you do not know p up front because you do not know n. If you collect everything and sample at the end, you violate the memory constraint.

Algorithm R

Vitter’s Algorithm R (1985) solves this in four lines of logic:

  1. Fill the reservoir with the first k items.
  2. For item number i (where i > k), generate a random integer j in [0, i] (inclusive, 0-indexed).
  3. If j < k, replace reservoir[j] with the new item.
  4. Otherwise discard the new item and continue.

The acceptance probability for item i is k/i. As the stream grows, each new item is progressively less likely to displace a current reservoir member — exactly as it should be.

import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)   # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

That is the complete algorithm. One pass, O(k) space, no knowledge of n required.

Why it is uniform: the short proof

The intuition first: at step i we want every one of the i items seen so far to have an equal k/i chance of being in the reservoir. Accepting the new item with probability exactly k/i and evicting a uniformly random existing slot is the only rule that preserves this invariant from step i to step i+1.

Fix any single item at position p (1-indexed, p <= n). What is the probability it ends up in the final reservoir of size k?

Case p <= k (initial items): The item enters the reservoir immediately (probability 1). It survives as long as it is never evicted. For each later item at position i (i > k), the new item is accepted with probability k/i and, if accepted, evicts a uniformly random slot — so the probability it targets our specific slot is (k/i) * (1/k) = 1/i. The probability our item survives position i is therefore (i-1)/i. Multiplying across all positions from k+1 to n (a telescoping product where each numerator cancels the previous denominator):

P(survive) = product_{i=k+1}^{n} (i-1)/i
           = (k/(k+1)) * ((k+1)/(k+2)) * ... * ((n-1)/n)
           = k/n

Case p > k (late items): The item is accepted with probability k/p. Once in the reservoir, it survives each subsequent position i by the same telescoping product, giving:

P(in final) = (k/p) * product_{i=p+1}^{n} (i-1)/i = k/n

Every item, regardless of when it arrived, has final probability k/n. The sample is uniform.

The empirical check

Run this code. It samples k=3 items from a stream of 10 elements 10,000 times and prints each element’s selection frequency. Every element should appear in roughly 30% of runs (k/n = 3/10).

Every row’s frequency should hover near 0.30. The deviations shrink as you increase trials. This is the algorithm’s guarantee made visible.

Why this matters for data and ML

Sampling rows from a giant log or stream is the canonical use case. A 500 GB Nginx log, a Kafka topic with 10 billion events, an S3 file too large to download — you can stream them line by line and maintain a uniform k-item reservoir throughout.

Building a balanced training subset on the fly is another common need. When fitting a model on data arriving from a pipeline, you can keep a representative reservoir of training examples without ever materializing the full dataset.

Online A/B exposure sampling uses the same idea: assign users to experiment buckets as they arrive, without knowing the total population in advance.

Downsampling Kafka streams for dashboards or feature stores: consume at full speed, emit only the reservoir items downstream, and the sample is provably uniform across the whole stream window.

Complexity recap

PropertyValue
TimeO(n) — one pass, O(1) work per item
SpaceO(k) — only the reservoir is kept
Passes1
Requires knowing nNo

For k << n (the common case), the space saving over loading the full stream is enormous.

Quick check

Quick check

0/4
Q1You are running reservoir sampling with k=5 and have processed 20 items so far. The 21st item arrives. What is the probability it is added to the reservoir?
Q2A reservoir of size k=10 has been filled. A new item arrives and is accepted (probability 10/i). Which slot does it replace?
Q3Why is reservoir sampling preferred over 'collect everything, then sample' for large streams?
Q4You adapt reservoir sampling to sample k=1 item uniformly from a stream (the so-called 'random selection' problem). The 5th item arrives. What acceptance probability keeps the sample uniform?

Practice this in an interview

All questions

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content