Sampling & Reservoir Sampling
How to draw a perfectly uniform random sample of k items from a stream you cannot load into memory — in one pass, O(n) time, O(k) space.
What you'll learn
- Why naive random sampling breaks when you don't know n in advance
- Algorithm R: keep the first k items, then accept each new item with probability k/i
- The probability argument that guarantees every item ends with equal chance k/n
- Real uses: log row sampling, balanced ML training sets, online A/B exposure, Kafka downsampling
Before you start
You are reading a log file that is 500 GB. You want 1,000 random rows — a perfectly uniform sample. You cannot load the file into memory. You do not know how many rows it has. How do you do it?
This is the reservoir sampling problem, and it has an elegant one-pass solution.
The problem, precisely
Given a stream of items arriving one at a time:
- You want to keep exactly k items at all times.
- When the stream ends, each item must have had an equal probability of being in the final sample.
- You have O(k) memory — you cannot buffer the whole stream.
- You get one pass — you cannot rewind.
Naive approaches fail. If you skip items with a fixed probability p, you do not know p up front because you do not know n. If you collect everything and sample at the end, you violate the memory constraint.
Algorithm R
Vitter’s Algorithm R (1985) solves this in four lines of logic:
- Fill the reservoir with the first k items.
- For item number i (where
i > k), generate a random integer j in[0, i](inclusive, 0-indexed). - If
j < k, replacereservoir[j]with the new item. - Otherwise discard the new item and continue.
The acceptance probability for item i is k/i. As the stream grows, each new item is progressively less likely to displace a current reservoir member — exactly as it should be.
import random
def reservoir_sample(stream, k):
reservoir = []
for i, item in enumerate(stream):
if i < k:
reservoir.append(item)
else:
j = random.randint(0, i) # inclusive on both ends
if j < k:
reservoir[j] = item
return reservoir
That is the complete algorithm. One pass, O(k) space, no knowledge of n required.
Why it is uniform: the short proof
The intuition first: at step i we want every one of the i items seen so far to have an equal k/i chance of being in the reservoir. Accepting the new item with probability exactly k/i and evicting a uniformly random existing slot is the only rule that preserves this invariant from step i to step i+1.
Fix any single item at position p (1-indexed, p <= n). What is the probability it ends up in the final reservoir of size k?
Case p <= k (initial items): The item enters the reservoir immediately (probability 1). It survives as long as it is never evicted. For each later item at position i (i > k), the new item is accepted with probability k/i and, if accepted, evicts a uniformly random slot — so the probability it targets our specific slot is (k/i) * (1/k) = 1/i. The probability our item survives position i is therefore (i-1)/i. Multiplying across all positions from k+1 to n (a telescoping product where each numerator cancels the previous denominator):
P(survive) = product_{i=k+1}^{n} (i-1)/i
= (k/(k+1)) * ((k+1)/(k+2)) * ... * ((n-1)/n)
= k/n
Case p > k (late items): The item is accepted with probability k/p. Once in the reservoir, it survives each subsequent position i by the same telescoping product, giving:
P(in final) = (k/p) * product_{i=p+1}^{n} (i-1)/i = k/n
Every item, regardless of when it arrived, has final probability k/n. The sample is uniform.
The empirical check
Run this code. It samples k=3 items from a stream of 10 elements 10,000 times and prints each element’s selection frequency. Every element should appear in roughly 30% of runs (k/n = 3/10).
Every row’s frequency should hover near 0.30. The deviations shrink as you increase trials. This is the algorithm’s guarantee made visible.
Why this matters for data and ML
Sampling rows from a giant log or stream is the canonical use case. A 500 GB Nginx log, a Kafka topic with 10 billion events, an S3 file too large to download — you can stream them line by line and maintain a uniform k-item reservoir throughout.
Building a balanced training subset on the fly is another common need. When fitting a model on data arriving from a pipeline, you can keep a representative reservoir of training examples without ever materializing the full dataset.
Online A/B exposure sampling uses the same idea: assign users to experiment buckets as they arrive, without knowing the total population in advance.
Downsampling Kafka streams for dashboards or feature stores: consume at full speed, emit only the reservoir items downstream, and the sample is provably uniform across the whole stream window.
Complexity recap
| Property | Value |
|---|---|
| Time | O(n) — one pass, O(1) work per item |
| Space | O(k) — only the reservoir is kept |
| Passes | 1 |
| Requires knowing n | No |
For k << n (the common case), the space saving over loading the full stream is enormous.
Quick check
Quick check
Practice this in an interview
All questionsMaintain a min-heap of size k. Stream every element through: push it onto the heap, then if the heap exceeds size k, pop the minimum. After processing all elements, the heap's minimum is the kth largest — it is the smallest among the top-k values seen so far.
The main probability sampling methods are simple random sampling, stratified sampling, cluster sampling, and systematic sampling. Bias enters when some units have a zero or systematically different probability of selection — as in convenience sampling, survivorship bias, or non-response bias — making the sample unrepresentative of the target population regardless of size.
Bootstrapping estimates the sampling distribution of a statistic by repeatedly drawing samples with replacement from the observed data and computing the statistic on each resample. It works when the analytic sampling distribution is unknown, intractable, or the sample size is too small for asymptotic approximations to hold.
Count frequencies with a hash map, then use a min-heap of size k to track the top k elements in O(n log k) time. An alternative bucket-sort approach achieves O(n) by indexing buckets by frequency.