Find the kth largest element in an unsorted array without fully sorting it.

Maintain a min-heap of size k. Stream every element through: push it onto the heap, then if the heap exceeds size k, pop the minimum. Once all elements are processed, the heap holds the k largest values, so its minimum (the root) is the kth largest. This takes O(n log k) time and O(k) space.
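A minimal sketch of the size-k min-heap approach, using Python's standard heapq module. The function name kth_largest and the 1-based k are illustrative choices, not part of the original answer.

```python
import heapq

def kth_largest(nums, k):
    """Return the kth largest value in nums (k is 1-based; duplicates count separately)."""
    if not 1 <= k <= len(nums):
        raise ValueError("k must be between 1 and len(nums)")
    heap = []                        # min-heap holding the k largest values seen so far
    for x in nums:
        heapq.heappush(heap, x)
        if len(heap) > k:
            heapq.heappop(heap)      # drop the smallest, keeping only k values
    return heap[0]                   # root of the min-heap is the kth largest

print(kth_largest([3, 2, 1, 5, 6, 4], 2))   # prints 5
```

The heap never grows past k + 1 entries, which is where the O(k) space bound comes from.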

What are the main sampling methods and how can sampling introduce bias?

The main probability sampling methods are simple random sampling, stratified sampling, cluster sampling, and systematic sampling. Bias enters when some units have a zero or systematically different probability of selection — as in convenience sampling, survivorship bias, or non-response bias — making the sample unrepresentative of the target population regardless of size.

What is bootstrapping, and when should you use resampling methods?

Bootstrapping estimates the sampling distribution of a statistic by repeatedly drawing samples with replacement from the observed data and computing the statistic on each resample. It is most useful when the analytic sampling distribution is unknown or intractable, or when the sample size is too small for asymptotic approximations to hold.
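As a rough illustration of the resampling loop described above, here is a sketch of a percentile bootstrap confidence interval. The function name bootstrap_ci, the default of 2,000 resamples, the fixed seed, and the toy data are all assumptions made for the example; the percentile method is only one of several ways to turn resamples into an interval.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)            # seeded only so the sketch is repeatable
    n = len(data)
    estimates = sorted(
        stat(rng.choices(data, k=n))     # resample n points with replacement
        for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]   # toy data
low, high = bootstrap_ci(data)
print(f"mean = {statistics.mean(data):.2f}, approx. 95% CI = ({low:.2f}, {high:.2f})")
```

Passing a different stat (for example statistics.median) reuses the same loop, which is the main appeal when no closed-form standard error is available.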

Return the k most frequent elements in an array.

Count frequencies with a hash map, then use a min-heap of size k to track the top k elements in O(n log k) time. An alternative bucket-sort approach achieves O(n) by indexing buckets by frequency.
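Both approaches can be sketched as follows. The function names are hypothetical, the heap version assumes the elements are mutually comparable (ties in count fall back to comparing values), and both assume k is at least 1.

```python
import heapq
from collections import Counter

def top_k_frequent(nums, k):
    """Heap approach: O(n log k) time."""
    counts = Counter(nums)
    heap = []                                    # min-heap of (count, value) pairs
    for value, count in counts.items():
        heapq.heappush(heap, (count, value))
        if len(heap) > k:
            heapq.heappop(heap)                  # evict the least frequent so far
    return [value for _, value in sorted(heap, reverse=True)]

def top_k_frequent_buckets(nums, k):
    """Bucket approach: O(n) time, buckets indexed by frequency."""
    counts = Counter(nums)
    buckets = [[] for _ in range(len(nums) + 1)] # buckets[c] holds values seen c times
    for value, count in counts.items():
        buckets[count].append(value)
    result = []
    for count in range(len(nums), 0, -1):        # walk from most to least frequent
        for value in buckets[count]:
            result.append(value)
            if len(result) == k:
                return result
    return result

print(top_k_frequent([1, 1, 1, 2, 2, 3], 2))          # prints [1, 2]
print(top_k_frequent_buckets([1, 1, 1, 2, 2, 3], 2))  # prints [1, 2]
```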

Sampling & Reservoir Sampling — DSA

What you'll learn

Why naive random sampling breaks when you don't know n in advance

Algorithm R: keep the first k items, then accept item i with probability k/i

The short proof that every item ends with the same chance, k/n

Real uses: log sampling, balanced training sets, A/B exposure, stream downsampling

You are reading a 500 GB log file and you want 1,000 random rows — a perfectly uniform sample. You cannot fit the file in memory, and you do not even know how many rows it has. How do you do it?

This is the reservoir-sampling problem, and it has a small, beautiful one-pass answer. The constraints are sharp: never hold more than k items, give every item an equal final chance, use only O(k) memory, and take a single pass with no rewind. The two obvious ideas both fail: keeping each item with a fixed probability k/n requires knowing n up front (and does not guarantee exactly k items), while collecting everything and sampling at the end blows the memory budget.

Algorithm R

Vitter’s Algorithm R is four lines of logic: fill the reservoir with the first k items; then, for the i-th item (counting from 1, with i > k), pick a random integer j uniformly from the i values 0 to i − 1, and if j < k, overwrite slot j with the new item. Otherwise drop it. The new item is thus accepted with probability exactly k/i. In the code, i is the 0-based index from enumerate, so the draw is over [0, i] inclusive, which is the same number of choices: one per item seen so far.

Later items are accepted ever more rarely (k/i shrinks), which is exactly what keeps every item’s final chance equal.

import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)            # fill the first k
        else:
            j = random.randint(0, i)          # inclusive on both ends
            if j < k:
                reservoir[j] = item            # accept, evict slot j
    return reservoir

That is the whole algorithm: one pass, O(k) space, and n never needs to be known.

Why it is uniform

Fix any item at position p and ask for its probability of ending in the final reservoir. If it is one of the first k, it starts inside (probability 1) and must merely survive. Each later item i is accepted with probability k/i and, if accepted, evicts one of the k slots uniformly — so it lands on our slot with probability (k/i)·(1/k) = 1/i, meaning our item survives step i with probability (i−1)/i. Multiply those survival odds from k+1 to n and the fractions telescope — each numerator cancels the previous denominator:

(k/(k+1)) · ((k+1)/(k+2)) · … · ((n-1)/n)  =  k/n

A late-arriving item (position p > k) enters with probability k/p and then must survive steps p+1 through n. That survival product telescopes in the same way to p/n, so its overall chance is (k/p) · (p/n) = k/n. Every item, early or late, therefore ends in the reservoir with probability exactly k/n. The sample is uniform. You can watch this hold up empirically:

from collections import Counter
import random

def reservoir_sample(stream, k):
    res = []
    for i, item in enumerate(stream):
        if i < k:
            res.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                res[j] = item
    return res

counts = Counter()
for _ in range(10_000):                         # many trials
    counts.update(reservoir_sample(range(10), 3))
# every item should land in ~30% of samples, since k/n = 3/10

Run it and each of the ten items appears in roughly 3,000 of the 10,000 samples — a frequency near 0.30 for every one of them, early and late alike, with the small deviations shrinking as you add trials. (The exact counts differ run to run, because the sampling is random; what is guaranteed is that they cluster around k/n.)
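The simulation only approximates k/n. The telescoping argument can also be checked exactly with rational arithmetic. The sketch below multiplies the entry and survival probabilities from the proof using fractions.Fraction; the helper name inclusion_probability is made up for this example, and positions are 1-based as in the proof.

```python
from fractions import Fraction

def inclusion_probability(p, k, n):
    """Exact chance that the item at 1-based position p is in the final reservoir."""
    prob = Fraction(1) if p <= k else Fraction(k, p)   # chance of entering at all
    for i in range(max(p, k) + 1, n + 1):
        prob *= Fraction(i - 1, i)                     # survive the arrival of item i
    return prob

k, n = 3, 10
probs = [inclusion_probability(p, k, n) for p in range(1, n + 1)]
print(probs[0], probs[-1])                             # prints 3/10 3/10
assert all(pr == Fraction(k, n) for pr in probs)
```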

Where it earns its keep

Property	Value
Time	O(n) — one pass, O(1) per item
Space	O(k) — just the reservoir
Passes	1, and `n` need not be known

This is the answer to “how do I sample uniformly when I can’t hold it all in memory?” — a 500 GB log, a Kafka topic of ten billion events, an S3 file too large to download. You stream it line by line and keep a uniform k-item reservoir throughout: building a balanced training subset on the fly, assigning A/B buckets as users arrive, or downsampling a stream for a dashboard — all the same trick.
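As a sketch of the log-sampling case, the variant below takes any iterable of lines, so it could be pointed at an open file handle or a stream consumer. The name sample_lines and the in-memory io.StringIO stand-in for a real log file are assumptions for the example.

```python
import io
import itertools
import random

def sample_lines(lines, k, rng=random):
    """Uniformly sample up to k lines from an iterable in one pass, using O(k) memory."""
    it = iter(lines)
    reservoir = list(itertools.islice(it, k))        # the first k lines fill the reservoir
    for seen, line in enumerate(it, start=k + 1):    # seen = 1-based position of this line
        j = rng.randrange(seen)                      # uniform over [0, seen)
        if j < k:
            reservoir[j] = line                      # accept with probability k/seen
    return reservoir

# Stand-in for a large file; in practice this might be open(path) instead.
fake_log = io.StringIO("".join(f"event {i}\n" for i in range(100_000)))
sample = sample_lines(fake_log, 5, random.Random(42))
print(len(sample))                                   # prints 5
```

If the stream has fewer than k lines, the function simply returns all of them.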

Practice

Quick check


Q1. Reservoir sampling with k=5, 20 items processed. The 21st arrives. Probability it enters the reservoir?

Q2. A size-10 reservoir is full; a new item is accepted. Which slot does it replace?

Q3. Why prefer reservoir sampling over 'collect everything, then sample' for huge streams?
