datarekha

Bloom Filters, HyperLogLog & MinHash-LSH

Trade a little accuracy for enormous space and time savings — the probabilistic data structures every data engineer should have in their toolkit.

9 min read Advanced Data Structures & Algorithms Lesson 32 of 32

What you'll learn

  • How Bloom filters guarantee no false negatives while keeping false-positive rate tunable
  • How HyperLogLog counts billions of distinct values in kilobytes of memory
  • How MinHash estimates Jaccard similarity without all-pairs comparison
  • Where these structures appear in production: dedup, cardinality analytics, and near-duplicate detection

Before you start

The data-engineer’s mantra: exactly correct costs a lot; probably correct often costs almost nothing.

Probabilistic data structures exploit a deliberate, bounded inaccuracy to achieve space and time savings that are otherwise impossible. You trade a configurable error rate for structures that fit in kilobytes instead of gigabytes, or finish in milliseconds instead of hours.

Three of these structures show up constantly in production systems: Bloom filters, HyperLogLog, and MinHash with LSH. Each solves a different problem; together they cover the core data-engineering questions of have I seen this before?, how many distinct things have I seen?, and which things are nearly the same?

Bloom filters: probabilistic set membership

A Bloom filter answers one question: is this item in the set? It can answer “definitely not” or “probably yes”. It never answers “definitely yes” — and that asymmetry is the insight.

The structure

A Bloom filter is a bit array of m bits (all zeros initially) plus k independent hash functions — deterministic functions that map any input to a numbered slot in the array — one slot per function. To add an item, run it through all k hash functions and set each resulting bit to 1. To query an item, run the same k hash functions and check those bits:

  • Any bit is 0 → the item was definitely never added (a 0 can only exist if no prior item set that position)
  • All bits are 1 → the item was probably added (but all those bits could have been set by other items)

That second case — all bits 1, but the item was never actually added — is a false positive (a wrong “yes”). No false negatives are possible by construction: a bit that was never set cannot spontaneously become 1, so a “definitely not” answer is always correct.

Note: a Bloom filter is not error-free — it makes false-positive mistakes at a rate you tune via m and k.

Sizing a Bloom filter

The false-positive rate p is determined by three parameters:

  • m — number of bits
  • n — number of items inserted
  • k — number of hash functions

The approximation is:

p ≈ (1 - e^(-k·n/m))^k

The optimal number of hash functions for a given m and n is k = (m/n) · ln(2). In practice, you pick a target false-positive rate (say 1%), solve for m, and derive k. For 1 million items at 1% FPR you need about 9.6 million bits (~1.2 MB) — versus many gigabytes for an exact hash set of the same items.

What Bloom filters cannot do

They have no delete operation (setting a bit back to 0 would affect other items). Counting Bloom filters address this at the cost of extra space. They also cannot enumerate their members or return the items they contain — they only answer membership queries.

HyperLogLog: count distinct in kilobytes

How many distinct IP addresses hit your server today? How many unique users ran a query? Counting distinct values exactly requires storing every distinct value — at scale, that is gigabytes of memory and a full sort or hash-set pass.

HyperLogLog (HLL) estimates cardinality (the count-distinct) with a typical error of 1–2% using a few kilobytes of memory, regardless of how many items you have seen.

The core trick

HLL hashes each item to a uniform bit string and counts the length of the leading run of zeros before the first 1-bit. A run of length k occurs with probability 1/2^k, so if you have ever observed a run of length k, you have likely seen roughly 2^k distinct items before it appeared by chance. By maintaining a small array of “maximum leading zeros seen” buckets, HLL averages out the noise across many independent sub-estimators.

The result: a structure that can tell you “there are approximately 4.3 billion distinct values” from a 12 KB register array, with an error of around 1.4%.

HLL in practice

Redis has PFADD / PFCOUNT built in. Spark’s approx_count_distinct uses HLL under the hood. BigQuery’s APPROX_COUNT_DISTINCT does the same. Any time you see “approximate distinct count” in a data platform, it is almost certainly HLL.

# Spark: exact vs approximate distinct count
df.select(
    countDistinct("user_id").alias("exact"),         # full shuffle + hash
    approx_count_distinct("user_id", 0.02).alias("approx_hll"),  # HLL, 2% error
).show()

For a dataset of 100 million rows, countDistinct does a full shuffle and sort; approx_count_distinct stays in each partition’s memory and merges tiny HLL sketches — orders of magnitude faster with acceptable accuracy for analytics.

MinHash + LSH: find near-duplicates without all-pairs comparison

Given a corpus of millions of documents, which ones are nearly identical? The naive approach compares every pair — that is O(n²) and simply infeasible for large n. MinHash combined with Locality-Sensitive Hashing (LSH) finds near-duplicate candidates in near-linear time.

Jaccard similarity

The similarity metric here is Jaccard: for two sets A and B (think: sets of shingles, or n-grams, extracted from documents):

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

Two near-identical documents share most of their shingles → Jaccard close to 1. Two unrelated documents share almost none → Jaccard close to 0.

MinHash: compress a set into a signature

A MinHash signature compresses a large set into a short vector of integers. For each of t hash functions, record the minimum hash value over all elements of the set. The probability that two sets have the same minimum hash under any given function equals their Jaccard similarity:

P(minhash_i(A) == minhash_i(B)) = Jaccard(A, B)

So the fraction of matching signature positions estimates Jaccard directly. A signature of 200 integers (800 bytes) represents a document of millions of shingles.

LSH: turn signatures into candidate pairs

With MinHash signatures in hand, you still have O(n²) pairs to compare. LSH solves this by banding the signature matrix. Divide the 200-hash signature into 20 bands of 10 hashes each. Hash each band into a bucket. Two documents land in the same bucket for at least one band only if their signatures are locally similar — and the probability of a match rises sharply above a threshold Jaccard value.

The result: instead of comparing n² pairs, you only compare pairs that share at least one band bucket — a dramatically smaller set of candidates.

Running a tiny Bloom filter in Python

Comparing the three structures

StructureQuestion answeredGuaranteeTypical errorMemory
Bloom filterIs item X in the set?No false negativesTunable FP rate~10 bits / item at 1% FPR
HyperLogLogHow many distinct items?Approximate~1–2%~12 KB regardless of n
MinHash-LSHWhich items are near-duplicates?Approximate pairsTunable thresholdSignature size × n

Choosing between them

Use a Bloom filter when you need fast membership checks and can tolerate rare false positives — caches, dedup pipelines, prefilters before expensive lookups. Use HyperLogLog when you need count-distinct over a stream or distributed dataset and exact counts would require an unbounded hash set. Use MinHash-LSH when you have a large corpus and need to identify near-duplicate pairs — document dedup, dataset cleaning, RAG retrieval deduplication.

All three share a design philosophy: commit to a bounded, tunable error budget, and in exchange get structures that scale to billions of items with fixed or near-fixed memory.

Quick check

Quick check

0/4
Q1A Bloom filter reports that an item is 'possibly present'. Which of the following must be true?
Q2You are building an analytics pipeline that needs to count distinct user IDs over a 10-billion-row event log. Exact counts require a 40 GB hash set. What is the best alternative?
Q3MinHash-LSH speeds up near-duplicate detection by avoiding all-pairs comparison. How does the LSH (banding) step achieve this?
Q4A web crawler uses a Bloom filter to avoid re-crawling URLs. After weeks of operation it starts re-crawling some pages it already visited. No code changed. What is the most likely cause?

Practice this in an interview

All questions
What is the difference between batch and streaming data pipelines, and how do you choose between them?

Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.

Compare Parquet, CSV, and Avro as big-data file formats — when do you use each?

Parquet is a columnar, compressed format optimized for analytical reads — only the queried columns are scanned. Avro is row-oriented, schema-embedded, and optimized for write-heavy pipelines and Kafka serialization. CSV is human-readable but schema-less, uncompressed, and slow at scale — use it only at system boundaries where a downstream tool requires it.

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.

What are the differences between batch, online, and streaming inference, and when should you use each?

Batch inference runs predictions on large datasets on a schedule, optimizing for throughput. Online inference serves individual requests in real time, optimizing for low latency. Streaming inference processes continuous event streams with bounded latency requirements between the two extremes.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content