What is the difference between batch and streaming data pipelines, and how do you choose between them?

Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.

Compare Parquet, CSV, and Avro as big-data file formats — when do you use each?

Parquet is a columnar, compressed format optimized for analytical reads — only the queried columns are scanned. Avro is row-oriented, schema-embedded, and optimized for write-heavy pipelines and Kafka serialization. CSV is human-readable but schema-less, uncompressed, and slow at scale — use it only at system boundaries where a downstream tool requires it.

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend by as much as 60–90% on favorable workloads; verify on your own traffic that quality does not regress.

What are the differences between batch, online, and streaming inference, and when should you use each?

Batch inference runs predictions on large datasets on a schedule, optimizing for throughput. Online inference serves individual requests in real time, optimizing for low latency. Streaming inference processes continuous event streams with bounded latency requirements between the two extremes.

Bloom Filters, HyperLogLog & MinHash-LSH — DSA

What you'll learn

How a Bloom filter guarantees no false negatives while keeping its false-positive rate tunable

How HyperLogLog counts billions of distinct values in a few kilobytes

How MinHash estimates set similarity without comparing every pair

Where these run in production: dedup, cardinality analytics, near-duplicate detection

The data engineer’s quiet motto: exactly correct costs a lot; probably correct often costs almost nothing.

Probabilistic data structures spend a small, bounded, tunable amount of inaccuracy to buy savings that are otherwise impossible — structures that fit in kilobytes instead of gigabytes, or finish in milliseconds instead of hours. Three of them turn up everywhere, and together they answer the three core questions of scale-out data work: have I seen this before?, how many distinct things have I seen?, and which things are nearly the same?

Bloom filters: probably-in, definitely-out

A Bloom filter answers one question — is this item in the set? — but only in two of the three possible ways. It can say “definitely not” or “probably yes”. It can never say “definitely yes”, and that asymmetry is the whole trick.

The structure is a bit array of m zeros plus k independent hash functions. To add an item, hash it k ways and set each of those bits to 1. To query, hash it the same k ways and read those bits: if any is 0, the item was certainly never added (a 0 can only survive if nothing set it); if all are 1, the item is probably present — though those bits might have been set by other items, which is a false positive.

“cat” sets three bits. Querying “dog” hashes to its own three bits; if even one is still 0, “dog” was definitely never added.

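The false-positive rate is tunable through m, n (items inserted), and k: after n insertions it is approximately (1 − e^(−kn/m))^k, with the optimum at k = (m/n)·ln 2. The payoff is dramatic: a Bloom filter for one million items at a 1% false-positive rate needs about 9.6 million bits, roughly 1.2 MB, whatever the size of the keys. An exact hash set has to store the keys themselves, which for a million URLs typically runs to tens or hundreds of megabytes, and the gap widens to gigabytes as the key count grows into the billions. The cost is that a Bloom filter cannot delete, cannot list its members, and only ever answers membership. The structure is small enough to keep in code as a sketch: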

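import math

class BloomFilter:
    def __init__(self, capacity, fp_rate):
        # m = -n * ln(p) / (ln 2)^2 bits, k = (m/n) * ln 2 hash functions
        self.m = math.ceil(-capacity * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, round((self.m / capacity) * math.log(2)))
        # One byte per bit keeps the sketch readable; a real filter packs 8 bits per byte
        self.bits = bytearray(self.m)

    def _positions(self, item):
        # k seeded positions. Python's hash() is salted per process for strings, so this
        # only works within one process; a real filter uses k stable hash functions.
        return [hash((seed, item)) % self.m for seed in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # Any 0 bit means "definitely absent"; all 1s means "probably present"
        return all(self.bits[p] for p in self._positions(item))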

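Whatever you add will always test as present: there are no false negatives, by construction. Only a never-added item can ever wrongly test present, and only at the rate you chose, provided you insert no more items than the capacity the filter was sized for. Past that point the false-positive rate climbs.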
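The sizing arithmetic can be checked directly. The sketch below is illustrative only: the helper names bloom_size and false_positive_rate are hypothetical, and the false-positive formula is the usual approximation (1 − e^(−kn/m))^k, which assumes well-mixed hash functions.

```python
import math

def bloom_size(capacity, fp_rate):
    """Bits m and hash count k for a target false-positive rate."""
    m = math.ceil(-capacity * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / capacity) * math.log(2)))
    return m, k

def false_positive_rate(m, k, inserted):
    """Approximate false-positive rate after `inserted` items."""
    return (1 - math.exp(-k * inserted / m)) ** k

# One million items at a 1% target
m, k = bloom_size(1_000_000, 0.01)
megabytes = m / 8 / 1_000_000

# The rate holds at the sized capacity and degrades once the filter is overfilled
at_capacity = false_positive_rate(m, k, 1_000_000)
overfilled = false_positive_rate(m, k, 2_000_000)

print(f"bits={m} hashes={k} size={megabytes:.2f} MB")
print(f"fp at capacity={at_capacity:.4f} fp at 2x capacity={overfilled:.4f}")
```

Under these assumptions the filter comes out at about 9.6 million bits with seven hash functions, and inserting twice the planned number of items pushes the false-positive rate from about 1% to roughly 16%. That is why a long-lived filter needs its capacity sized for the total it will ever hold.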

HyperLogLog: counting distinct in kilobytes

How many distinct IPs hit the server today? Counting distinct values exactly means remembering every distinct value, which takes gigabytes at scale. HyperLogLog estimates that count to within about 1-2% using a few kilobytes, no matter how many items it sees. The trick: hash each item to a uniform bit string and note the length of its leading run of zeros. A run of at least k zeros happens with probability 1/2ᵏ, so seeing a run of length k hints that you have processed on the order of 2ᵏ distinct items. A single observation like that is very noisy, so HyperLogLog uses the first few hash bits to pick one of m registers, keeps the longest run seen in each, and combines the registers with a harmonic mean; the standard error is about 1.04/√m. With 16,384 registers packed into 12 KB (the layout Redis uses), one sketch can report “about 4.3 billion distinct values” with roughly 0.81% standard error. This is what PFCOUNT in Redis, approx_count_distinct in Spark, and APPROX_COUNT_DISTINCT in BigQuery run underneath, as HyperLogLog or a close variant of it.
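A minimal sketch of the idea follows, not the implementation any of those systems ship. The class name, the choice of BLAKE2b as the hash, and the default of 2¹² registers are assumptions made for illustration. The bias constant and the small-range correction follow the standard HyperLogLog formulation, and production variants such as HyperLogLog++ add further corrections.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=12):
        # p index bits give m = 2^p registers; the bias constant below needs m >= 128
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")            # uniform 64-bit hash
        idx = h >> (64 - self.p)                     # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = leading zeros in `rest` plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        # Harmonic-mean style combination of the per-register signals
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog(p=12)
for _ in range(2):                  # second pass re-adds the same items
    for i in range(100_000):
        hll.add(f"user-{i}")
estimate = hll.count()
print(round(estimate))
```

Adding every item twice does not move the estimate, because a register only ever stores the maximum rank it has seen. With 4,096 registers the expected standard error is about 1.04/√4096, or 1.6%.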

MinHash + LSH: near-duplicates without all-pairs

Which of a million documents are nearly identical? Comparing every pair is O(n²), which is hopeless at that size. MinHash compresses each document’s set of shingles into a short signature: for each of t hash functions, keep the minimum hash over the set. The lovely fact is that the probability two signatures agree in a position equals the documents’ Jaccard similarity (shingles the two share, divided by the distinct shingles in either). The fraction of matching positions therefore estimates similarity directly, and 200 integers stand in for millions of shingles. LSH then avoids the O(n²) pair comparison by banding: split each 200-hash signature into, say, 20 bands of 10, and hash each band to a bucket. Two documents become candidate pairs only if they collide in at least one band. With b bands of r rows, a pair with similarity s collides with probability 1 − (1 − sʳ)ᵇ, an S-curve that rises sharply around (1/b)^(1/r), about 0.74 for 20 bands of 10. You compare a small candidate set instead of every pair.
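The two stages can be sketched in a few dozen lines. Everything here is illustrative: the function names, the character 5-gram shingles, and the sample documents are assumptions. The t hash functions are simulated with random linear maps modulo a prime, a common approximation of min-wise independent hashing rather than the only option.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus

def shingles(text, size=5):
    """Set of overlapping character n-grams."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}

def base_hash(shingle):
    """Stable hash of one shingle, reduced modulo PRIME."""
    digest = hashlib.blake2b(shingle.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % PRIME

def make_hash_params(t, seed=42):
    """t random (a, b) pairs defining h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(t)]

def signature(shingle_set, params):
    """MinHash signature: the minimum of each hash function over the set."""
    hashed = [base_hash(s) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in hashed) for a, b in params]

def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def candidate_pairs(signatures, bands=20, rows=10):
    """Pairs whose signatures agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates

def collision_probability(s, bands=20, rows=10):
    """Chance that a pair with similarity s collides in at least one band."""
    return 1 - (1 - s ** rows) ** bands

docs = {
    "a": "the quick brown fox jumps over the lazy dog while the data pipeline "
         "deduplicates billions of crawled pages every single night",
    "b": "the quick brown fox jumps over the lazy dog while the data pipeline "
         "deduplicates billions of crawled pages every single day",
    "c": "an unrelated note about sizing register arrays for cardinality "
         "estimates in a warehouse",
}
params = make_hash_params(200)
sets = {name: shingles(text) for name, text in docs.items()}
sigs = {name: signature(s, params) for name, s in sets.items()}

true_ab = len(sets["a"] & sets["b"]) / len(sets["a"] | sets["b"])
est_ab = estimate_similarity(sigs["a"], sigs["b"])
pairs = candidate_pairs(sigs)
print(f"true={true_ab:.3f} estimated={est_ab:.3f} candidates={sorted(pairs)}")
```

Documents a and b differ only in their last word, so they share almost all shingles and land in the same bucket for at least one band, while c never collides with either. Only the candidate pairs then need an exact comparison.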

Structure	Answers	Guarantee	Typical error	Memory
Bloom filter	Is X in the set?	No false negatives	Tunable FP rate	~10 bits/item at 1%
HyperLogLog	How many distinct?	Approximate	~1% (0.81% standard error in Redis)	~12 KB regardless of n
MinHash-LSH	Which are near-duplicates?	Approximate	Tunable threshold	One signature per document (e.g. 200 integers)

Practice

Quick check

Q1. A Bloom filter reports an item is 'possibly present'. What must be true?

Q2. You must count distinct user IDs over a 10-billion-row log; an exact set would be 40 GB. Best alternative?

Q3. A crawler uses a Bloom filter to skip visited URLs, and after weeks starts re-crawling some it already saw. No code changed. Likeliest cause?
