Data Transformation: Normalization, Discretization, Sampling, Compression
Get raw data ready for analysis — rescale it, bucket it, sample it, shrink it. Four small prep moves that make every downstream model behave.
What you'll learn
- Min-max normalization rescales any column into [0, 1] with a fixed formula
- Z-score standardization centers data at mean 0 with standard deviation 1
- Discretization buckets numeric data into bins (equal-width or equal-frequency)
- Sampling — random vs stratified, with vs without replacement
- Lossless vs lossy compression — when each is acceptable
Before you start
You scraped a dataset. One column is “age” in years (18 to 90). Another is “income” in rupees (₹10000 to ₹2000000). A scale-sensitive model (distance-based or gradient-based) that plugs these in raw will be dominated by income: not because income matters more, but because its numbers are bigger and nobody rescaled them.
That’s what data transformation fixes. Before any analysis we rescale, bucket, sample, or shrink the data so downstream tools see it on fair terms. Four small moves — let’s walk each.
Rescaling: min-max vs z-score
Two transformations cover almost every GATE question on rescaling.
Min-max normalization squashes every value into the range [0, 1]:
x' = (x − min) / (max − min)
The smallest value becomes 0, the largest becomes 1, everything else lands between them. Bounded and easy to read.
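As a minimal sketch of the formula above in plain Python: the function name `min_max_normalize` and the sample ages are illustrative assumptions, and mapping a constant column to zeros is one possible convention for the undefined 0/0 case, not a standard.

```python
def min_max_normalize(values):
    """Rescale numbers into [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has max - min = 0, so the formula
        # is undefined; this sketch maps every value to 0.0 in that case.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


ages = [18, 36, 54, 90]  # illustrative values inside the 18-to-90 range
scaled_ages = min_max_normalize(ages)
print(scaled_ages)  # [0.0, 0.25, 0.5, 1.0]
```

The minimum (18) lands on 0 and the maximum (90) on 1, exactly as the formula promises.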
Z-score standardization centers the column at mean 0 and rescales so the standard deviation is 1:
z = (x − μ) / σ
For roughly normal data, about 99.7% of values now lie in [−3, +3], but z-scores can be negative and they are not capped. A wildly extreme value can sit at z = 5 or beyond.
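A small sketch of z-score standardization using only the standard library. It assumes the population standard deviation (`statistics.pstdev`), which matches the μ and σ notation above; the function name `z_scores` and both sample columns are made up for illustration.

```python
import statistics


def z_scores(values):
    """Standardize numbers with z = (x - mu) / sigma (population sigma)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population SD, not sample SD
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x - mu) / sigma for x in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population SD 2
standardized = z_scores(data)
print(standardized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

# Z-scores are not capped: one extreme value among twenty identical ones
# ends up well beyond +3.
with_outlier = [10] * 20 + [100]
outlier_z = z_scores(with_outlier)[-1]
print(round(outlier_z, 2))  # 4.47
```

The first column shows negative and positive z-scores around a mean of 0; the second shows a single outlier landing past the [−3, +3] band.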
Discretization: turning numbers into bins
Sometimes you do not want a continuous number, you want a bucket — “young / middle / old” instead of an exact age. That’s discretization (binning).
- Equal-width bins: split the range into bins of the same width. Ages [18, 90] into three bins: [18, 42), [42, 66), [66, 90]. Simple, but one bin may end up almost empty.
- Equal-frequency bins: each bin holds the same count of rows. Sort the column, then cut at every n/k-th value (n rows, k bins). Bins differ in width but balance out the row counts.
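The two binning schemes can be sketched side by side. This is an illustrative version only: the function names, the rank-based way of cutting equal-frequency bins, and the sample ages are assumptions, and ties or row counts not divisible by k would need a stated policy in real use.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using bins of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]  # constant column: everything in bin 0
    # The maximum sits on the final edge, so clamp it into the last bin.
    return [min(int((x - lo) / width), k - 1) for x in values]


def equal_frequency_bins(values, k):
    """Assign bin indices so each bin holds (about) the same number of rows."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # row indices by value
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n  # cut the sorted column at every n/k-th row
    return bins


ages = [18, 19, 20, 21, 22, 23, 40, 65, 90]  # skewed toward young ages
width_bins = equal_width_bins(ages, 3)
freq_bins = equal_frequency_bins(ages, 3)
print(width_bins)  # [0, 0, 0, 0, 0, 0, 0, 1, 2]
print(freq_bins)   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With skewed ages, equal-width bins [18, 42), [42, 66), [66, 90] put seven rows in the first bin and one in each of the others, while equal-frequency bins hold three rows each.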
Sampling: picking rows instead of using them all
When the dataset is too large to analyse whole, sample it:
- Random sampling — every row has the same chance of being picked.
- Stratified sampling — split the population into groups (strata) and sample from each in proportion. Use this when a small subgroup would otherwise be missed.
- With replacement — a picked row can be picked again (used by bootstrap, the resampling trick behind bagging and confidence intervals).
- Without replacement — once picked, a row is out (the default for surveys).
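The four ideas above map onto a few standard-library calls. This is a rough sketch: the `stratified_sample` helper, the urban/rural rows, and the "at least one row per stratum" rule are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, rng):
    """Sample the same fraction from every stratum, without replacement."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        # Assumption: keep at least one row per stratum so a small
        # subgroup is never missed entirely.
        n = max(1, round(fraction * len(group)))
        sample.extend(rng.sample(group, n))
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = [("urban", i) for i in range(90)] + [("rural", i) for i in range(10)]

# Random sampling without replacement: no row can appear twice.
without_replacement = rng.sample(rows, 10)

# Random sampling with replacement: the same row may be drawn again,
# which is what a bootstrap resample does.
with_replacement = rng.choices(rows, k=10)

# Stratified sampling: 10% from each group, so rural rows are guaranteed.
stratified = stratified_sample(rows, key=lambda r: r[0], fraction=0.1, rng=rng)
rural_count = sum(1 for label, _ in stratified if label == "rural")
print(len(stratified), rural_count)  # 10 1
```

A plain random sample of 10 rows could easily contain no rural rows at all; the stratified version always keeps the 90:10 proportion.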
Compression: shrinking the bytes
Two flavours, one decision:
- Lossless — every original bit is recoverable. Used for text, numeric tables, code (zip, gzip).
- Lossy — drops detail to shrink harder. Used for images, audio, video (JPEG, MP3) where small reconstruction errors are invisible to humans.
Rule of thumb: tabular and textual data must be lossless; perceptual media can be lossy.
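To make the distinction concrete, here is a small sketch. The lossless half uses Python's `zlib` module (the DEFLATE algorithm behind zip and gzip); the lossy half is a toy quantizer standing in for what codecs like JPEG or MP3 do far more cleverly. The sample table and signal values are invented for illustration.

```python
import zlib

# Lossless: a repetitive CSV-like table compresses well and round-trips
# back to exactly the same bytes.
text = ("age,income\n" + "25,50000\n" * 1000).encode("utf-8")
packed = zlib.compress(text)
restored = zlib.decompress(packed)
print(restored == text)          # True
print(len(packed) < len(text))   # True

# Lossy (toy example): keep only multiples of 0.25. Fewer distinct values
# means fewer bits are needed, but the originals cannot be recovered.
signal = [0.12, 0.57, 0.93, 0.31]
quantized = [round(s * 4) / 4 for s in signal]
print(quantized)  # [0.0, 0.5, 1.0, 0.25]

max_error = max(abs(s - q) for s, q in zip(signal, quantized))
print(max_error <= 0.125)  # True: the error is bounded, but not zero
```

The lossless path returns every original byte; the lossy path trades a small, bounded reconstruction error for a coarser representation, which is only acceptable when that error does not matter.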
How GATE asks this
Almost always a NAT: a column value, a min/max or mean/SD, and “compute the
normalized value to 3 decimals.” Sometimes an MCQ asking which rescaling
preserves which property, or an MSQ listing sampling methods or compression
properties. Read the question for μ, σ, min, max — they tell you
instantly which formula to apply.
Worked example — a real 2024 question
A person’s salary is ₹106000. The population has mean μ = ₹96000 and standard deviation σ = ₹21000. Find the z-score of this salary.
Plug straight into the standardization formula:
z = (x − μ) / σ
= (106000 − 96000) / 21000
= 10000 / 21000
≈ 0.476
So z ≈ 0.476. This is the real GATE DA 2024, Q17. The salary is about half a standard deviation above the mean — comfortably above average, but well inside the typical range.
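The arithmetic can be double-checked in a couple of lines; the variable names here are just for illustration.

```python
x, mu, sigma = 106000, 96000, 21000  # salary, population mean, population SD

z = (x - mu) / sigma
print(round(z, 3))  # 0.476
```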
Practice this in an interview
Data warehouses favor denormalization: wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is negligible, so denormalized star schemas consistently outperform normalized models for analytical workloads.
CDC continuously captures row-level inserts, updates, and deletes from a source database and streams them downstream — enabling near-real-time replication to a warehouse or data lake without full table scans. The most robust implementation reads the database's write-ahead log (WAL), making it low-impact on the source and capable of capturing deletes that polling-based approaches miss entirely.
ETL transforms data before loading it into the destination, which was necessary when warehouses were expensive and compute-constrained. ELT loads raw data first and transforms inside the warehouse, leveraging cheap cloud compute and making raw data available for reprocessing. ELT is the default in modern cloud stacks; ETL still makes sense when you must mask sensitive fields before they ever land in the warehouse.
Standardization rescales features to zero mean and unit variance; normalization squashes values into a fixed range, usually [0, 1]. Distance-based and gradient-based models are sensitive to scale and require one of these; tree-based models split on rank order and are scale-invariant.