datarekha

Data Transformation: Normalization, Discretization, Sampling, Compression

Get raw data ready for analysis — rescale it, bucket it, sample it, shrink it. Four small prep moves that make every downstream model behave.

8 min read Intermediate GATE DA Lesson 74 of 122

What you'll learn

  • Min-max normalization rescales any column into [0, 1] with a fixed formula
  • Z-score standardization centers data at mean 0 with standard deviation 1
  • Discretization buckets numeric data into bins (equal-width or equal-frequency)
  • Sampling — random vs stratified, with vs without replacement
  • Lossless vs lossy compression — when each is acceptable

Before you start

You scraped a dataset. One column is “age” in years (18 to 90). Another is “income” in rupees (₹10000 to ₹2000000). A model that plugs these in raw will be dominated by income, not because income matters more, but because its numbers are bigger and nobody rescaled them.

That’s what data transformation fixes. Before any analysis we rescale, bucket, sample, or shrink the data so downstream tools see it on fair terms. Four small moves — let’s walk each.

Rescaling: min-max vs z-score

Two transformations cover almost every GATE question on rescaling.

Min-max normalization squashes every value into the range [0, 1]:

x' = (x − min) / (max − min)

The smallest value becomes 0, the largest becomes 1, everything else lands between them. Bounded and easy to read.
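A minimal Python sketch of the min-max formula, applied to five illustrative salaries. The function name `min_max_normalize` and the choice to map a constant column to zeros are assumptions made for this example, not part of the lesson.

```python
def min_max_normalize(values):
    """Rescale numbers into [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has max - min = 0, so we return zeros
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


salaries = [75000, 96000, 106000, 120000, 150000]
normalized = min_max_normalize(salaries)
print([round(v, 2) for v in normalized])  # [0.0, 0.28, 0.41, 0.6, 1.0]
```

The minimum maps to 0 and the maximum to 1, so a single extreme value in the column would compress every other value toward one end of the range.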

Z-score standardization centers the column at mean 0 and rescales so the standard deviation is 1:

z = (x − μ) / σ

For roughly bell-shaped data, nearly all values now lie in [−3, +3], but z-scores can be negative and are not capped. An extreme value can sit at z = 5 or beyond.
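The standardization formula can be sketched the same way. Here the helper `z_scores` is a hypothetical name; it accepts μ and σ when a question supplies them and otherwise falls back to the column's own mean and population standard deviation, which is an assumption of this sketch.

```python
import statistics


def z_scores(values, mu=None, sigma=None):
    """Standardize with z = (x - mu) / sigma."""
    if mu is None:
        mu = statistics.fmean(values)
    if sigma is None:
        # Assumption: population SD (divide by n), not the sample SD.
        sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


salaries = [75000, 96000, 106000, 120000, 150000]

# Using a mean and SD that are given rather than computed from the column.
given = z_scores(salaries, mu=96000, sigma=21000)
print([round(z, 2) for z in given])  # [-1.0, 0.0, 0.48, 1.14, 2.57]

# Using the column's own statistics: the result has mean 0 and SD 1.
own = z_scores(salaries)
print(round(statistics.fmean(own), 6), round(statistics.pstdev(own), 6))
```

Unlike min-max output, these values are not confined to a fixed interval: the largest salary already sits above z = 2.5.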

Raw column    Min-max → [0, 1]    Z-score (mean 0, SD 1)
75000         0.00                −1.00
96000         0.28                 0.00
106000        0.41                 0.48
120000        0.60                 1.14
150000        1.00                 2.57

min = 75000, max = 150000, μ = 96000, σ = 21000
Same five salaries, two rescalings. Min-max is bounded; z-score is centered but unbounded.

Discretization: turning numbers into bins

Sometimes you do not want a continuous number, you want a bucket — “young / middle / old” instead of an exact age. That’s discretization (binning).

  • Equal-width bins — split the range into bins of the same width. Ages [18, 90] into three bins: [18, 42), [42, 66), [66, 90]. Simple, but one bin may end up almost empty.
  • Equal-frequency bins — each bin holds the same count of rows. Sort the column, then cut at every n/k-th value. Bins differ in width but balance out the row counts.
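Both binning schemes can be sketched in a few lines of Python. The function names and the sample `ages` list are made up for illustration; the closed upper edge on the last equal-width bin follows the [66, 90] convention used above.

```python
def equal_width_edges(lo, hi, k):
    """Return the k + 1 boundaries of k bins of identical width."""
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def equal_width_bin(x, lo, hi, k):
    """Return the 0-based bin index of x; the maximum goes in the last bin."""
    width = (hi - lo) / k
    return min(int((x - lo) // width), k - 1)


def equal_frequency_bins(values, k):
    """Sort, then cut into k groups holding (almost) the same number of rows."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


print(equal_width_edges(18, 90, 3))  # [18.0, 42.0, 66.0, 90.0]
print(equal_width_bin(41, 18, 90, 3), equal_width_bin(42, 18, 90, 3),
      equal_width_bin(90, 18, 90, 3))  # 0 1 2

ages = [18, 19, 20, 21, 22, 25, 30, 45, 90]  # hypothetical, skewed toward young
print(equal_frequency_bins(ages, 3))
# [[18, 19, 20], [21, 22, 25], [30, 45, 90]]
```

With this skewed sample, equal-width binning would put seven of the nine ages in the first bin, while equal-frequency binning keeps three per bin at the cost of very uneven widths.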

Sampling: picking rows instead of using them all

When the dataset is too large to analyse whole, sample it:

  • Random sampling — every row has the same chance of being picked.
  • Stratified sampling — split the population into groups (strata) and sample from each in proportion to its size. Use this when a small subgroup would otherwise be missed.
  • With replacement — a picked row can be picked again (used by bootstrap, the resampling trick behind bagging and confidence intervals).
  • Without replacement — once picked, a row is out (the default for surveys).
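The four ideas above can be tried out with Python's standard `random` module. This is a sketch on a made-up table of 100 rows in which 10 belong to a "rare" group; `stratified_sample` is a hypothetical helper, and the rule of taking at least one row per stratum is an assumption of the example.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical data: rows 0-9 are "rare", rows 10-99 are "common".
rows = [{"id": i, "group": "rare" if i < 10 else "common"} for i in range(100)]

# Simple random sampling WITHOUT replacement: no row appears twice.
simple = rng.sample(rows, k=20)

# Sampling WITH replacement (a bootstrap resample): repeats are allowed.
bootstrap = rng.choices(rows, k=len(rows))


def stratified_sample(rows, key, fraction, rng):
    """Sample the same fraction from every stratum, without replacement."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    picked = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        picked.extend(rng.sample(members, k))
    return picked


strat = stratified_sample(rows, "group", 0.2, rng)
rare_count = sum(1 for r in strat if r["group"] == "rare")
print(len(strat), rare_count)  # 20 2
```

The stratified sample always contains 2 rare rows, whereas the simple random sample of 20 could by chance contain none.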

Compression: shrinking the bytes

Two flavours, one decision:

  • Lossless — every original bit is recoverable. Used for text, numeric tables, code (zip, gzip).
  • Lossy — drops detail to shrink harder. Used for images, audio, video (JPEG, MP3) where small reconstruction errors are invisible to humans.

Rule of thumb: tabular and textual data must be lossless; perceptual media can be lossy.
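To see the difference concretely, here is a small sketch using Python's built-in `zlib` module for the lossless case. The CSV text is invented for the example, and the rounding step is only a toy stand-in for lossy compression; it is not how JPEG or MP3 work internally.

```python
import zlib

# Lossless: compress a repetitive, made-up CSV and get every byte back.
table = ("id,age,income\n" + "1,34,550000\n" * 1000).encode("utf-8")
packed = zlib.compress(table)
restored = zlib.decompress(packed)

print(len(table), len(packed))  # the compressed size is far smaller
print(restored == table)        # True: nothing was lost

# Lossy (toy illustration): rounding keeps less detail per value,
# and the dropped digits cannot be recovered afterwards.
readings = [21.4372, 21.4391, 21.4405]
lossy = [round(r, 1) for r in readings]
print(lossy)  # [21.4, 21.4, 21.4]
```

After rounding, three distinct readings have become identical, which is acceptable for a sound wave or a pixel but not for a customer record.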

How GATE asks this

Almost always a NAT (numerical answer type) question: a column value, a min/max or mean/SD, and “compute the normalized value to 3 decimals.” Sometimes an MCQ asking which rescaling preserves which property, or an MSQ listing sampling methods or compression properties. Check which of μ, σ, min, and max the question gives; they tell you instantly which formula to apply.

Worked example — a real 2024 question

A person’s salary is ₹106000. The population has mean μ = ₹96000 and standard deviation σ = ₹21000. Find the z-score of this salary.

Plug straight into the standardization formula:

z = (x − μ) / σ
  = (106000 − 96000) / 21000
  = 10000 / 21000
  ≈ 0.476

So z ≈ 0.476. This is the real GATE DA 2024, Q17. The salary is about half a standard deviation above the mean — comfortably above average, but well inside the typical range.

Quick check

Q1. A column has min = 20 and max = 80. Min-max normalize the value 50. (3 decimals)
Q2. A column has mean μ = 50 and standard deviation σ = 10. The z-score of the value 35 is? (1 decimal)
Q3. Which statements about min-max vs z-score are TRUE? (select all that apply)
Q4. Which sampling methods can a GATE-style data-prep question reasonably name? (select all that apply)
Q5. Which is the right kind of compression for storing a relational table of customer records?
Q6. Three equal-width bins for ages in [18, 90] give bin boundaries at?
Q7. You are feeding pixel intensities into a neural network whose input layer expects values strictly within [0, 1], and the data has no extreme outliers. Which rescaling fits best?

Practice this in an interview

Should you normalize or denormalize tables in a data warehouse, and why?

Data warehouses favor denormalization — wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is negligible, so denormalized star schemas consistently outperform normalized models for analytical workloads.

What is Change Data Capture (CDC) and how is it implemented?

CDC continuously captures row-level inserts, updates, and deletes from a source database and streams them downstream — enabling near-real-time replication to a warehouse or data lake without full table scans. The most robust implementation reads the database's write-ahead log (WAL), making it low-impact on the source and capable of capturing deletes that polling-based approaches miss entirely.

What is the difference between ETL and ELT, and when should you choose each?

ETL transforms data before loading it into the destination, which was necessary when warehouses were expensive and compute-constrained. ELT loads raw data first and transforms inside the warehouse, leveraging cheap cloud compute and making raw data available for reprocessing. ELT is the default in modern cloud stacks; ETL still makes sense when you must mask sensitive fields before they ever land in the warehouse.

What is the difference between standardization and normalization, and which models require feature scaling?

Standardization rescales features to zero mean and unit variance; normalization squashes values into a fixed range, usually [0, 1]. Distance-based and gradient-based models are sensitive to scale and require one of these; tree-based models split on rank order and are scale-invariant.
