Should you normalize or denormalize tables in a data warehouse, and why?

Data warehouses generally favor denormalization: wide tables, or star schemas built around a central fact table, that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is usually small, so denormalized star schemas tend to outperform fully normalized models for analytical workloads.
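To make the trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a warehouse. The table and column names (category, sale, sale_flat) are invented for illustration. Both layouts return the same totals; the denormalized one simply answers without a join, at the price of repeating the category name on every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the category name lives in its own table.
    CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, category_id INTEGER, amount REAL);
    INSERT INTO category VALUES (1, 'books'), (2, 'toys');
    INSERT INTO sale VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);

    -- Denormalized: the category name is repeated on every sale row.
    CREATE TABLE sale_flat (sale_id INTEGER PRIMARY KEY, category_name TEXT, amount REAL);
    INSERT INTO sale_flat VALUES (1, 'books', 10.0), (2, 'books', 15.0), (3, 'toys', 7.5);
""")

# Normalized layout: one join hop to reach the category name.
normalized_totals = conn.execute("""
    SELECT c.name, SUM(s.amount)
    FROM sale AS s JOIN category AS c ON c.category_id = s.category_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()

# Denormalized layout: the same answer from a single table.
flat_totals = conn.execute("""
    SELECT category_name, SUM(amount)
    FROM sale_flat GROUP BY category_name ORDER BY category_name
""").fetchall()

print(normalized_totals)  # [('books', 25.0), ('toys', 7.5)]
print(flat_totals)        # [('books', 25.0), ('toys', 7.5)]
```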

What is Change Data Capture (CDC) and how is it implemented?

CDC continuously captures row-level inserts, updates, and deletes from a source database and streams them downstream — enabling near-real-time replication to a warehouse or data lake without full table scans. The most robust implementation reads the database's write-ahead log (WAL), making it low-impact on the source and capable of capturing deletes that polling-based approaches miss entirely.
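The downstream half of CDC can be sketched in a few lines. The event shape used here (a dict with "op", "key" and "row") is a hypothetical format chosen for illustration, not the output of any particular CDC tool. The point is that replaying the events in log order reproduces the source table, deletes included.

```python
def apply_changes(replica, events):
    """Apply row-level change events, in log order, to a dict keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            replica[key] = event["row"]      # upsert the latest row image
        elif op == "delete":
            replica.pop(key, None)           # a delete is an explicit event
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return replica


# Hypothetical change events, in the order they were written to the log.
events = [
    {"op": "insert", "key": 1, "row": {"name": "Asha", "city": "Pune"}},
    {"op": "insert", "key": 2, "row": {"name": "Ravi", "city": "Delhi"}},
    {"op": "update", "key": 1, "row": {"name": "Asha", "city": "Mumbai"}},
    {"op": "delete", "key": 2},
]

replica = apply_changes({}, events)
print(replica)  # {1: {'name': 'Asha', 'city': 'Mumbai'}}
```

A polling job that only compares snapshots of the source table would never see row 2 at all if it was inserted and deleted between two polls.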

What is the difference between ETL and ELT, and when should you choose each?

ETL transforms data before loading it into the destination, which was necessary when warehouses were expensive and compute-constrained. ELT loads raw data first and transforms inside the warehouse, leveraging cheap cloud compute and making raw data available for reprocessing. ELT is the default in modern cloud stacks; ETL still makes sense when you must mask sensitive fields before they ever land in the warehouse.
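The difference is only the order of the steps, which a toy sketch can show. Everything here is an assumption made for illustration: the warehouse is a plain dict, the rows are invented, and mask_email is a hypothetical helper (a truncated SHA-256 digest, not a vetted anonymization scheme).

```python
import hashlib


def mask_email(email):
    """Replace an email address with a short digest (illustrative only)."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()[:12]


def transform(rows):
    """Mask the sensitive field and keep the analytical one."""
    return [{"user": mask_email(r["email"]), "amount": r["amount"]} for r in rows]


source_rows = [
    {"email": "a@example.com", "amount": 120},
    {"email": "b@example.com", "amount": 80},
]

# ETL: transform first, so only masked rows ever reach the warehouse.
etl_warehouse = {"sales": transform(source_rows)}

# ELT: load the raw rows first, then transform inside the warehouse.
elt_warehouse = {"raw_sales": list(source_rows)}
elt_warehouse["sales"] = transform(elt_warehouse["raw_sales"])

print(sorted(etl_warehouse))  # ['sales']
print(sorted(elt_warehouse))  # ['raw_sales', 'sales']
```

Both paths end with the same "sales" table. ELT also keeps "raw_sales" for reprocessing, which is exactly what you do not want when the raw rows contain fields that must never land in the warehouse.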

What is the difference between standardization and normalization, and which models require feature scaling?

Standardization rescales features to zero mean and unit variance; normalization squashes values into a fixed range, usually [0, 1]. Distance-based and gradient-based models are sensitive to scale and require one of these; tree-based models split on rank order and are scale-invariant.

Data Transformation: Normalization, Discretization, Sampling, Compression — GATE DA

What you'll learn

Min-max normalization rescales any column into [0, 1] with a fixed formula

Z-score standardization centers data at mean 0 with standard deviation 1

Discretization buckets numeric data into bins (equal-width or equal-frequency)

Sampling — random vs stratified, with vs without replacement

Lossless vs lossy compression — when each is acceptable

Last lesson left us with messy data pulled in from many sources and a promise: before any of it can be analysed together, it has to be cleaned and reshaped. So picture the result of that pull. One column is “age” in years, running 18 to 90. The next is “income” in rupees, running ₹10,000 to ₹20,00,000. Feed both into a model raw, and income will swamp the result — not because income matters more, but because its numbers are simply bigger and nobody put the two columns on the same ruler.

Putting columns on the same ruler is the first of four small prep moves, and together they are what turns raw records into analysis-ready ones. We rescale a column so its size stops lying, bucket a number into a category when that is what we actually want, sample the rows when there are too many to use whole, and shrink the bytes when storage bites. Four moves, and GATE expects you to perform the arithmetic of each by hand. Let us walk them one at a time.

Rescaling: min-max vs z-score

Two rescalings cover almost every GATE question, and the whole trick is reading which one a problem wants.

Min-max normalization squashes every value into the range [0, 1]:

x' = (x − min) / (max − min)

The smallest value becomes 0, the largest becomes 1, and everything else lands in between. Bounded, and easy to read at a glance.
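A minimal sketch of the formula in Python, on five made-up ages. The function name min_max and the guard for a constant column are additions for illustration; the formula itself is the one above.

```python
def min_max(values):
    """Rescale values into [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value is identical, so max - min is 0 and the formula is undefined.
        raise ValueError("min-max is undefined when all values are equal")
    return [(x - lo) / (hi - lo) for x in values]


ages = [18, 36, 54, 72, 90]
scaled = min_max(ages)
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```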

Z-score standardization instead centers the column at mean 0 and rescales so its standard deviation is 1:

z = (x − μ) / σ

For roughly bell-shaped data, most values now sit in [−3, +3], but a z-score can be negative, and it is not capped. A wildly extreme value can land at z = 5 or beyond, whereas a min-max value never leaves [0, 1].

Same five salaries, two rescalings. Min-max is bounded; z-score is centered but unbounded.
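The z-score formula can be sketched the same way with Python's statistics module. The eight scores are invented, and the choice of the population standard deviation (pstdev) is an assumption; GATE questions normally hand you σ directly, so the question does not arise there.

```python
import statistics


def z_scores(values):
    """Standardize values using z = (x - mu) / sigma."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population SD; an assumption for this sketch
    if sigma == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population SD 2
z = z_scores(scores)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Values below the mean come out negative, the mean itself maps to 0, and nothing pins the largest value to 1.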

Discretization: turning numbers into bins

Sometimes you do not want a continuous number at all — you want a bucket, “young / middle / old” instead of an exact age. Turning a number into a bucket is discretization, or binning, and there are two honest ways to cut.

Equal-width bins — split the range into bins of the same width. Ages [18, 90] into three bins gives [18, 42), [42, 66), [66, 90]. Simple, but one bin may end up nearly empty.
Equal-frequency bins — make each bin hold the same count of rows. Sort the column, then cut at every n/k-th value. The bins differ in width, but the row counts come out balanced.
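Both cuts can be sketched in plain Python. The function names and the nine sample ages are made up for illustration; the equal-width edges match the [18, 42), [42, 66), [66, 90] example above.

```python
def equal_width_bin(x, lo, hi, k):
    """Index (0 to k-1) of the equal-width bin holding x; the last bin includes hi."""
    width = (hi - lo) / k
    return min(int((x - lo) // width), k - 1)


def equal_frequency_bins(values, k):
    """Sort, then cut into k consecutive chunks of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


ages = [18, 19, 20, 21, 22, 23, 40, 65, 90]

# Equal width over [18, 90] with 3 bins: edges at 18, 42, 66, 90.
width_counts = [0, 0, 0]
for age in ages:
    width_counts[equal_width_bin(age, 18, 90, 3)] += 1
print(width_counts)  # [7, 1, 1]

# Equal frequency: three rows per bin, but very different widths.
freq_bins = equal_frequency_bins(ages, 3)
print(freq_bins)  # [[18, 19, 20], [21, 22, 23], [40, 65, 90]]
```

On this skewed column, equal width leaves two bins nearly empty, while equal frequency balances the counts and lets the last bin stretch from 40 to 90.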

Sampling: picking rows instead of using them all

When the dataset is too large to analyse whole, sample it — work with a representative handful instead of the lot.

Random sampling — every row has the same chance of being picked.
Stratified sampling — split the population into groups (strata) and sample from each in proportion. Reach for this when a small subgroup would otherwise be missed entirely.
With replacement — a picked row can be picked again. This is what bootstrap does, the resampling trick behind bagging and confidence intervals.
Without replacement — once picked, a row is out. This is the default for surveys.
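The four ideas map onto a few calls in Python's random module. This is a sketch on invented data (90 "north" rows and 10 "south" rows); stratified_sample is a hypothetical helper written for this example, not a library function.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the run is repeatable
rows = [("north", i) for i in range(90)] + [("south", i) for i in range(10)]

# Simple random sampling without replacement: no row can appear twice.
simple = rng.sample(rows, 10)

# Sampling with replacement (a bootstrap resample): repeats are allowed.
bootstrap = rng.choices(rows, k=len(rows))


def stratified_sample(rows, key, fraction, rng):
    """Sample the same fraction from every stratum, at least one row each."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    picked = []
    for group in strata.values():
        n = max(1, round(len(group) * fraction))
        picked.extend(rng.sample(group, n))
    return picked


stratified = stratified_sample(rows, key=lambda r: r[0], fraction=0.1, rng=rng)
south_count = sum(1 for region, _ in stratified if region == "south")
print(len(stratified), south_count)  # 10 1
```

The simple random sample of 10 may contain no "south" rows at all; the stratified one contains one by construction.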

Compression: shrinking the bytes

Two flavours, and the whole decision is which one you can afford.

Lossless — every original bit is recoverable. This is for text, numeric tables, and code (zip, gzip).
Lossy — it throws away detail to shrink harder. This is for images, audio, and video (JPEG, MP3), where tiny reconstruction errors are invisible to a human.

The rule of thumb writes itself: tabular and textual data must be lossless; only perceptual media can be lossy.
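A short sketch of the difference, using Python's built-in zlib for the lossless case. The lossy half is only a stand-in: rounding readings to the nearest 10 is an assumption chosen to show information being discarded, and is not how JPEG or MP3 actually work.

```python
import zlib

# Lossless: a repetitive CSV-like table compresses well and round-trips exactly.
text = ("age,income\n" + "42,550000\n" * 200).encode("utf-8")
packed = zlib.compress(text)
restored = zlib.decompress(packed)
print(len(packed) < len(text), restored == text)  # True True

# Lossy stand-in: quantize each reading to the nearest 10.
readings = [101, 118, 123, 137]
quantized = [round(r / 10) * 10 for r in readings]
print(quantized)  # [100, 120, 120, 140]
```

After quantizing, 118 and 123 are both stored as 120, so the original readings cannot be recovered. That is tolerable for a pixel and unacceptable for an account balance.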

How GATE asks this

Almost always a NAT — a column value, a min/max or a mean/SD, and “compute the normalized value to 3 decimals.” Occasionally an MCQ on which rescaling preserves which property, or an MSQ listing sampling methods or compression facts. Scan the question for μ, σ, min, max: those four symbols tell you instantly which formula the problem wants.

Worked example — GATE DA 2024, Q17

A person’s salary is ₹106000. The population has mean μ = ₹96000 and standard deviation σ = ₹21000. Find the z-score of this salary.

Plug straight into the standardization formula and reduce one step at a time:

z = (x − μ) / σ
  = (106000 − 96000) / 21000
  =      10000       / 21000
  ≈ 0.476

So z ≈ 0.476. This is the real GATE DA 2024, Q17. The salary is about half a standard deviation above the mean: comfortably above average, yet well inside the typical range.
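If you want to check the arithmetic, it is three lines of Python; the variable names are simply the symbols from the question.

```python
x, mu, sigma = 106000, 96000, 21000

z = (x - mu) / sigma          # 10000 / 21000
z_rounded = round(z, 3)
print(z_rounded)  # 0.476
```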

In one breath

Four prep moves ready raw data for analysis: rescaling, where min-max maps a column onto a bounded [0, 1] and z-score centers it at mean 0 with SD 1 but leaves it unbounded and possibly negative; discretization, which buckets a number by equal width or equal frequency; sampling, which picks a representative subset (random or stratified, with or without replacement); and compression, which shrinks the bytes losslessly for tables and text and lossily only for perceptual media.

Practice

Quick check

Q1. Recall: Which statements about min-max vs z-score are TRUE? (select all that apply)

Q2. Recall: Which sampling methods can a GATE-style data-prep question reasonably name? (select all that apply)

Q3. Trace: A column has min = 20 and max = 80. Min-max normalize the value 50, to 3 decimals. (numerical answer)

Q4. Trace: A column has mean μ = 50 and standard deviation σ = 10. What is the z-score of the value 35? (numerical answer, 1 decimal)

Q5. Trace: Three equal-width bins for ages in [18, 90] give bin boundaries at?

Q6. Apply: Which is the right kind of compression for storing a relational table of customer records?

Q7. Create: You are feeding pixel intensities into a neural network whose input layer expects values strictly within [0, 1], and the data has no extreme outliers. Which rescaling fits best?

A question to carry forward

So the data is clean now — rescaled, binned where it helps, sampled to a workable size. It is finally fit to analyse. But fit to analyse where?

Running heavy “total sales by category, by state, by month, across five years” queries against the live operational database would crawl, and would fight with the checkout traffic for the same rows. Analytics needs its own home, a store built for big read queries rather than tiny writes — and inside it the tables are deliberately shaped, not in the tidy normalized form you just spent two lessons perfecting, but in a layout that minimises joins on read. Here is the thread onward: what does that analytics-first store look like, why does it sometimes choose redundancy on purpose, and what are the two standard table shapes it picks between?
