What is the difference between supervised, unsupervised, and reinforcement learning?

Supervised learning trains on labeled input-output pairs to predict a target. Unsupervised learning finds structure in unlabeled data. Reinforcement learning trains an agent to maximize cumulative reward through trial-and-error interaction with an environment.

Why do we split data into train, validation, and test sets, and what are the typical proportions?

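The train set fits the model, the validation set tunes hyperparameters and guides model selection, and the held-out test set provides an unbiased estimate of final generalization error. Using the test set during development causes optimistic bias because the evaluation signal leaks into decisions. There is no single correct ratio: common choices give roughly 60-80% of the data to training and divide the rest between validation and test (for example 70/15/15 or 80/10/10), and very large datasets can afford much smaller validation and test fractions.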

Walk me through how you'd select between competing models without fooling yourself with data leakage.

Split data into train, validation, and test sets (or use cross-validation), tune and compare models only on train/validation, and touch the test set exactly once at the end. Fit all preprocessing inside the cross-validation pipeline so transformers never see validation data, and for tuning plus honest evaluation use nested cross-validation. For time series, use forward-chaining splits to avoid leaking future information.
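One way to picture "fit all preprocessing inside the cross-validation pipeline" is the minimal sketch below. It uses only the Python standard library, and the helper names (kfold_indices, fit_scaler, transform) and the toy feature column are illustrative assumptions rather than any library's API. The point it shows is that scaling statistics are learned from the training fold alone and then merely applied to the validation fold.

```python
from statistics import mean, pstdev

def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def fit_scaler(values):
    """Learn mean and standard deviation from the training fold only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against a constant column
    return mu, sigma

def transform(values, scaler):
    """Standardize values with statistics that were learned elsewhere."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

feature = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]  # toy single-feature column (made up)

for train_idx, val_idx in kfold_indices(len(feature), k=3):
    train = [feature[i] for i in train_idx]
    val = [feature[i] for i in val_idx]
    scaler = fit_scaler(train)            # statistics come from train rows only
    train_scaled = transform(train, scaler)
    val_scaled = transform(val, scaler)   # validation rows are transformed, never fitted
```

Fitting the scaler once on the full column before splitting would let the outlier 100.0 influence every fold, which is exactly the leak the answer warns about.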

Bagging vs boosting — how do they differ, and when does each help?

Bagging trains many independent models in parallel on bootstrap samples and averages them, which mainly reduces variance; boosting trains models sequentially so each corrects its predecessor's errors, which mainly reduces bias. Use bagging (e.g. random forests) when your base learner is high-variance and overfits; use boosting (e.g. gradient boosting) when you need to squeeze out bias and maximize accuracy, accepting more tuning and overfitting risk.
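The bagging half of that answer can be sketched in a few lines. This is a toy illustration under stated assumptions: the base learner is a 1-nearest-neighbour regressor chosen because it is high-variance, the data points are made up, and the function names are hypothetical. Boosting is not shown here.

```python
import random

def one_nn_predict(train, x):
    """Predict with the label of the single nearest training point (a high-variance learner)."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def bagged_predict(train, x, n_models=25, seed=0):
    """Average the 1-NN prediction over bootstrap resamples of the training set."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # Bootstrap: draw len(train) points with replacement.
        sample = [rng.choice(train) for _ in train]
        predictions.append(one_nn_predict(sample, x))
    return sum(predictions) / len(predictions)

# Toy noisy data: (x, y) pairs, roughly y = x plus noise (made up).
data = [(0.0, 0.3), (1.0, 0.8), (2.0, 2.4), (3.0, 2.7), (4.0, 4.5)]
single = one_nn_predict(data, 2.2)   # copies one noisy label
bagged = bagged_predict(data, 2.2)   # averages labels across resamples
```

Each resample can be built independently of the others, which is why bagging parallelizes; the averaging step is what damps the variance of the individual predictors.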

Supervised vs Unsupervised; Train/Test — GATE DA

What you'll learn

Supervised learning uses labelled data: regression for continuous targets, classification for discrete ones

Unsupervised learning has no labels: clustering and dimensionality reduction find structure on their own

Features are the inputs, the label is the target; a model learns the mapping from one to the other

The train / validation / test split — and why the test set must stay unseen to measure true generalization

The database chapter answered one kind of question superbly: what already happened? Sum last year’s sales, average them by region, roll them up to the quarter. But it left us hungry for the question no SELECT can reach — what happens next? Hand the warehouse a house’s size and locality and ask its likely selling price, and every query falls silent, because that price sits in no row anywhere. It has to be predicted.

Predicting it means letting a program learn a pattern from data instead of being handed the rule. And before a single formula, you only need one distinction to organise the whole field: does your data come with answers attached, or not? That one question cuts machine learning cleanly in two.

Supervised vs unsupervised — the great divide

In supervised learning, every training example arrives with a known answer, called a label. The model studies inputs paired with their correct outputs and learns to reproduce that mapping on new, unseen inputs. If the label is a continuous number — a price, a temperature — the task is regression; if it is a discrete class — spam or not-spam, the digit 0 through 9 — it is classification.

In unsupervised learning there are no labels, only the inputs. The model has to find structure unaided: clustering gathers similar points into groups, and dimensionality reduction squeezes many features down to a few while holding on to the variation that matters.

The map of the ML block: labels present means supervised; absent means unsupervised.

Features, labels, and what a model learns

A dataset is a table. Each row is one example, and each column is a measured quantity. The input columns are the features (often written X), and in supervised learning the answer column is the label or target (y). A supervised model learns a function mapping features to label, f(X) → y, so that when a brand-new row turns up with features but no answer, it can fill the answer in. Unsupervised methods see only X, and must invent the structure for themselves.
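To make the X and y vocabulary concrete, here is a small sketch that separates a table into features and label, then fills in the answer for a new row. The numbers are invented, and the "model" is a deliberately crude stand-in (it copies the label of the closest known row), not a method the chapter prescribes.

```python
# Each row: (size_sqft, distance_to_centre_km, price). All values are made up.
rows = [
    (800, 5.0, 60.0),
    (1200, 3.0, 95.0),
    (1500, 8.0, 100.0),
]

X = [row[:-1] for row in rows]   # features: every column except the last
y = [row[-1] for row in rows]    # label: the answer column

def f(features, X_train=X, y_train=y):
    """A stand-in for a learned mapping: return the label of the closest known row."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(X_train)), key=lambda i: squared_distance(X_train[i], features))
    return y_train[best]

new_house = (1250, 4.0)          # features present, answer missing
predicted_price = f(new_house)   # the model fills the answer in
```

An unsupervised method would receive only X from this table; the y list would simply not exist.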

Train, validation, test — and the golden rule

You cannot judge a model by how well it memorises the data it studied — anyone can ace an open-book test on questions they have already seen. So we split the data into three disjoint parts, each with one job.

Train to fit, validate to tune, test to report — and never the other way around.

Training set — the model fits its parameters here.
Validation set — used to tune the knobs (which model, how complex) during development, so you can compare options without ever touching the test set.
Test set — a sealed sample, opened once, to estimate how the model behaves on genuinely new data. That single number is your honest generalization score.
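The three-way split above can be sketched with the standard library alone. The 70/15/15 fractions, the seed, and the function name three_way_split are assumptions chosen for illustration; the only property that matters is that the three parts are disjoint and together cover the data.

```python
import random

def three_way_split(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows and cut them into disjoint train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # whatever remains is the sealed test set
    return train, val, test

# 1000 sample ids split into 700 / 150 / 150.
train, val, test = three_way_split(range(1000))
```

Shuffling before cutting is appropriate when rows are independent; data with a time order would need a chronological cut instead.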

A model that shines on training data but stumbles on unseen data is overfitting — it memorised noise instead of signal. Its opposite, underfitting, is a model too simple to catch the real pattern, so it does poorly even on the data it trained on. The validation set is how you steer between the two.
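The steering described above amounts to reading two scores side by side. The sketch below encodes that reading as a function; the name diagnose and both thresholds (gap_tol, min_train) are illustrative assumptions, not standard values, and real cut-offs depend on the task.

```python
def diagnose(train_score, val_score, gap_tol=0.10, min_train=0.80):
    """Rough reading of a train/validation score pair.

    gap_tol and min_train are illustrative thresholds, not standard values.
    """
    if train_score < min_train:
        return "underfitting"      # poor even on the data it trained on
    if train_score - val_score > gap_tol:
        return "overfitting"       # strong on train, weak on unseen data
    return "reasonable fit"

verdict = diagnose(0.99, 0.62)     # a wide gap between seen and unseen data
```

With these assumed thresholds, a 0.99 training score against 0.62 on held-out data is flagged as overfitting, while 0.60 against 0.58 is flagged as underfitting.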

How GATE asks this

Almost always a conceptual MCQ or MSQ: classify a list of tasks as regression / classification / clustering, or pick the true statements about the train-test split (for instance, “the test set may be used to tune hyperparameters” is false). Occasionally a tiny NAT asks for a split size — 1000 samples split 80/20 puts 200 in the test set. Read every option for the word that fixes the family: a continuous target means regression, a class label means classification, no labels at all means unsupervised.

Worked example — name that task

Classify each task by its learning family.

House-price prediction from size & location  →  supervised · REGRESSION
        (target is a continuous number — a price)

Email spam detection (spam / not-spam)        →  supervised · CLASSIFICATION
        (target is a discrete label — two classes)

Customer segmentation from purchase history   →  unsupervised · CLUSTERING
        (no labels given — the groups are discovered)

Compressing 100 sensor readings to 2 axes     →  unsupervised · DIM. REDUCTION
        (no target — just summarise the variation)

The deciding question every time: is there a known answer column? If yes, is it a number (regression) or a category (classification)? If no, the task is unsupervised — and then it is clustering if you are grouping points, dimensionality reduction if you are compressing features.
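That deciding question is small enough to write down as a function. This is only a sketch of the decision procedure in the paragraph above; the function name and argument names are hypothetical.

```python
def learning_family(has_label, label_is_continuous=False, goal="group"):
    """Name the task family from the questions asked in the worked example.

    goal is only consulted when there is no label: "group" or "compress".
    """
    if has_label:
        return "regression" if label_is_continuous else "classification"
    if goal == "group":
        return "clustering"
    if goal == "compress":
        return "dimensionality reduction"
    raise ValueError("goal must be 'group' or 'compress'")

house_price = learning_family(has_label=True, label_is_continuous=True)
spam_filter = learning_family(has_label=True, label_is_continuous=False)
segmentation = learning_family(has_label=False, goal="group")
sensor_compression = learning_family(has_label=False, goal="compress")
```

The four calls mirror the four tasks in the worked example, in the same order.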

In one breath

Machine learning predicts what no stored row holds by learning a pattern from examples, and it splits in two on a single question — supervised if each example carries a label (regression for a continuous target, classification for a discrete class), unsupervised if it does not (clustering to group, dimensionality reduction to compress); whichever family, you protect the verdict by splitting the data into a training set to fit, a validation set to tune, and a test set opened exactly once, because a wide train-to-test gap is the signature of overfitting.

Practice

Quick check

Q1 (Recall): Which of the following are SUPERVISED learning tasks? (select all that apply)

Q2 (Recall): Which statements about the train / validation / test split are TRUE? (select all that apply)

Q3 (Trace): A dataset of 1000 samples is split 80/20 into train and test. How many samples are in the test set? (numerical answer)

Q4 (Apply): A model predicts tomorrow's temperature in °C from today's weather readings. Which best describes this task?

Q5 (Create): A model scores 99% accuracy on its training data but only 62% on unseen test data. This is a classic sign of:

A question to carry forward

So the first supervised family is regression: a continuous answer, learned from labelled examples. House size in, price out. But “learn the mapping f(X) → y” is still just a slogan — what does the simplest honest version of it actually look like?

Strip it to the bone. One input feature, one numeric target, and the plainest shape a relationship can take: a straight line. Fit a line through a scatter of points and you can read a prediction straight off it. Here is the thread onward: out of the infinitely many lines you could draw through a cloud of points, what makes one line the best fit — and is there a formula that hands it to you directly?
