Supervised vs Unsupervised; Train/Test
The two great families of machine learning, and the one discipline every experiment obeys: keep the test set unseen until the very end.
What you'll learn
- Supervised learning uses labelled data: regression for continuous targets, classification for discrete ones
- Unsupervised learning has no labels: clustering and dimensionality reduction find structure on their own
- Features are the inputs, the label is the target; a model learns the mapping from one to the other
- The train / validation / test split — and why the test set must stay unseen to measure true generalization
Before you start
Machine learning is the art of letting a program learn a pattern from data instead of being told the rule by hand. Before any formula, you only need one distinction: does your data come with answers attached, or not? That single question splits the entire field in two.
Supervised vs unsupervised — the great divide
In supervised learning each training example carries a known answer, called a label. The model sees inputs paired with the correct output and learns to reproduce that mapping on new, unseen inputs. If the label is a continuous number (a price, a temperature) the task is regression; if it is a discrete class (spam or not-spam, the digit 0-9) it is classification.
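A minimal sketch of the two supervised flavours, assuming a made-up dataset of floor areas. The helper names (`fit_line`, `fit_centroids`, `predict_class`) and all numbers are illustrative assumptions, not part of any library: the same feature yields regression when the label is a number and classification when the label is a category.

```python
# Toy data (hypothetical): one feature, two different kinds of label.
sizes = [50.0, 80.0, 100.0, 120.0]            # feature: floor area
prices = [150.0, 240.0, 300.0, 360.0]         # continuous label -> regression
kinds = ["small", "small", "large", "large"]  # discrete label -> classification

def fit_line(xs, ys):
    """Least-squares fit for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def fit_centroids(xs, ys):
    """Average feature value per class label."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(v) / len(v) for label, v in groups.items()}

def predict_class(x, centroids):
    """Pick the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Regression: the prediction is a number.
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 90.0 + intercept

# Classification: the prediction is one of the known classes.
centroids = fit_centroids(sizes, kinds)
predicted_kind = predict_class(90.0, centroids)
```

Both models learn from labelled pairs; only the type of the label, and therefore of the output, differs.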
In unsupervised learning there are no labels — just the inputs. The model hunts for structure on its own: clustering groups similar points together, and dimensionality reduction compresses many features into a few while keeping the important variation.
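To make clustering concrete, here is a small sketch of one common clustering method, k-means, on a single feature. The function `kmeans_1d`, the spending figures, and the starting centres are all assumptions chosen for illustration. Note that no labels are passed in anywhere.

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny 1-D k-means. `centers` holds the initial guesses."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical monthly spend for six customers; no label column exists.
spend = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(spend, centers=[0.0, 5.0])
```

The two groups (low spenders and high spenders) are discovered from the inputs alone; nobody told the algorithm which customer belongs where.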
Features, labels, and what a model learns
A dataset is a table. Each row is one example; each column is a measured
quantity. The input columns are the features (often written X), and in
supervised learning the answer column is the label or target (y). A
supervised model learns a function that maps features to label, f(X) → y,
so that when a brand-new row arrives with features but no answer, it can predict
one. Unsupervised methods see only X and must invent the structure themselves.
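The table picture can be sketched in a few lines. The column meanings, the values, and the `fit_1nn` helper are hypothetical; a 1-nearest-neighbour rule stands in for "a model" only because it is short, not because it is the method to prefer.

```python
# Each row is one example: (hours_studied, hours_slept, result).
# All values are made up for illustration.
table = [
    (1.0, 4.0, "fail"),
    (2.0, 5.0, "fail"),
    (6.0, 7.0, "pass"),
    (8.0, 8.0, "pass"),
]

X = [row[:-1] for row in table]  # feature columns
y = [row[-1] for row in table]   # label column

def fit_1nn(X, y):
    """Return f where f(x) is the label of the closest training row."""
    def f(x):
        def sq_dist(row):
            return sum((a - b) ** 2 for a, b in zip(row, x))
        best = min(range(len(X)), key=lambda i: sq_dist(X[i]))
        return y[best]
    return f

f = fit_1nn(X, y)

# A brand-new row arrives with features but no answer.
new_row = (7.0, 6.0)
prediction = f(new_row)
```

Here `f` plays the role of f(X) → y: it is built from labelled rows and then applied to a row whose label is unknown.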
Train, validation, test — and the golden rule
You cannot judge a model by how well it memorizes the data it studied — anyone can ace an open-book test on questions they have already seen. So we split the data into three disjoint parts:
- Training set — the model fits its parameters here.
- Validation set — used to tune knobs (which model, how complex) during development, so you can compare options without touching the test set.
- Test set — a sealed sample, opened once, to estimate how the model will behave on genuinely new data. This number is your honest generalization score.
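The three-way split above can be sketched with the standard library alone. The function name `train_val_test_split`, the 60/20/20 proportions, and the fixed seed are assumptions for illustration; real projects often use a library helper and different ratios.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into three disjoint parts."""
    rows = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

data = list(range(100))  # stand-in for 100 examples
train, val, test = train_val_test_split(data)
```

Shuffling before cutting matters when the rows arrive in some order (by date, by class); slicing without it could put one kind of example entirely in one part.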
A model that performs well on training data but poorly on unseen data is overfitting: it has memorized noise instead of the signal. The opposite, underfitting, is a model too simple to capture the real pattern, so it performs poorly even on the training data. Comparing scores on the training and validation sets is how you steer between the two.
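A rough illustration of both failure modes, on hand-made data where the true pattern is y = 2x and the training labels carry noise of ±1. The three models (a constant, a straight line, a memorizer) and every number here are assumptions picked to make the contrast visible, not a benchmark.

```python
def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_mean(xs, ys):
    """Too simple: always predicts the average training label."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return lambda x: slope * x + (my - slope * mx)

def fit_memorizer(xs, ys):
    """Too flexible: returns the label of the nearest training point."""
    return lambda x: ys[min(range(len(xs)), key=lambda i: abs(x - xs[i]))]

# Training labels are 2x plus noise of +1 or -1; validation labels are 2x.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
val_x = [0.4, 1.4, 2.4, 3.4, 4.4]
val_y = [0.8, 2.8, 4.8, 6.8, 8.8]

scores = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    scores[name] = (mse(model, train_x, train_y), mse(model, val_x, val_y))
```

In `scores`, each entry is (training error, validation error). The memorizer reaches zero training error yet does worse than the line on validation data (overfitting), while the constant model has a large error on both (underfitting).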
How GATE asks this
This topic is almost always a conceptual MCQ or MSQ: classify a list of tasks as regression / classification / clustering, or pick the true statements about the train-test split (for example, “the test set may be used to tune hyperparameters” is false). Occasionally a tiny NAT asks for a split size — if 1000 samples are split 80/20, the test set has 200. Read every option for the word that fixes the family: a continuous target means regression, a class label means classification, no labels at all means unsupervised.
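The split-size arithmetic from the NAT example, written out as a quick check. The variable names are arbitrary; integer arithmetic is used so no rounding question arises.

```python
n_samples = 1000
test_percent = 20  # an 80/20 split puts 20% in the test set

n_test = n_samples * test_percent // 100
n_train = n_samples - n_test
```

With these numbers `n_test` is 200 and `n_train` is 800, matching the figure quoted above.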
Worked example — name that task
Classify each task by family:
House-price prediction from size & location → supervised · REGRESSION
(target is a continuous number — a price)
Email spam detection (spam / not-spam) → supervised · CLASSIFICATION
(target is a discrete label — two classes)
Customer segmentation from purchase history → unsupervised · CLUSTERING
(no labels given — the groups are discovered)
Compressing 100 sensor readings to 2 axes → unsupervised · DIM. REDUCTION
(no target — just summarise the variation)
The deciding question each time: is there a known answer column? If yes, is it a number (regression) or a category (classification)? If no, the task is unsupervised.
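The deciding question can be written as a tiny decision function. `task_family` and its two boolean parameters are hypothetical names used only to mirror the reasoning above.

```python
def task_family(has_label, label_is_numeric=False):
    """Map the two deciding questions to a task family."""
    if not has_label:
        # No answer column: clustering, dimensionality reduction, etc.
        return "unsupervised"
    if label_is_numeric:
        return "supervised: regression"
    return "supervised: classification"

house_prices = task_family(has_label=True, label_is_numeric=True)
spam_filter = task_family(has_label=True, label_is_numeric=False)
segmentation = task_family(has_label=False)
```

The function only separates supervised from unsupervised; telling clustering apart from dimensionality reduction still depends on whether the goal is grouping points or compressing features.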
Practice this in an interview
Supervised learning trains on labeled input-output pairs to predict a target. Unsupervised learning finds structure in unlabeled data. Reinforcement learning trains an agent to maximize cumulative reward through trial-and-error interaction with an environment.
The train set fits the model, the validation set tunes hyperparameters and guides model selection, and the held-out test set provides an unbiased estimate of final generalization error. Using the test set during development causes optimistic bias because the evaluation signal leaks into decisions.
Deep learning wins when data is abundant, inputs are unstructured (images, text, audio), and features are hard to engineer by hand. Classical ML wins on structured tabular data, small datasets, and when interpretability or training speed matter.
Small labelled datasets call for a layered strategy: transfer learning from a pretrained backbone, heavy data augmentation, self-supervised pretraining on unlabelled data, and regularisation to prevent the model memorising the few examples it sees.