datarekha

Supervised vs Unsupervised; Train/Test

The two great families of machine learning, and the one discipline every experiment obeys: keep the test set unseen until the very end.

6 min read Beginner GATE DA Lesson 77 of 122

What you'll learn

  • Supervised learning uses labelled data: regression for continuous targets, classification for discrete ones
  • Unsupervised learning has no labels: clustering and dimensionality reduction find structure on their own
  • Features are the inputs, the label is the target; a model learns the mapping from one to the other
  • The train / validation / test split — and why the test set must stay unseen to measure true generalization

Before you start

Machine learning is the art of letting a program learn a pattern from data instead of being told the rule by hand. Before any formula, you only need one distinction: does your data come with answers attached, or not? That single question splits the entire field in two.

Supervised vs unsupervised — the great divide

In supervised learning, each training example carries a known answer, called a label. The model sees inputs paired with the correct output and learns to reproduce that mapping on new, unseen inputs. If the label is a continuous number (a price, a temperature) the task is regression; if it is a discrete class (spam or not-spam, a digit from 0 to 9) it is classification.

In unsupervised learning there are no labels — just the inputs. The model hunts for structure on its own: clustering groups similar points together, and dimensionality reduction compresses many features into a few while keeping the important variation.

Machine Learning
  • Supervised (labels)
      • Regression: continuous target
      • Classification: discrete class
  • Unsupervised (no labels)
      • Clustering: group similar points
      • Dimensionality reduction: compress features
The map of the ML block: labels present means supervised; absent means unsupervised.

Features, labels, and what a model learns

A dataset is a table. Each row is one example; each column is a measured quantity. The input columns are the features (often written X), and in supervised learning the answer column is the label or target (y). A supervised model learns a function that maps features to label, f(X) → y, so that when a brand-new row arrives with features but no answer, it can predict one. Unsupervised methods see only X and must invent the structure themselves.
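
As a minimal sketch of the table view described above, the snippet below separates a small table into features X and label y. The column names and values are invented for illustration and do not come from any real dataset.

```python
# Hypothetical toy table: each dict is one row (one example).
rows = [
    {"size_sqft": 850, "bedrooms": 2, "price": 120_000},
    {"size_sqft": 1200, "bedrooms": 3, "price": 185_000},
    {"size_sqft": 1500, "bedrooms": 3, "price": 230_000},
]

FEATURE_COLUMNS = ["size_sqft", "bedrooms"]  # inputs (X)
LABEL_COLUMN = "price"                       # target (y)


def split_features_and_label(table, feature_columns, label_column):
    """Return (X, y): X is a list of feature rows, y the matching labels."""
    X = [[row[col] for col in feature_columns] for row in table]
    y = [row[label_column] for row in table]
    return X, y


X, y = split_features_and_label(rows, FEATURE_COLUMNS, LABEL_COLUMN)
print(X)  # [[850, 2], [1200, 3], [1500, 3]]
print(y)  # [120000, 185000, 230000]
```

A supervised learner would be given both X and y; an unsupervised method would be given X only.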

Train, validation, test — and the golden rule

You cannot judge a model by how well it memorizes the data it studied — anyone can ace an open-book test on questions they have already seen. So we split the data into three disjoint parts:

Training set (~60-80%): fit the model here | Validation (~10-20%): tune / choose here | Test (sealed): judge once here
The test set is opened exactly once, at the very end.
Train to fit, validate to tune, test to report — and never the other way around.
  • Training set — the model fits its parameters here.
  • Validation set — used to tune knobs (which model, how complex) during development, so you can compare options without touching the test set.
  • Test set — a sealed sample, opened once, to estimate how the model will behave on genuinely new data. This number is your honest generalization score.
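
The three-way split above can be sketched in plain Python as follows. The function name, the 60/20/20 default fractions, and the fixed seed are assumptions chosen for illustration; real projects often use a library helper instead.

```python
import random


def train_val_test_split(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into three disjoint parts: train, validation, test."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("val_fraction + test_fraction must be in [0, 1)")
    shuffled = list(examples)              # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n = len(shuffled)
    n_test = round(n * test_fraction)
    n_val = round(n * val_fraction)
    test = shuffled[:n_test]               # sealed until the final evaluation
    val = shuffled[n_test:n_test + n_val]  # used for tuning and model choice
    train = shuffled[n_test + n_val:]      # used to fit parameters
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before cutting matters when the rows arrive in some order (by date or by class, say), because an unshuffled cut could leave the test set unrepresentative. For time-ordered data a chronological cut is usually the safer choice.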

A model that does great on training data but poorly on unseen data is overfitting — it memorized noise instead of the signal. The opposite, underfitting, is a model too simple to capture the real pattern, so it does poorly even on the training data. The validation set is how you steer between the two.
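
A rough way to read the two failure modes off a pair of scores is sketched below. The function name and both thresholds are illustrative assumptions, not standard values; in practice the cut-offs depend on the problem.

```python
def diagnose_fit(train_score, val_score, good_enough=0.80, max_gap=0.10):
    """Rough heuristic; both thresholds are illustrative assumptions."""
    if train_score < good_enough:
        return "underfitting"    # poor even on the data it studied
    if train_score - val_score > max_gap:
        return "overfitting"     # strong on training data, weak on unseen data
    return "reasonable fit"


print(diagnose_fit(0.99, 0.62))  # overfitting
print(diagnose_fit(0.58, 0.56))  # underfitting
print(diagnose_fit(0.91, 0.89))  # reasonable fit
```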

How GATE asks this

This topic is almost always a conceptual MCQ or MSQ: classify a list of tasks as regression / classification / clustering, or pick the true statements about the train-test split (for example, “the test set may be used to tune hyperparameters” is false). Occasionally a tiny NAT asks for a split size — if 1000 samples are split 80/20, the test set has 200. Read every option for the word that fixes the family: a continuous target means regression, a class label means classification, no labels at all means unsupervised.
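
The split-size arithmetic in that NAT example can be checked with a few lines. The helper name is hypothetical; it assumes the split is given as a whole-number percentage.

```python
def split_sizes(n_samples, test_percent):
    """Return (n_train, n_test) for a train/test split given as a whole-number percent."""
    n_test = n_samples * test_percent // 100  # integer arithmetic avoids float rounding
    return n_samples - n_test, n_test


print(split_sizes(1000, 20))  # (800, 200)
```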

Worked example — name that task

Classify each task by family:

House-price prediction from size & location  →  supervised · REGRESSION
        (target is a continuous number — a price)

Email spam detection (spam / not-spam)        →  supervised · CLASSIFICATION
        (target is a discrete label — two classes)

Customer segmentation from purchase history   →  unsupervised · CLUSTERING
        (no labels given — the groups are discovered)

Compressing 100 sensor readings to 2 axes     →  unsupervised · DIM. REDUCTION
        (no target — just summarise the variation)

The deciding question each time: is there a known answer column? If yes, is it a number (regression) or a category (classification)? If no, the task is unsupervised.
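
That deciding question can be written as a tiny decision function. This is only a sketch of the rule of thumb stated above; the function and argument names are made up for illustration.

```python
def task_family(has_label, label_is_numeric=False):
    """Apply the rule: known answer column? If so, number or category?"""
    if not has_label:
        return "unsupervised"
    if label_is_numeric:
        return "supervised: regression"
    return "supervised: classification"


print(task_family(True, label_is_numeric=True))   # house-price prediction
print(task_family(True, label_is_numeric=False))  # spam detection
print(task_family(False))                         # customer segmentation
```

The rule stops at "unsupervised" because telling clustering from dimensionality reduction depends on the goal (grouping points versus compressing features), not on the label column.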

Quick check

Q1. A model predicts tomorrow's temperature in °C from today's weather readings. Which best describes this task?
Q2. Which of the following are SUPERVISED learning tasks? (select all that apply)
Q3. Which statements about the train / validation / test split are TRUE? (select all that apply)
Q4. A dataset of 1000 samples is split 80/20 into train and test. How many samples are in the test set? (numerical answer)
Q5. A model scores 99% accuracy on its training data but only 62% on unseen test data. This is a classic sign of:
