What is the difference between classification and regression, and how do you choose between them?
Classification predicts a discrete class label; regression predicts a continuous numeric value. The choice is determined by the nature of the target variable, not by the algorithm family — many algorithms (e.g., decision trees, neural nets) handle both.
How to think about it
The distinction is about the output space.
Classification — the target y is a category drawn from a finite set. Binary classification has two classes (fraud / not-fraud); multi-class has more (digit 0–9); multi-label allows multiple simultaneous classes (image tags). The model typically outputs a probability distribution over classes, and a threshold or argmax converts it to a label. Key metrics: accuracy, precision, recall, F1, AUC-ROC.
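To make the probability-to-label step concrete, here is a minimal sketch using NumPy. The probability values and the 0.5 and 0.3 thresholds are made up for illustration; in practice they would come from a fitted model and from the cost of each kind of error.

```python
import numpy as np

# Hypothetical predicted probabilities: 3 samples, 3 classes (each row sums to 1).
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Multi-class: argmax picks the most probable class for each sample.
multiclass_labels = proba.argmax(axis=1)  # [0, 2, 1]

# Binary: compare the positive-class probability with a threshold.
p_fraud = np.array([0.05, 0.40, 0.90])  # hypothetical P(fraud) per transaction
labels_default = (p_fraud >= 0.5).astype(int)  # [0, 0, 1]

# A lower threshold flags more cases as positive (higher recall, usually lower precision).
labels_lenient = (p_fraud >= 0.3).astype(int)  # [0, 1, 1]
```

The threshold is a decision parameter, not something the model learns, so it can be tuned after training without refitting.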
Regression — the target y is a real number (or vector of real numbers). Predicting tomorrow’s closing price, estimating a patient’s blood-glucose level, or forecasting demand in units are all regression problems. Key metrics: MAE, RMSE, R².
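As a rough illustration of the three regression metrics, the sketch below computes them by hand with NumPy. The `y_true` and `y_pred` arrays are invented numbers, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_pred - y_true  # [-1, 0, 1, 2]

# MAE: mean of absolute errors, in the same units as the target.
mae = np.mean(np.abs(errors))  # 1.0

# RMSE: square root of the mean squared error; penalizes large errors more than MAE.
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt(1.5), about 1.2247

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)                    # 6.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20.0
r2 = 1.0 - ss_res / ss_tot                      # 0.7
```

None of these metrics is meaningful for class labels, which is one practical reason the choice of framing matters before any model is chosen.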
How to decide:
| Signal | Use |
|---|---|
| Target is a label or category | Classification |
| Target is a quantity on a continuous scale | Regression |
| Target is an ordered category (poor/fair/good) | Ordinal regression or classification with ordered labels |
| Target is a count (non-negative integer) | Count regression (e.g., Poisson), not ordinary least-squares regression |
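
The count row in the table can be sketched as follows. This is one possible illustration, assuming scikit-learn's `PoissonRegressor` and a synthetic dataset whose rate grows exponentially with a single feature; the coefficient 0.8 and the test points are arbitrary choices, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor

# Synthetic counts: y ~ Poisson(exp(0.8 * x)), an assumption made for this demo.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 1))
y = rng.poisson(lam=np.exp(0.8 * X[:, 0]))

ols = LinearRegression().fit(X, y)
pois = PoissonRegressor(alpha=0.0).fit(X, y)  # alpha=0.0 disables the L2 penalty

# Evaluate both models, including points outside the training range.
X_new = np.array([[-3.0], [0.0], [3.0]])
ols_pred = ols.predict(X_new)    # a straight line, which can dip below zero
pois_pred = pois.predict(X_new)  # log link keeps predictions strictly positive

print("OLS:    ", ols_pred)
print("Poisson:", pois_pred)
```

On data like this, the linear fit is expected to return a negative "count" at the low end, while the Poisson model cannot, because it predicts the exponential of a linear function.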
Some problems admit both framings: predicting whether revenue exceeds $1 M is classification; predicting the revenue itself is regression. Choose the framing that matches the downstream decision.
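The two framings of the revenue example differ only in how the target is built. A minimal sketch, with invented revenue figures:

```python
import numpy as np

# Hypothetical annual revenue per customer, in dollars.
revenue = np.array([250_000.0, 1_200_000.0, 980_000.0, 3_400_000.0])

# Regression framing: predict the amount itself.
y_regression = revenue

# Classification framing: predict whether revenue exceeds $1M.
y_classification = (revenue > 1_000_000).astype(int)  # [0, 1, 0, 1]
```

A regression prediction can always be thresholded into a class afterwards, but a class label cannot be turned back into an amount, so the classification framing discards information that some downstream decisions may need.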
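from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: the synthetic target y_c holds class labels (0 or 1).
X_c, y_c = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression().fit(X_c, y_c)
print(clf.predict(X_c[:3]))  # discrete class labels

# Regression: the synthetic target y_r holds real-valued numbers.
X_r, y_r = make_regression(n_samples=500, random_state=0)
reg = LinearRegression().fit(X_r, y_r)
print(reg.predict(X_r[:3]))  # continuous values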