When should you use RMSE versus MAE for regression evaluation, and what does R-squared actually tell you?
RMSE (Root Mean Squared Error) penalises large errors quadratically, making it sensitive to outliers and appropriate when big deviations are disproportionately costly. MAE (Mean Absolute Error) treats all errors linearly, is more robust to outliers, and is easier to interpret in the units of the target. R-squared measures the proportion of target variance explained by the model. A value near 1 is desirable, but it can be high even for a model with large absolute errors if the target variance is high, so on its own it says nothing about prediction error magnitude.
How to think about it
Cover the four main regression metrics — RMSE, MAE, MAPE, R² — with the decision rule for choosing between them.
The metrics
MAE (Mean Absolute Error) = (1/n) * sum |y - y_hat|
Linear penalty. Robust to outliers. Units match the target (e.g., dollars, kilograms). The optimal point prediction under MAE is the (conditional) median, whereas under MSE it is the (conditional) mean.
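The median-versus-mean property can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up target vector with one outlier; the helper names `mae` and `mse` are illustrative, not from any particular library. It searches a grid of constant predictions and reports which constant each metric prefers.

```python
import numpy as np

# Hypothetical targets with one large outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

# Score every constant prediction on a fine grid (step 0.01).
candidates = np.linspace(0.0, 100.0, 10001)
best_for_mae = candidates[np.argmin([mae(y, c) for c in candidates])]
best_for_mse = candidates[np.argmin([mse(y, c) for c in candidates])]

print("best constant under MAE:", best_for_mae, "median:", np.median(y))
print("best constant under MSE:", best_for_mse, "mean:", np.mean(y))
```

The outlier pulls the MSE-optimal constant up to the mean (22), while the MAE-optimal constant stays at the median (3).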
RMSE (Root Mean Squared Error) = sqrt((1/n) * sum (y - y_hat)²)
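Quadratic penalty, so the score is dominated by the largest errors: one error of 10 units contributes as much to the squared sum as 100 errors of 1 unit. Units match the target. Use when large errors are especially costly, e.g., demand forecasting where a 10x overstock is far worse than a 2x overstock.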
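To see how the largest errors dominate RMSE, here is a small sketch, assuming NumPy and two invented error patterns with the same total absolute error. The helper names are illustrative only.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.zeros(10)

# Both prediction vectors have a total absolute error of 10.
spread_out = np.full(10, 1.0)                 # ten errors of size 1
one_big_miss = np.array([10.0] + [0.0] * 9)   # a single error of size 10

mae_spread, rmse_spread = mae(y_true, spread_out), rmse(y_true, spread_out)
mae_big, rmse_big = mae(y_true, one_big_miss), rmse(y_true, one_big_miss)

print("spread out:   MAE", mae_spread, "RMSE", rmse_spread)
print("one big miss: MAE", mae_big, "RMSE", rmse_big)
```

MAE cannot tell the two cases apart (both give 1.0), whereas RMSE rises from 1.0 to sqrt(10), about 3.16, when the error is concentrated in one point.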
MAPE (Mean Absolute Percentage Error) = (100/n) * sum |(y - y_hat) / y|
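Scale-independent, making it useful for comparing models across different targets. Undefined when y = 0 and unstable when y is close to zero. Asymmetric: for non-negative predictions, an under-prediction can cost at most 100%, while an over-prediction is unbounded, so MAPE tends to favour models that under-predict. Use with caution on targets that can be zero or near-zero.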
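The behaviour of MAPE at the extremes is easier to see with numbers. This is a minimal sketch, assuming NumPy, non-negative predictions, and a hypothetical `mape` helper that refuses zero targets rather than returning infinity (a design choice made here for illustration, not a standard).

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Raises ValueError if any true value is 0, where MAPE is undefined.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        raise ValueError("MAPE is undefined when a true value is 0")
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# True value 100: the worst non-negative under-prediction is 0.
worst_under = mape([100.0], [0.0])    # 100%
over_by_5x = mape([100.0], [500.0])   # 400%, and it keeps growing

# Near-zero target: a modest absolute miss becomes a huge percentage.
near_zero = mape([0.01], [1.0])       # 9900%

print(worst_under, over_by_5x, near_zero)
```

With non-negative predictions, under-prediction is capped at 100% while over-prediction is not, and a target of 0.01 turns an absolute miss of 0.99 into a 9900% error.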
R² (Coefficient of Determination) = 1 - SS_res / SS_tot
Where SS_res = sum (y - y_hat)² and SS_tot = sum (y - y_bar)².
R² is the fraction of variance in the target explained by the model. R² = 1 is a perfect fit; R² = 0 means the model is no better than always predicting the mean; R² can be negative (meaning the model is worse than predicting the mean).
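The three reference points for R² (1, 0, and negative) can be reproduced directly from the definition. The sketch below assumes NumPy and toy data; `r_squared` is a hypothetical helper written from the formula above, not a library function.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, computed from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

perfect = r_squared(y, y)                            # exact predictions
mean_only = r_squared(y, np.full_like(y, y.mean()))  # always predict the mean
reversed_order = r_squared(y, y[::-1])               # systematically wrong

print(perfect, mean_only, reversed_order)
```

Here the perfect fit scores 1, the mean-only baseline scores 0, and the reversed predictions score -3 because SS_res (40) is four times SS_tot (10).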
Choosing between RMSE and MAE
| Situation | Prefer |
|---|---|
| Outliers are common and should not drive evaluation | MAE |
| Large errors are disproportionately costly | RMSE |
| Communicating error to non-technical stakeholders | MAE (intuitive units) |
| Training loss for gradient-based models | MSE (differentiable everywhere) |
| Comparing across datasets with different target scales | MAPE or normalised variants |
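The choice of metric can change which model wins. As a rough illustration, assuming NumPy and invented residuals for two hypothetical models A and B, the sketch below ranks them under both metrics.

```python
import numpy as np

def mae(errors):
    """Mean absolute error from residuals (y - y_hat)."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean squared error from residuals (y - y_hat)."""
    return np.sqrt(np.mean(np.square(errors)))

# Hypothetical residuals for two models on the same five points.
residuals = {
    "A": np.array([3.0, -3.0, 3.0, -3.0, 3.0]),  # consistently mediocre
    "B": np.array([1.0, -1.0, 1.0, -1.0, 9.0]),  # mostly good, one big miss
}

scores = {name: (mae(e), rmse(e)) for name, e in residuals.items()}
best_by_mae = min(scores, key=lambda name: scores[name][0])
best_by_rmse = min(scores, key=lambda name: scores[name][1])

for name, (m, r) in scores.items():
    print(f"model {name}: MAE={m:.2f} RMSE={r:.2f}")
print("preferred by MAE:", best_by_mae, "| preferred by RMSE:", best_by_rmse)
```

MAE prefers B (2.6 versus 3.0) because its typical error is smaller, while RMSE prefers A (3.0 versus about 4.12) because B's single large miss is penalised quadratically. Which ranking is right depends on how costly that one miss is in the application.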
R² gotchas
- Adding a feature to an OLS linear regression, even pure random noise, cannot decrease R² on the training data. Use adjusted R² when comparing models with different numbers of features.
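- A high R² does not guarantee small absolute errors. On the same data, RMSE = std(y) * sqrt(1 - R²), so if the target spread is large (e.g., house prices with a standard deviation of roughly $224,000), R² = 0.95 still corresponds to an RMSE of about $50,000.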
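- The "fraction of variance explained" reading relies on the OLS decomposition SS_tot = SS_reg + SS_res, which holds for a linear model with an intercept evaluated on its training data. For non-linear models, or on held-out data, R² is still a comparison against the mean baseline but is no longer strictly a variance fraction. It is also harder to interpret when the target distribution is highly skewed, because a few extreme values can dominate SS_tot.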
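The noise-feature gotcha can be reproduced with a few lines. This is a sketch under stated assumptions: synthetic data, an OLS fit with intercept via `np.linalg.lstsq`, and the usual adjusted R² formula 1 - (1 - R²)(n - 1)/(n - p - 1). The helpers `fit_r2` and `adjusted_r2` are hypothetical names introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)  # one real signal plus noise

def fit_r2(X, y):
    """Training R² of an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def adjusted_r2(r2, n, p):
    """Adjusted R² for n samples and p features (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

X_base = x.reshape(-1, 1)
X_noisy = np.column_stack([x, rng.normal(size=(n, 5))])  # add 5 pure-noise columns

r2_base, r2_noisy = fit_r2(X_base, y), fit_r2(X_noisy, y)
adj_base = adjusted_r2(r2_base, n, p=1)
adj_noisy = adjusted_r2(r2_noisy, n, p=6)

print(f"R²:          base={r2_base:.4f}  with noise={r2_noisy:.4f}")
print(f"adjusted R²: base={adj_base:.4f}  with noise={adj_noisy:.4f}")
```

Training R² with the noise columns is never lower than the base fit, since the smaller model is a special case of the larger one. Adjusted R² applies a penalty that grows with the number of features, so it is always below the raw R² and will often (though not necessarily) fall when the added features carry no signal.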