datarekha

Convexity: why some losses are easy and others aren't

"Convex loss" is the quiet reason classical ML just works and deep nets need babysitting. Convexity is the guarantee that downhill always leads home — and you can test for it with the Hessian.

8 min read Intermediate Math for ML Lesson 16 of 30

What you'll learn

  • What makes a set and a function convex (the chord-above-the-curve picture)
  • Why convexity is the dividing line: any local minimum is the global minimum
  • The Hessian test — convex iff the Hessian is positive semi-definite
  • Which ML losses are convex (linear, ridge, logistic, SVM) and which aren't (neural nets)
  • Why L2 regularization makes a problem more convex and better-behaved

When people say a loss is convex, they’re making a promise: every path downhill ends at the same bottom. That one guarantee is why linear and logistic regression train reliably from any starting point, while deep nets need careful initialization, learning-rate schedules, and a bit of luck.

What “convex” means

A convex set is one with no dents: the straight line between any two points in it stays inside. A convex function is shaped like a bowl — formally, the chord connecting any two points on the graph lies on or above the curve:

f(t·a + (1−t)·b) ≤ t·f(a) + (1−t)·f(b)     for all t in [0, 1]
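The inequality can be spot-checked numerically. The sketch below is illustrative only: `chord_violations` is a hypothetical helper, not part of any library, and the two test functions are arbitrary choices. It samples t in [0, 1] and counts the points where the curve rises above the chord.

```python
import numpy as np

def chord_violations(f, a, b, num_t=101):
    """Count sampled t values where f lies above the chord from a to b."""
    t = np.linspace(0.0, 1.0, num_t)
    curve = f(t * a + (1 - t) * b)       # left-hand side of the inequality
    chord = t * f(a) + (1 - t) * f(b)    # right-hand side of the inequality
    return int(np.sum(curve > chord + 1e-12))

square = lambda x: x ** 2        # convex
wavy = lambda x: np.sin(3 * x)   # not convex

violations_square = chord_violations(square, -2.0, 3.0)
violations_wavy = chord_violations(wavy, -2.0, 3.0)
print(violations_square, violations_wavy)
```

A zero count for one pair of endpoints is only evidence, not proof: convexity requires the inequality for every pair a, b. A single violation, however, is enough to rule convexity out.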

The consequence is everything: every local minimum of a convex function is also a global minimum. There is nowhere to get stuck.

Convex: every start lands at the bottom. Non-convex: the answer depends on where you began — the same algorithm gives different, sometimes worse, results.
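A small experiment makes the contrast concrete. This is a minimal sketch, assuming plain fixed-step gradient descent and two hand-picked 1D functions; the function choices, learning rate, and step count are illustrative assumptions.

```python
def gradient_descent(grad, x0, lr=0.01, steps=5000):
    """Plain fixed-step gradient descent on a 1D function."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Convex: f(x) = (x - 1)^2, single minimum at x = 1.
convex_grad = lambda x: 2.0 * (x - 1.0)
# Non-convex: f(x) = x^4 - 3x^2 + x, two separate valleys.
nonconvex_grad = lambda x: 4.0 * x**3 - 6.0 * x + 1.0

starts = (-2.0, 2.0)
convex_ends = [gradient_descent(convex_grad, s) for s in starts]
nonconvex_ends = [gradient_descent(nonconvex_grad, s) for s in starts]
print(convex_ends)      # both runs end at (approximately) the same point
print(nonconvex_ends)   # the two runs end in different valleys
```

On the convex function both starts reach x ≈ 1. On the non-convex one, the start at −2 settles in the left valley and the start at 2 in the right valley, which is the shallower of the two.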

The Hessian test

You don’t have to eyeball it. A twice-differentiable function is convex exactly when its Hessian is positive semi-definite (all eigenvalues ≥ 0) everywhere. In 1D that’s just f″(x) ≥ 0 — the curve only ever bends upward.
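As a rough illustration, the test can be run with NumPy on two quadratics whose Hessians are constant, so a single check covers every point. The helper name `is_psd` and the tolerance are assumptions made for this sketch.

```python
import numpy as np

def is_psd(hessian, tol=1e-10):
    """True if every eigenvalue of a symmetric matrix is >= 0 (up to tol)."""
    return bool(np.all(np.linalg.eigvalsh(hessian) >= -tol))

# f(x, y) = x^2 + x*y + y^2  ->  Hessian [[2, 1], [1, 2]], eigenvalues 1 and 3
bowl_hessian = np.array([[2.0, 1.0], [1.0, 2.0]])
# g(x, y) = x^2 - y^2        ->  Hessian [[2, 0], [0, -2]], eigenvalues 2 and -2
saddle_hessian = np.array([[2.0, 0.0], [0.0, -2.0]])

bowl_is_convex = is_psd(bowl_hessian)
saddle_is_convex = is_psd(saddle_hessian)
print(bowl_is_convex, saddle_is_convex)
```

For a function whose Hessian changes from point to point, the condition has to hold everywhere; checking sampled points can disprove convexity but cannot prove it.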

Which losses are convex?

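  • Convex (train reliably): linear regression (MSE), ridge & lasso, logistic regression, softmax/cross-entropy in the linear case, SVM hinge loss. For these, gradient descent with a suitable step size (subgradient or proximal steps where the loss has kinks, as with lasso and hinge loss) or a convex solver converges to the global optimum from any starting point.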
  • Non-convex (the hard, interesting world): anything with a hidden layer. Stacking non-linear activations destroys convexity, so neural nets have countless local minima and saddles. We don’t find the global optimum; we find a good enough one — and remarkably, in high dimensions, good-enough is usually plenty.
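For the linear-regression and ridge entries in the list above, the Hessian can be written down and inspected directly. The sketch below assumes the loss (1/n)‖Xw − y‖² + λ‖w‖², whose Hessian is (2/n)XᵀX + 2λI; the synthetic data, the duplicated feature, and λ = 0.1 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
# Duplicate the first feature so X^T X is singular (a flat direction in the loss).
X = np.column_stack([x1, x1, rng.normal(size=n)])

lam = 0.1
H_mse = (2.0 / n) * X.T @ X                          # Hessian of plain MSE
H_ridge = H_mse + 2.0 * lam * np.eye(X.shape[1])     # Hessian with the L2 penalty

min_eig_mse = np.linalg.eigvalsh(H_mse).min()
min_eig_ridge = np.linalg.eigvalsh(H_ridge).min()
print(min_eig_mse, min_eig_ridge)
```

Both Hessians are positive semi-definite, so both losses are convex. The L2 term lifts every eigenvalue by 2λ, turning the flat direction (eigenvalue about 0) into one with curvature 2λ, which gives a unique minimizer and better conditioning.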

Quick check

Q1. A loss is convex. You run gradient descent twice from two very different starting points. What can you say about the results?
Q2. How do you test whether a twice-differentiable function is convex?
Q3. Why does adding λ‖w‖² (L2) make optimization better-behaved beyond reducing overfitting?

Practice this in an interview

What loss function does logistic regression optimize, and why is it convex?

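Logistic regression minimizes binary cross-entropy (log-loss), which is the negative log-likelihood of Bernoulli labels whose success probability is the sigmoid of a linear score. The Hessian of log-loss with respect to the weights is positive semi-definite everywhere, so the loss surface is convex and every local minimum is global. The minimizer is unique only when the Hessian is positive definite, for example with full-column-rank features or added L2 regularization.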
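The positive semi-definite claim can be checked numerically. This sketch assumes the standard closed form for the mean log-loss Hessian, H = (1/n)·Xᵀ·diag(p(1 − p))·X with p = sigmoid(Xw); the helper `logloss_hessian` and the random data are hypothetical.

```python
import numpy as np

def logloss_hessian(X, w):
    """Hessian of mean binary cross-entropy for a linear-sigmoid model."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
    s = p * (1.0 - p)                     # per-example curvature weights, all >= 0
    return (X.T * s) @ X / X.shape[0]

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))

# Smallest Hessian eigenvalue at several random weight vectors.
min_eigs = [
    np.linalg.eigvalsh(logloss_hessian(X, rng.normal(size=3) * 3.0)).min()
    for _ in range(20)
]
print(min(min_eigs))
```

Because each weight p(1 − p) is non-negative, vᵀHv is a sum of non-negative terms for any vector v, which is why no eigenvalue can be negative regardless of w.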

Why use cross-entropy loss instead of MSE for classification?

MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
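The gradient difference shows up in a one-example calculation. The sketch assumes a single sigmoid output p = sigmoid(z), squared error (p − y)², and binary cross-entropy; the logit z = −8 with true label y = 1 is an arbitrary "confidently wrong" case.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z, y = -8.0, 1.0          # model is confidently wrong: p is close to 0, label is 1
p = sigmoid(z)

# d/dz of (p - y)^2 picks up the sigmoid derivative p * (1 - p), which is tiny here.
mse_grad = 2.0 * (p - y) * p * (1.0 - p)
# d/dz of binary cross-entropy simplifies to p - y: no saturating factor.
ce_grad = p - y

print(mse_grad, ce_grad)
```

The cross-entropy gradient stays close to −1 while the squared-error gradient is smaller than 0.001 in magnitude, so the squared-error update barely moves the weights exactly when the model is most wrong.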

What is log loss and why does it penalise confident wrong predictions more than uncertain ones?

Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log-likelihood of the correct class. It penalises confident wrong predictions severely because −log(p) grows without bound as the probability p assigned to the correct class approaches zero: predicting 0.99 for the wrong class leaves p = 0.01 and costs about 4.61, roughly 5x the 0.92 cost of predicting 0.6 for the wrong class (p = 0.4). A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
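The penalties can be computed directly from the definition. A minimal sketch, using the natural log and taking p to be the probability assigned to the correct class:

```python
import math

def log_loss(p_correct):
    """Per-example log loss given the probability assigned to the true class."""
    return -math.log(p_correct)

confident_wrong = log_loss(0.01)   # 0.99 on the wrong class
unsure_wrong = log_loss(0.40)      # 0.60 on the wrong class
coin_flip = log_loss(0.50)         # always predicting 0.5

ratio = confident_wrong / unsure_wrong
print(confident_wrong, unsure_wrong, ratio, coin_flip)
```

The ratio between the two wrong predictions comes out at about 5, and the gap keeps widening without limit as the probability on the correct class shrinks toward zero.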

Why does sigmoid saturation cause vanishing gradients, and why is tanh only a partial fix?

Sigmoid's derivative peaks at 0.25 and approaches zero in both tails, so the product of per-layer derivatives in backpropagation shrinks exponentially with depth. Tanh's derivative peaks at 1 and its output is zero-centered, which helps keep weight updates balanced, but it still saturates for large-magnitude inputs, where the gradient again shrinks to near zero.
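These numbers are easy to reproduce. The sketch below evaluates both derivatives at 0 and at an input of 5 (an arbitrary "saturated" value), plus the best-case product of ten sigmoid derivatives as a stand-in for a ten-layer chain; it ignores the weight matrices that also enter the real backpropagation product.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

peak_sigmoid = sigmoid_grad(0.0)    # maximum of the sigmoid derivative
peak_tanh = tanh_grad(0.0)          # maximum of the tanh derivative
tail_sigmoid = sigmoid_grad(5.0)    # saturated region
tail_tanh = tanh_grad(5.0)          # saturated region

# Best case for ten stacked sigmoids: every layer sits exactly at its peak.
ten_layer_best_case = peak_sigmoid ** 10

print(peak_sigmoid, peak_tanh, tail_sigmoid, tail_tanh, ten_layer_best_case)
```

Even in the best case the ten-layer sigmoid product is below one in a million, while tanh avoids that shrinkage only as long as its inputs stay near zero.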
