What loss function does logistic regression optimize, and why is it convex?
Logistic regression minimizes binary cross-entropy (log-loss), the negative log-likelihood of Bernoulli-distributed labels whose success probability is the sigmoid of a linear predictor. The Hessian of the log-loss is positive semi-definite everywhere, so the loss surface is convex and every local minimum is a global minimum. The minimizer is unique when the design matrix has full column rank and the classes are not perfectly separable.
How to think about it
Deriving the loss from maximum likelihood:
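Each observation yᵢ ∈ {0,1} follows a Bernoulli distribution with success probability pᵢ = σ(xᵢβ), where xᵢ is the i-th row of X and σ(z) = 1/(1 + e^(-z)) is the sigmoid. Assuming independent observations, the likelihood is: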
L(β) = Π pᵢ^yᵢ (1 - pᵢ)^(1-yᵢ)
Taking the negative log (to turn maximization into minimization):
NLL = -Σ [yᵢ log(pᵢ) + (1-yᵢ) log(1-pᵢ)]
This is binary cross-entropy. Each term penalizes wrong predictions on a log scale — predicting 0.01 when the true label is 1 incurs a loss of -log(0.01) ≈ 4.6, a steep penalty.
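The penalty quoted above can be checked numerically. This is a minimal sketch, not a reference implementation: the function name binary_cross_entropy and the clipping tolerance eps are assumptions added for illustration, and the loss is summed (not averaged) to match the NLL formula.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Summed negative log-likelihood of labels y under probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0); eps is an assumed tolerance
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# True label is 1 in both cases; only the predicted probability differs.
loss_wrong = binary_cross_entropy(np.array([1.0]), np.array([0.01]))
loss_right = binary_cross_entropy(np.array([1.0]), np.array([0.99]))
print(loss_wrong, loss_right)
```

The confidently wrong prediction costs roughly 4.6, while the confidently right one costs about 0.01, which is the asymmetry the log scale produces.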
Why is it convex?
The Hessian of the NLL with respect to β is:
H = XᵀSX
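where S is a diagonal matrix with entries pᵢ(1-pᵢ) > 0. For any vector v, vᵀHv = (Xv)ᵀS(Xv) = Σ pᵢ(1-pᵢ)(xᵢv)² ≥ 0, so H is positive semi-definite, and positive definite when X has full column rank. A twice-differentiable function whose Hessian is PSD everywhere is convex, so any local minimum is also a global minimum. Two caveats: if X is rank deficient the minimizer is not unique, and if the classes are perfectly separable the loss only approaches its infimum as ‖β‖ → ∞, so no finite minimizer exists.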
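As a sanity check on the PSD claim, the Hessian can be built for random data and its eigenvalues inspected. This is a sketch under stated assumptions: the data is synthetic, and nll_hessian is a hypothetical helper name, not part of any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_hessian(X, beta):
    """Hessian of the summed NLL: X^T S X with S = diag(p_i * (1 - p_i))."""
    p = sigmoid(X @ beta)
    s = p * (1 - p)                 # diagonal of S, kept as a vector
    return X.T @ (s[:, None] * X)   # scale row i of X by s_i, then multiply by X^T

rng = np.random.default_rng(0)      # synthetic data, for illustration only
X = rng.normal(size=(50, 3))
beta = rng.normal(size=3)

H = nll_hessian(X, beta)
eigvals = np.linalg.eigvalsh(H)     # eigenvalues of a symmetric matrix
v = rng.normal(size=3)
quad_form = v @ H @ v               # v^T H v for an arbitrary direction v
print(eigvals, quad_form)
```

All eigenvalues should be non-negative up to floating-point error, and the quadratic form should be non-negative for any direction v, at any β.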
Gradient for gradient descent:
∇_β NLL = Xᵀ(p - y)
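This has the same form as the OLS gradient Xᵀ(Xβ - y), the gradient of ½‖Xβ - y‖²: the residuals (predictions minus targets) back-projected through X. Only the prediction changes, from Xβ to p = σ(Xβ).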
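import numpy as np

def sigmoid(z):
    # Logistic function; np.exp(-z) can overflow (with a warning) for very negative z
    return 1 / (1 + np.exp(-z))

def log_loss_grad(X, y, beta):
    p = sigmoid(X @ beta)  # predicted probabilities
    # Dividing by n gives the gradient of the mean NLL, i.e. Xᵀ(p - y) / n
    return X.T @ (p - y) / len(y)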
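To show how this gradient might be used, here is a minimal gradient-descent sketch on synthetic data. The data, true_beta, the learning rate of 0.5, and the 500 iterations are all illustrative assumptions rather than recommended settings, and mean_nll is a hypothetical helper that evaluates the averaged loss in a numerically stable way.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_nll(X, y, beta):
    z = X @ beta
    # Per-sample NLL equals log(1 + e^z) - y*z; logaddexp avoids overflow.
    return np.mean(np.logaddexp(0.0, z) - y * z)

def mean_nll_grad(X, y, beta):
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])  # intercept + 2 features
true_beta = np.array([-0.5, 2.0, -1.0])                         # assumed, for simulation
y = (rng.random(200) < sigmoid(X @ true_beta)).astype(float)    # noisy Bernoulli labels

beta = np.zeros(3)
learning_rate = 0.5
losses = []
for _ in range(500):
    losses.append(mean_nll(X, y, beta))
    beta -= learning_rate * mean_nll_grad(X, y, beta)

grad_norm = np.linalg.norm(mean_nll_grad(X, y, beta))
print(losses[0], losses[-1], grad_norm)
```

Because the loss is convex, a sufficiently small fixed step size should decrease it at every iteration, and the gradient norm should shrink toward zero as β approaches the minimizer.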
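With L2 regularization (penalty λ‖β‖² with λ > 0), the Hessian becomes XᵀSX + 2λI, which is strictly positive definite. The objective is then strictly convex, so a unique minimizer exists even when X is rank deficient or the classes are perfectly separable.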
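The effect of the 2λI term can be seen on a deliberately rank-deficient design matrix. This sketch assumes the penalty is λ‖β‖² applied to every coefficient and the NLL is summed; ridge_nll_hessian, the duplicated column, and lam = 0.1 are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ridge_nll_hessian(X, beta, lam):
    """Hessian of (summed NLL + lam * ||beta||^2): X^T S X + 2 * lam * I."""
    p = sigmoid(X @ beta)
    s = p * (1 - p)
    return X.T @ (s[:, None] * X) + 2 * lam * np.eye(X.shape[1])

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
X = np.column_stack([X, X[:, 0]])   # duplicated column makes X rank deficient
beta = rng.normal(size=3)
lam = 0.1

min_eig_plain = np.linalg.eigvalsh(ridge_nll_hessian(X, beta, 0.0)).min()
min_eig_ridge = np.linalg.eigvalsh(ridge_nll_hessian(X, beta, lam)).min()
print(min_eig_plain, min_eig_ridge)
```

Without the penalty the smallest eigenvalue is zero up to rounding, reflecting a flat direction in the loss. With the penalty every eigenvalue is shifted up by 2λ, so the smallest one is at least 2λ.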