datarekha
Machine Learning · Hard · Asked at Google · Asked at DeepMind · Asked at Jane Street

What loss function does logistic regression optimize, and why is it convex?

The short answer

Logistic regression minimizes binary cross-entropy (log-loss), which is the negative log-likelihood of Bernoulli labels whose success probability is the sigmoid of a linear predictor. The Hessian of log-loss is positive semi-definite everywhere, so the loss is convex and every local minimum is a global minimum. A finite minimizer exists and is unique when X has full column rank and the classes are not linearly separable.

How to think about it

Deriving the loss from maximum likelihood:

Each observation yᵢ ∈ {0,1} is modeled as an independent Bernoulli draw with probability pᵢ = σ(xᵢβ), where σ(z) = 1/(1 + e^(-z)) and xᵢ is the i-th row of X. The likelihood is:

L(β) = Π pᵢ^yᵢ (1 - pᵢ)^(1-yᵢ)

Taking the negative log (to turn maximization into minimization):

NLL = -Σ [yᵢ log(pᵢ) + (1-yᵢ) log(1-pᵢ)]

This is binary cross-entropy. Each term penalizes confident wrong predictions on a log scale: predicting 0.01 when the true label is 1 incurs a loss of -log(0.01) ≈ 4.6, and the penalty grows without bound as the predicted probability approaches 0.
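
The loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `binary_cross_entropy` is a hypothetical choice, and it uses the algebraically equivalent per-observation form log(1 + e^z) - y·z with z = xᵢβ, which avoids taking the log of a probability that has rounded to 0 or 1.

```python
import numpy as np

def binary_cross_entropy(X, y, beta):
    """Summed NLL: -sum[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(X @ beta)."""
    z = X @ beta
    # np.logaddexp(0, z) computes log(1 + e^z) without overflow.
    # Per observation: NLL_i = log(1 + e^{z_i}) - y_i * z_i
    return np.sum(np.logaddexp(0.0, z) - y * z)

# Single-observation check against the figure quoted above:
# true label 1, predicted probability 0.01.
p = 0.01
z = np.log(p / (1 - p))  # logit, so that sigmoid(z) = 0.01
loss = binary_cross_entropy(np.array([[1.0]]), np.array([1.0]), np.array([z]))
print(round(loss, 2))
```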

Why is it convex?

The Hessian of the NLL with respect to β is:

H = XᵀSX

where S is a diagonal matrix with entries pᵢ(1-pᵢ) > 0. For any vector v, vᵀHv = Σ pᵢ(1-pᵢ)(xᵢv)² ≥ 0, so H is positive semi-definite, and positive definite when X has full column rank. A twice-differentiable function with a PSD Hessian is convex, so every local minimum is a global minimum.
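
The claim about H = XᵀSX can be checked numerically. The sketch below uses synthetic random data (an assumption for illustration only) and a hypothetical helper `nll_hessian`. It computes the smallest eigenvalue of H once for a full-rank X and once for an X with a duplicated column, where H is still positive semi-definite but no longer positive definite.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_hessian(X, beta):
    """Hessian X^T S X of the summed NLL, with S = diag(p * (1 - p))."""
    p = sigmoid(X @ beta)
    s = p * (1.0 - p)               # diagonal entries of S
    return X.T @ (s[:, None] * X)   # avoids building the n x n matrix S

rng = np.random.default_rng(0)      # synthetic data, illustration only
X = rng.normal(size=(50, 3))
beta = rng.normal(size=3)

H = nll_hessian(X, beta)
min_eig = np.linalg.eigvalsh(H).min()          # positive: full column rank

# Duplicating a column makes X rank-deficient, so H gains a zero eigenvalue.
X_def = np.column_stack([X, X[:, 0]])
H_def = nll_hessian(X_def, np.zeros(4))
min_eig_def = np.linalg.eigvalsh(H_def).min()  # zero up to rounding error
```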

Gradient for gradient descent:

∇_β NLL = Xᵀ(p - y)

This has the same form as the OLS gradient Xᵀ(Xβ - y), with the fitted probabilities p in place of the fitted values Xβ: the residuals p - y are projected back through Xᵀ.
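
One way to gain confidence in the gradient formula is a finite-difference check. The sketch below is only an illustration on synthetic data. The names `nll` and `nll_grad` are hypothetical, and the step size `eps` is an arbitrary small value. It compares Xᵀ(p - y) against central differences of the summed NLL.

```python
import numpy as np

def nll(X, y, beta):
    """Summed negative log-likelihood, written as log(1 + e^z) - y*z."""
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)

def nll_grad(X, y, beta):
    """Analytic gradient X^T (p - y) of the summed NLL."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y)

rng = np.random.default_rng(1)      # synthetic data, illustration only
X = rng.normal(size=(20, 4))
y = rng.integers(0, 2, size=20).astype(float)
beta = rng.normal(size=4)

eps = 1e-6
# Central difference along each coordinate direction e.
numeric = np.array([
    (nll(X, y, beta + eps * e) - nll(X, y, beta - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
max_abs_diff = np.max(np.abs(numeric - nll_grad(X, y, beta)))
```

If the analytic gradient is right, `max_abs_diff` should be tiny relative to the gradient entries themselves.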

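```python
import numpy as np

def sigmoid(z):
    # Can emit an overflow warning for very negative z;
    # scipy.special.expit is a numerically stable alternative.
    return 1 / (1 + np.exp(-z))

def log_loss_grad(X, y, beta):
    p = sigmoid(X @ beta)  # predicted probabilities
    # Mean gradient Xᵀ(p - y) / n; the 1/n factor only rescales the step size.
    return X.T @ (p - y) / len(y)
```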
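
A short usage sketch of `log_loss_grad` in a plain gradient descent loop follows. It repeats the two functions so that it runs on its own. The synthetic data, the coefficients in `true_beta`, the fixed step size, and the iteration count are all assumptions chosen for illustration, and `mean_log_loss` is a hypothetical helper used only to track progress.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss_grad(X, y, beta):
    p = sigmoid(X @ beta)
    return X.T @ (p - y) / len(y)

def mean_log_loss(X, y, beta):
    """Mean binary cross-entropy, written as log(1 + e^z) - y*z."""
    z = X @ beta
    return np.mean(np.logaddexp(0.0, z) - y * z)

rng = np.random.default_rng(0)
# Intercept column plus two synthetic features.
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([-0.5, 2.0, -1.0])   # hypothetical generating coefficients
y = (rng.random(200) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(3)
step = 0.5                                # illustrative fixed step size
losses = []
for _ in range(500):
    losses.append(mean_log_loss(X, y, beta))
    beta -= step * log_loss_grad(X, y, beta)

grad_norm = np.linalg.norm(log_loss_grad(X, y, beta))
```

Because the loss is convex and the step is small relative to the curvature of this data, the recorded losses decrease at every iteration and the gradient norm shrinks toward zero.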

With an L2 penalty λ‖β‖² (λ > 0), the Hessian becomes XᵀSX + 2λI, which is strictly positive definite. The objective is then strictly convex, so the minimizer is unique, and it stays finite even when the classes are linearly separable.
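
The effect of the 2λI term can be seen on a deliberately rank-deficient design matrix. The sketch below assumes the penalty is λ‖β‖² added to the summed NLL. The helper `ridge_nll_hessian`, the synthetic data, and the value of `lam` are hypothetical choices for illustration.

```python
import numpy as np

def ridge_nll_hessian(X, beta, lam):
    """Hessian of summed NLL + lam * ||beta||^2, i.e. X^T S X + 2*lam*I."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    s = p * (1.0 - p)
    return X.T @ (s[:, None] * X) + 2.0 * lam * np.eye(X.shape[1])

rng = np.random.default_rng(0)            # synthetic data, illustration only
base = rng.normal(size=(30, 2))
X = np.column_stack([base, base[:, 0]])   # duplicated column: rank-deficient X

lam = 0.1
H = ridge_nll_hessian(X, np.zeros(3), lam)
min_eig = np.linalg.eigvalsh(H).min()     # equals 2 * lam up to rounding
```

Without the penalty, the duplicated column would give H a zero eigenvalue. With it, every eigenvalue is at least 2λ.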
