Machine Learning · Hard · Asked at Google, DeepMind, Jane Street

What loss function does logistic regression optimize, and why is it convex?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

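Logistic regression minimizes binary cross-entropy (log-loss), the negative log-likelihood of Bernoulli-distributed labels whose success probability is the sigmoid of a linear predictor. The Hessian of log-loss is positive semi-definite everywhere, so the loss is convex and every local minimum is a global minimum. The minimizer is unique when X has full column rank, and a finite minimizer exists when the classes are not perfectly linearly separable.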

How to think about it

Deriving the loss from maximum likelihood:

Each observation yᵢ ∈ {0,1} follows a Bernoulli distribution with probability pᵢ = σ(xᵢβ), where σ(z) = 1 / (1 + e^(-z)) is the sigmoid. Assuming independent observations, the likelihood is:

L(β) = Π pᵢ^yᵢ (1 - pᵢ)^(1-yᵢ)

Taking the negative log (to turn maximization into minimization):

NLL = -Σ [yᵢ log(pᵢ) + (1-yᵢ) log(1-pᵢ)]

This is binary cross-entropy. Each term penalizes confident wrong predictions on a log scale: predicting 0.01 when the true label is 1 incurs a loss of -log(0.01) ≈ 4.6, and the penalty grows without bound as the predicted probability approaches 0.
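The arithmetic above can be checked with a minimal sketch. The function name `binary_cross_entropy`, the `eps` clipping safeguard, and the three sample predictions are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Per-observation log-loss. Clipping by eps is an added safeguard against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical labels and predicted probabilities
y = np.array([1, 1, 0])
p = np.array([0.01, 0.9, 0.01])

losses = binary_cross_entropy(y, p)

# Bernoulli likelihood of the same three observations, as in L(β)
likelihood = np.prod(p**y * (1 - p) ** (1 - y))

print(losses)                              # first entry is the confident wrong prediction
print(losses.sum(), -np.log(likelihood))   # summed loss equals the negative log-likelihood
```

The first entry is the confidently wrong prediction and dominates the total, while the two correct predictions contribute very little.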

Why is it convex?

The Hessian of the NLL with respect to β is:

H = XᵀSX

where S is a diagonal matrix with entries pᵢ(1-pᵢ) > 0. For any vector v, vᵀHv = Σ pᵢ(1-pᵢ)(xᵢv)² ≥ 0, so H is positive semi-definite, and positive definite when X has full column rank. A twice-differentiable function with a PSD Hessian is convex, so every local minimum is a global minimum.
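The PSD claim can be spot-checked numerically. This is a sketch on randomly generated data: the design matrix, the evaluation point `beta`, and the seed are arbitrary assumptions, and passing at one point is an illustration rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(0)    # arbitrary seed for reproducibility
X = rng.normal(size=(50, 3))      # hypothetical design matrix
beta = rng.normal(size=3)         # arbitrary point at which to evaluate H

p = 1 / (1 + np.exp(-(X @ beta)))
s = p * (1 - p)                   # diagonal entries of S
H = X.T @ (s[:, None] * X)        # XᵀSX without forming the n x n matrix S

eigvals = np.linalg.eigvalsh(H)   # H is symmetric, so eigvalsh applies

# The quadratic form vᵀHv equals the weighted sum of squares Σ sᵢ (xᵢv)²
v = rng.normal(size=3)
quad_form = v @ H @ v
weighted_sq = np.sum(s * (X @ v) ** 2)

print(eigvals.min() >= -1e-10)    # no negative eigenvalues beyond round-off
print(np.isclose(quad_form, weighted_sq))
```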

Gradient for gradient descent:

∇_β NLL = Xᵀ(p - y)

This has the same form as the OLS gradient Xᵀ(Xβ - y), up to a constant factor: the residuals (prediction minus target) projected back through Xᵀ.

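import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

def log_loss_grad(X, y, beta):
    # X: (n, d) design matrix, y: (n,) labels in {0, 1}, beta: (d,) coefficients
    p = sigmoid(X @ beta)
    # Xᵀ(p - y) is the gradient of the summed NLL; dividing by n gives the
    # gradient of the mean log-loss, so the step size does not depend on n
    return X.T @ (p - y) / len(y)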
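To show the gradient in use, here is a minimal gradient-descent sketch on synthetic data. The helper names `mean_log_loss` and `mean_log_loss_grad`, the step size 0.5, the iteration count, and the generating coefficients `true_beta` are all illustrative assumptions. The functions are redefined so the block runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def mean_log_loss(X, y, beta):
    p = sigmoid(X @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mean_log_loss_grad(X, y, beta):
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

# Synthetic data: labels drawn from the Bernoulli model with hypothetical coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_beta = np.array([1.5, -2.0])
y = (rng.random(200) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(2)
initial_loss = mean_log_loss(X, y, beta)
for _ in range(2000):
    beta -= 0.5 * mean_log_loss_grad(X, y, beta)   # fixed step size, chosen by hand

final_loss = mean_log_loss(X, y, beta)
grad_norm = np.linalg.norm(mean_log_loss_grad(X, y, beta))
print(initial_loss, final_loss, grad_norm)
```

Because the loss is convex, a near-zero gradient norm at the end indicates the iterate is close to a global minimizer, not merely a stationary point.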

With L2 regularization (penalty λ‖β‖² with λ > 0), the Hessian becomes XᵀSX + 2λI, which is strictly positive definite. The objective is then strictly convex, so the minimizer is unique, and it exists even when the classes are linearly separable.
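The effect of the penalty on the Hessian can be seen in a small sketch with a deliberately rank-deficient design matrix. The data, the value `lam = 0.1`, and the penalty convention λ‖β‖² (which gives the 2λI term) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 1))
X = np.hstack([x, 2 * x])            # second column duplicates the first: rank 1
beta = np.zeros(2)
lam = 0.1                            # hypothetical penalty strength

p = 1 / (1 + np.exp(-(X @ beta)))
s = p * (1 - p)
H = X.T @ (s[:, None] * X)           # unregularized Hessian XᵀSX
H_reg = H + 2 * lam * np.eye(2)      # Hessian with penalty lam * ||beta||^2

min_eig = np.linalg.eigvalsh(H).min()
min_eig_reg = np.linalg.eigvalsh(H_reg).min()
print(min_eig, min_eig_reg)
```

Without the penalty the smallest eigenvalue is zero up to round-off, so the loss is flat along one direction and the minimizer is not unique. Adding 2λI lifts every eigenvalue by 2λ.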
