Constrained optimization & Lagrange multipliers
Most ML optimization comes with strings attached — keep the weights small, keep the probabilities summing to one, classify everything correctly. Lagrange multipliers turn "optimize subject to a rule" into a single elegant equation, and they're the math behind SVMs, regularization, and PCA.
What you'll learn
- Why so much of ML is constrained optimization, not free optimization
- The geometric heart: at the optimum, ∇f and ∇g are parallel
- The Lagrangian and how it turns a constrained problem into an unconstrained one
- A first look at KKT — what changes when constraints are inequalities
- How this powers SVMs (support vectors), regularization (λ), and PCA (eigenvalues)
Before you start
Pure “minimize this loss” is rare. Real problems come with rules:
- Minimize loss subject to ‖w‖² ≤ c (don’t let weights explode).
- Maximize the margin subject to every point classified correctly (SVM).
- Maximize likelihood subject to the probabilities summing to 1.
- Maximize variance subject to the direction being a unit vector (PCA).
Lagrange multipliers are the tool that handles all of these with one idea.
The geometric insight
You’re minimizing f but you’re locked onto the constraint surface g = 0.
At the best allowed point, you can’t lower f any further without stepping
off the constraint. That happens exactly when the level curve of f is
tangent to the constraint — and tangency means the two gradients point
along the same line:
∇f = λ ∇g
λ (the Lagrange multiplier) is just the scaling factor between them.
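To make the tangency picture concrete, here is a minimal numerical sketch. The toy problem (minimize f(x, y) = x + y on the unit circle) and every name in the code are illustrative assumptions, not taken from the lesson. It walks around the constraint, picks the point with the smallest f, and checks that the two gradients are parallel there.

```python
import numpy as np

# Hypothetical toy problem (an assumption for illustration):
#   minimize f(x, y) = x + y  subject to  g(x, y) = x^2 + y^2 - 1 = 0
def grad_f(p):
    return np.array([1.0, 1.0])

def grad_g(p):
    return 2.0 * p

# Walk around the constraint circle and keep the point with the smallest f.
theta = np.linspace(0.0, 2.0 * np.pi, 100_001)
points = np.stack([np.cos(theta), np.sin(theta)], axis=1)
f_values = points[:, 0] + points[:, 1]
best = points[np.argmin(f_values)]

gf, gg = grad_f(best), grad_g(best)

# 2-D cross product: zero exactly when the two gradients are parallel.
cross = gf[0] * gg[1] - gf[1] * gg[0]

# Scaling factor lambda in grad_f = lambda * grad_g (least-squares estimate).
lam = (gf @ gg) / (gg @ gg)

print("best point on the circle:", best)
print("cross product of gradients:", cross)
print("lambda:", lam)
```

The best point lands near (−1/√2, −1/√2), the cross product is numerically zero, and λ comes out negative here: the sign of λ carries no special meaning for an equality constraint.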
The Lagrangian: one function to rule them
Bundle the objective and the constraint into a single Lagrangian:
L(x, λ) = f(x) − λ · g(x)
Set all its partial derivatives to zero. The x-derivatives give
∇f = λ∇g (tangency); the λ-derivative gives back g(x) = 0 (stay on the
constraint). A constrained problem in x became an unconstrained
stationary-point problem in x and λ together.
In two dimensions you can check this directly: at the solution the cross
product of ∇f and ∇g is zero, so the gradients are parallel, and λ is the
constant linking them.
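A small worked example, sketched under assumptions: the problem below (minimize x² + y² on the line x + y = 1) is chosen only because its stationarity conditions are linear, so a single linear solve finds x, y, and λ together. It uses the same sign convention as above, L = f − λ · g.

```python
import numpy as np

# Hypothetical toy problem (an assumption for illustration):
#   minimize f(x, y) = x^2 + y^2  subject to  g(x, y) = x + y - 1 = 0
# Lagrangian: L(x, y, lam) = x^2 + y^2 - lam * (x + y - 1)
# Setting every partial derivative to zero gives a linear system:
#   dL/dx   = 2x - lam     = 0
#   dL/dy   = 2y - lam     = 0
#   dL/dlam = -(x + y - 1) = 0
A = np.array([
    [2.0, 0.0, -1.0],
    [0.0, 2.0, -1.0],
    [1.0, 1.0, 0.0],
])
b = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(A, b)

# Check tangency at the solution: grad_f should equal lam * grad_g.
grad_f = np.array([2.0 * x, 2.0 * y])
grad_g = np.array([1.0, 1.0])
cross = grad_f[0] * grad_g[1] - grad_f[1] * grad_g[0]

print("x, y, lambda:", x, y, lam)
print("cross product of gradients:", cross)
```

The solve returns x = y = 0.5 with λ = 1, so ∇f = (1, 1) is exactly 1 · ∇g.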
Inequalities: a peek at KKT
When the constraint is g(x) ≤ 0 instead of = 0, the KKT conditions
generalize the idea. The sign convention now matters: for a minimization
written as L = f + λ · g (a plus sign, unlike the equality case above),
the key new rule is complementary slackness:
λ ≥ 0 and λ · g(x) = 0. In plain terms: either the constraint is
active (you’re pressed against it, g = 0, and typically λ > 0) or it’s slack
(you’re safely inside, so λ = 0 and it doesn’t matter). This single rule is
what defines an SVM’s support vectors.
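The two cases of complementary slackness can be seen in a one-dimensional sketch. The problem (minimize (x − a)² subject to x ≤ 1) and the helper name `solve_clipped_quadratic` are hypothetical, chosen because the KKT solution has a closed form. The code assumes the minimization convention L = f + λ · g, under which λ ≥ 0.

```python
def solve_clipped_quadratic(a):
    """Minimize (x - a)^2 subject to g(x) = x - 1 <= 0 using the KKT conditions.

    Assumed convention: L = f + lam * g, so stationarity reads
    2 * (x - a) + lam = 0 with lam >= 0.
    """
    # Slack case: the unconstrained minimizer a is feasible, so lam = 0.
    # Active case: the minimizer is pushed onto the boundary x = 1,
    # and stationarity gives lam = 2 * (a - 1) > 0.
    x = min(a, 1.0)
    lam = max(0.0, 2.0 * (a - 1.0))
    return x, lam


for a in (2.0, 0.0):
    x, lam = solve_clipped_quadratic(a)
    g = x - 1.0
    stationarity = 2.0 * (x - a) + lam
    status = "active" if g == 0.0 else "slack"
    print(f"a={a}: x={x}, lam={lam}, g={g}, lam*g={lam * g}, "
          f"stationarity={stationarity} ({status})")
```

With a = 2 the constraint is active (x = 1, λ = 2); with a = 0 it is slack (x = 0, λ = 0). In both cases λ · g(x) = 0.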
Where this lives in ML
- SVMs. Maximizing the margin under “classify everything correctly” is a constrained problem; its Lagrangian dual is what you actually solve. The points with non-zero multipliers are the support vectors — the only ones that matter.
- Regularization. “Minimize loss subject to ‖w‖² ≤ c” and “minimize loss + λ‖w‖²” describe the same solutions: for a convex loss, each budget c corresponds to some λ ≥ 0, and λ is literally the Lagrange multiplier of the constraint. That’s why ridge’s λ and a hard norm budget are two views of one idea.
- PCA. “Maximize variance vᵀΣv subject to ‖v‖ = 1” has a Lagrangian whose stationarity condition is Σv = λv, the eigenvalue equation. The principal directions are the Lagrange-stationary directions, and each eigenvalue λ is the variance along its direction.
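The PCA and regularization claims above can be checked numerically. This is a sketch on synthetic data; the data, the ridge strength `lam_ridge = 5.0`, and all variable names are assumptions made for illustration. The loss is taken to be the squared error ‖Xw − y‖², so the ridge objective is ‖Xw − y‖² + λ‖w‖².

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption): three features with very different spreads.
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])

# --- PCA: the top eigenvector is the Lagrange-stationary direction ---
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
v = eigvecs[:, -1]                          # unit-norm principal direction
lam_pca = eigvals[-1]                       # its multiplier / eigenvalue
variance_along_v = v @ Sigma @ v            # equals lam_pca

# --- Ridge: lambda is the multiplier of the norm-budget constraint ---
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
lam_ridge = 5.0
w = np.linalg.solve(X.T @ X + lam_ridge * np.eye(3), X.T @ y)
c = w @ w                                   # the budget this lambda corresponds to

# Stationarity for "minimize loss subject to ||w||^2 - c <= 0":
#   grad(loss) + lambda * grad(||w||^2 - c) = 0
grad_loss = 2.0 * X.T @ (X @ w - y)
grad_constraint = 2.0 * w
residual = grad_loss + lam_ridge * grad_constraint

print("Sigma v - lambda v:", Sigma @ v - lam_pca * v)
print("variance along v vs eigenvalue:", variance_along_v, lam_pca)
print("norm budget c:", c)
print("ridge stationarity residual:", residual)
```

Both residuals come out at floating-point noise: the principal direction satisfies Σv = λv, and the ridge solution sits exactly on the budget ‖w‖² = c with `lam_ridge` as its multiplier.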