What is the fundamental difference between L1 (Lasso) and L2 (Ridge) regularization, and when do you choose each?

L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and so performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks every weight toward zero but almost never sets one exactly to zero. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.
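To make the "exactly zero versus merely smaller" contrast concrete, here is a minimal sketch under a simplifying assumption: the feature matrix has orthonormal columns, so both penalised problems have closed-form solutions in terms of the ordinary least-squares weights. The weight values and the penalty strength are made up for illustration.

```python
import numpy as np

def ridge_orthonormal(w_ols, lam):
    """Minimiser of 0.5*||y - Xw||^2 + 0.5*lam*||w||^2, assuming X has orthonormal columns."""
    return w_ols / (1.0 + lam)

def lasso_orthonormal(w_ols, lam):
    """Minimiser of 0.5*||y - Xw||^2 + lam*||w||_1, assuming X has orthonormal columns.

    This is the soft-thresholding operator: shift every weight toward zero by lam
    and clip anything that would cross zero.
    """
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

# Hypothetical least-squares weights: two strong features, three weak ones.
w_ols = np.array([3.0, -1.5, 0.4, -0.2, 0.05])
lam = 0.5

w_ridge = ridge_orthonormal(w_ols, lam)
w_lasso = lasso_orthonormal(w_ols, lam)

print("ridge:", w_ridge.round(3), " non-zero:", np.count_nonzero(w_ridge))
print("lasso:", w_lasso.round(3), " non-zero:", np.count_nonzero(w_lasso))
```

In this special case ridge rescales every weight by the same factor and keeps all five, while lasso removes the three weak ones outright. With correlated features the closed forms no longer apply, but the qualitative behaviour is the same.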

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing cheaper models for easy requests, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

How do L1 and L2 regularization affect bias and variance, and when would you pick one over the other?

Both L1 and L2 add a penalty on coefficient size that increases bias slightly but reduces variance, combating overfitting. L2 (ridge) shrinks all coefficients smoothly and handles correlated features well; L1 (lasso) drives some coefficients exactly to zero, performing feature selection. Choose L1 when you want sparsity and interpretability, L2 when you want stability, and elastic net to get both.

How do you attribute and control ML spend across teams and models (FinOps for ML)?

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
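As a rough sketch of the unit-economics idea, the snippet below rolls tagged billing records up to cost per prediction for each (team, model) pair. The record layout, team names, model names, and all figures are hypothetical; a real pipeline would read them from a cloud billing export and serving metrics.

```python
from collections import defaultdict

# Hypothetical tagged records: (team, model, cost_usd, predictions_served)
records = [
    ("search", "ranker-v2", 120.0, 400_000),
    ("search", "ranker-v2", 80.0, 100_000),
    ("ads", "ctr-model", 300.0, 1_500_000),
]

def cost_per_prediction(records):
    """Sum cost and volume per (team, model) tag, then divide to get a unit cost."""
    totals = defaultdict(lambda: [0.0, 0])
    for team, model, cost, n_predictions in records:
        totals[(team, model)][0] += cost
        totals[(team, model)][1] += n_predictions
    # Skip tags with no traffic to avoid dividing by zero.
    return {tag: cost / n for tag, (cost, n) in totals.items() if n > 0}

unit_costs = cost_per_prediction(records)
for (team, model), cost in sorted(unit_costs.items()):
    print(f"{team}/{model}: ${cost * 1000:.3f} per 1k predictions")
```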

Constrained optimization & Lagrange multipliers — Math for ML

Most ML optimization comes with strings attached — keep the weights small, keep the probabilities summing to one, classify everything correctly. Lagrange multipliers turn "optimize subject to a rule" into a single elegant equation, and they're the math behind SVMs, regularization, and PCA.

9 min read · Advanced · Math for ML · Lesson 19 of 37

What you'll learn

Why so much of ML is constrained optimization, not free optimization

The geometric heart: at the optimum, ∇f and ∇g are parallel

The Lagrangian and how it turns a constrained problem into an unconstrained one

A first look at KKT — what changes when constraints are inequalities

How this powers SVMs (support vectors), regularization (λ), and PCA (eigenvalues)

The last lesson closed by noting that every method so far roamed freely, while real problems come fenced in: minimise the loss, yes — but subject to a rule. The rest of this course is full of them:

Minimise loss subject to ‖w‖² ≤ c (don’t let the weights explode).
Maximise the margin subject to every point classified correctly (SVM).
Maximise likelihood subject to the probabilities summing to 1.
Maximise variance subject to the direction being a unit vector (PCA).

Lagrange multipliers are the one idea that handles all of these — and, as you will see, the “price” they attach to a constraint turns out to be exactly the λ you already tune in ridge.

The geometric insight

You’re minimizing f but you’re locked onto the constraint surface g = 0. At the best allowed point, you can’t lower f any further without stepping off the constraint. That happens exactly when the level curve of f is tangent to the constraint — and tangency means the two gradients point along the same line:

∇f = λ ∇g

λ (the Lagrange multiplier) is just the scaling factor between them.

[Interactive figure: a target point and the constraint circle g = 0. The task is to stay on the circle while getting as close as possible to the target. The best spot is where the dashed level curve of the objective just touches the circle; there ∇f and ∇g point along the same line (λ ≈ −0.33 in the configuration shown). Wherever the target is moved, the optimum keeps the two gradients parallel, and that is the whole method.]

The Lagrangian: one function to rule them

Bundle the objective and the constraint into a single Lagrangian:

L(x, λ) = f(x) − λ · g(x)

Set all its partial derivatives to zero. The x-derivatives give ∇f = λ∇g (tangency); the λ-derivative gives back g(x) = 0 (stay on the constraint). A constrained problem in x became an unconstrained stationary-point problem in x and λ together.

import numpy as np

# Minimize f(x) = (x - T).(x - T) on the circle g(x) = ‖x‖² - R² = 0  (closest point on a circle)
T = np.array([2.0, 1.3]); R = 1.8

x_star = R * T / np.linalg.norm(T)            # the analytic optimum
grad_f = 2 * (x_star - T)                      # ∇f at the optimum
grad_g = 2 * x_star                            # ∇g (circle constraint)

# They must be parallel: grad_f = λ grad_g  ->  cross product ≈ 0
lam = grad_f[0] / grad_g[0]
print("optimum:", x_star.round(3))
print("∇f =", grad_f.round(3), "  ∇g =", grad_g.round(3))
print("parallel? cross =", round(grad_f[0]*grad_g[1] - grad_f[1]*grad_g[0], 6), " λ =", round(lam, 3))

optimum: [1.509 0.981]
∇f = [-0.982 -0.638]   ∇g = [3.018 1.962]
parallel? cross = -0.0  λ = -0.325

The optimum is the point on the circle closest to T, and the readout confirms the geometry: ∇f and ∇g have a cross product of zero, so they are parallel, and λ = −0.325 is the single constant linking them. λ is negative because T lies outside the circle (‖T‖ ≈ 2.39 > R = 1.8): at the optimum ∇f points back toward the centre while ∇g points outward, so the two gradients are parallel but opposite.
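The example above started from the known answer and checked tangency. As a complementary sketch, the stationarity conditions can be solved directly. Setting ∂L/∂x = 0 gives 2(x − T) = 2λx, so x = T / (1 − λ); setting ∂L/∂λ = 0 gives ‖x‖ = R, so |1 − λ| = ‖T‖ / R. That leaves two roots, which is a reminder that the Lagrangian finds every stationary point on the constraint, not only the minimum. The brute-force scan at the end is only a sanity check, not part of the method.

```python
import numpy as np

T = np.array([2.0, 1.3]); R = 1.8                 # same target and radius as above

def f(x):
    """Objective: squared distance to T (works on one point or an array of points)."""
    return np.sum((x - T) ** 2, axis=-1)

def g(x):
    """Constraint function: g(x) = 0 exactly on the circle of radius R."""
    return np.sum(x ** 2, axis=-1) - R ** 2

ratio = np.linalg.norm(T) / R
stationary = []
for lam in (1 - ratio, 1 + ratio):                # the two roots of |1 - lam| = ||T|| / R
    x = T / (1 - lam)
    grad_f, grad_g = 2 * (x - T), 2 * x
    # Both Lagrangian conditions hold: tangency and feasibility.
    assert np.allclose(grad_f, lam * grad_g) and abs(g(x)) < 1e-9
    stationary.append((lam, x, f(x)))
    print(f"λ = {lam:.3f}  x = {x.round(3)}  f(x) = {f(x):.3f}")

# Sanity check: scan the circle densely and locate the smallest f.
theta = np.linspace(0.0, 2.0 * np.pi, 200_001)
circle = R * np.column_stack([np.cos(theta), np.sin(theta)])
x_scan = circle[np.argmin(f(circle))]
print("scan minimum:", x_scan.round(3))
```

The root λ = 1 − ‖T‖/R is the closest point and matches the λ printed earlier; the other root is the farthest point on the circle. Deciding which stationary point is the minimum still takes a comparison of f, which is why the multiplier equations are necessary conditions rather than a complete answer.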

Inequalities: a peek at KKT

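When the constraint is g(x) ≤ 0 instead of = 0, the KKT conditions generalize the idea. The key new rule is complementary slackness: λ ≥ 0 and λ · g(x) = 0. (The sign condition depends on convention: λ ≥ 0 is the standard statement for minimising f with the Lagrangian written as f + λg, the opposite sign from the equality-constrained form above.) In plain terms: either the constraint is active (you’re pressed against it, g = 0, and typically λ > 0) or it’s slack (you’re safely inside, so λ = 0 and the constraint doesn’t matter). This single rule is what singles out an SVM’s support vectors.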
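A small sketch of complementary slackness, reusing the closest-point problem but relaxing the circle to a disk: minimise ‖x − T‖² subject to g(x) = ‖x‖² − R² ≤ 0. The function name and the multiplier name mu are illustrative choices; mu follows the standard KKT convention L = f + mu·g, so it corresponds to −λ under the sign used for the equality-constrained Lagrangian earlier.

```python
import numpy as np

def project_onto_disk(T, R):
    """Minimise ||x - T||^2 subject to ||x||^2 - R^2 <= 0.

    Returns (x, mu, g): the optimum, its KKT multiplier (L = f + mu*g), and g(x).
    """
    T = np.asarray(T, dtype=float)
    norm_T = np.linalg.norm(T)
    if norm_T <= R:
        # Slack: the unconstrained optimum T is already feasible, so mu = 0.
        x, mu = T.copy(), 0.0
    else:
        # Active: pressed against the boundary, g = 0 and mu > 0.
        x, mu = R * T / norm_T, norm_T / R - 1.0
    return x, mu, float(x @ x - R ** 2)

for target in ([2.0, 1.3], [0.5, 0.2]):
    x, mu, g = project_onto_disk(target, R=1.8)
    print(f"T = {target}  x = {x.round(3)}  mu = {mu:.3f}  g = {g:.3f}  mu*g = {mu * g:.3f}")
```

For the target outside the disk the constraint is active (g = 0, mu > 0); for the target inside it is slack (g < 0, mu = 0). In both cases the product mu·g is zero, which is all that complementary slackness says.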

Where this lives in ML

SVMs. Maximizing the margin under “classify everything correctly” is a constrained problem; its Lagrangian dual is what you actually solve. The points with non-zero multipliers are the support vectors — the only ones that matter.
Regularization. “Minimize loss subject to ‖w‖² ≤ c” and “minimize loss + λ‖w‖²” trace out the same solutions (for a convex loss, each budget c corresponds to some λ ≥ 0): λ is the Lagrange multiplier of the constraint. That’s why ridge’s λ and a hard norm budget are two views of one idea.
PCA. “Maximize variance vᵀΣv subject to ‖v‖ = 1” has a Lagrangian whose stationarity condition is Σv = λv, the eigenvalue equation. The principal directions are the Lagrange-stationary directions.
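Two of these claims are easy to check numerically. The sketch below uses synthetic data (the feature scales, true weights, and λ = 5 are arbitrary choices) and assumes the ridge objective ‖y − Xw‖² + λ‖w‖². It verifies that at the ridge solution the loss gradient is a multiple of the penalty gradient, that a larger λ corresponds to a tighter norm budget, and that the top eigenvector of the covariance matrix satisfies Σv = λv while beating random unit directions on variance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])   # synthetic features, unequal spread
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

def ridge(X, y, lam):
    """Closed-form minimiser of ||y - Xw||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Ridge: the loss gradient equals -lam times the gradient of ||w||^2.
lam = 5.0
w = ridge(X, y, lam)
grad_loss = 2 * X.T @ (X @ w - y)
grad_pen = 2 * w
ridge_residual = np.abs(grad_loss + lam * grad_pen).max()

# A larger multiplier means a smaller norm, i.e. a tighter budget c = ||w||^2.
norms = [np.linalg.norm(ridge(X, y, l)) for l in (0.1, 1.0, 10.0, 100.0)]

# PCA: the top eigenvector is a stationary point of v' Sigma v on the unit sphere.
Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
v = eigvecs[:, -1]
pca_residual = np.abs(Sigma @ v - eigvals[-1] * v).max()

# No random unit direction captures more variance than the top eigenvalue.
random_dirs = rng.normal(size=(1000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
random_var = np.einsum("ij,jk,ik->i", random_dirs, Sigma, random_dirs)

print("ridge stationarity residual:", ridge_residual)
print("‖w‖ as λ grows:", np.round(norms, 4))
print("PCA residual:", pca_residual, " best random variance vs top eigenvalue:",
      random_var.max().round(4), eigvals[-1].round(4))
```

Both residuals should be zero up to floating-point error. In the PCA case the multiplier is the variance captured along v, which is why the largest eigenvalue marks the first principal direction.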

In one breath

Most ML optimization is constrained — minimise or maximise subject to a rule — and Lagrange multipliers crack it with one geometric fact: at the best allowed point, the objective’s level curve is tangent to the constraint, so their gradients align, ∇f = λ∇g. Bundle them into the Lagrangian L(x, λ) = f(x) − λg(x), set every partial to zero, and a constrained problem becomes an ordinary stationary-point problem in x and λ together. For inequality constraints, KKT adds complementary slackness (λ ≥ 0, λ·g = 0 — a constraint is either active or slack), which is exactly what singles out an SVM’s support vectors. And λ is a shadow price: ridge’s regularization strength is the Lagrange multiplier of a weight budget, and PCA’s Σv = λv is a Lagrangian in disguise.

Practice

Quick check

Q1 At a constrained optimum, what is the relationship between ∇f and ∇g?

Q2 Ridge regression 'minimize loss + λ‖w‖²' is equivalent to which constrained problem?

Q3 In KKT, complementary slackness (λ·g = 0) tells you that for an SVM…

A question to carry forward

That closes the calculus half of the course. Look back at its whole arc: the derivative told us which way to step, the gradient pointed downhill in a million dimensions, backprop computed it for free, the Hessian read the curvature, convexity guaranteed a single bottom, and Lagrange multipliers fenced the search onto a constraint. Every one of those tools answered the same kind of question — given a loss, how do I find the parameters that minimise it?

But sit with the words “given a loss.” We never asked where the loss comes from, and we quietly pretended the data was fixed and known. It is neither. Real data is noisy, partial, and random — a click that may or may not land, a measurement smeared by error, a label that is right only nine times in ten. Before you can minimise an expected loss you must know what “expected” even means; before you can read a classifier’s 0.87 you must know it is a probability. Here is the thread into the next chapter: what is a random variable, what three rules does every probability obey, and why are independence and conditioning — the gulf between P(A|B) and P(B|A) — the quiet engine beneath every classifier output, every A/B test, and every loss you have spent this chapter minimising?
