LU Decomposition
Factor A = L·U into a lower- and an upper-triangular matrix — it is Gaussian elimination remembered, and it makes solving Ax = b for many right-hand sides cheap.
What you'll learn
- LU factorisation: A = L·U with L lower-triangular (unit diagonal) and U upper-triangular
- U is the row-echelon form; L stores the elimination multipliers
- Why LU is used: solve Ax = b cheaply for many different b via forward- then back-substitution
- Pivoting (PA = LU) exists for numerical stability and when a zero pivot appears
Before you start
Here’s the small annoyance behind LU: Gaussian elimination does a pile of work to
triangulate a matrix, and then we throw all of it away. LU decomposition is
just elimination that keeps the receipt. You split A into two triangular
pieces, A = L · U. The U is the upper-triangular echelon form you already get
from elimination. The L is a lower-triangular matrix (with 1s on its
diagonal) that records the multipliers you used along the way.
The win is reuse. Once L and U are saved, solving Ax = b for many different
right-hand sides becomes cheap — the heavy triangulation happens once, and each
new solve is two quick substitutions. This is exactly why a general linear solver
like scipy.linalg.solve factors the matrix once with LU under the hood rather
than re-running elimination per right-hand side.
The factorisation
Concretely: when you clear an entry by the operation R2 -> R2 − m·R1, the number m
is a multiplier. U is what A becomes after all such operations; L is the identity
matrix with each multiplier m slotted into the position it cleared. That is the entire
construction — no extra computation beyond the elimination you already do.
To then solve Ax = b, write it as L(Ux) = b and split into two triangular solves:
- Forward-substitution on
Ly = b(top row down) recoversy. - Back-substitution on
Ux = y(bottom row up) recoversx.
Each substitution costs about O(n^2), while the factorisation costs O(n^3) and is
done only once. So for k different right-hand sides the total is one O(n^3)
factorisation plus k cheap O(n^2) solves — far better than re-eliminating k times.
That reuse is the whole reason LU exists.
How GATE asks this
Expect either an MCQ on the idea (“what does L store?”, “why use LU instead of
re-running elimination?”) or a small NAT that hands you a 2-by-2 or 3-by-3 matrix and
asks for one entry of L or U after factorisation. The mechanics are the whole
game: run elimination, read U off the result, and read each L entry off the
multiplier that cleared that position. Pivoting rarely appears beyond a one-line “it
exists for stability” fact.
Worked example
Factor A = [[2, 1], [4, 3]].
A = [ 2 1 ]
[ 4 3 ]
eliminate the (2,1) entry: multiplier m = 4 / 2 = 2
R2 -> R2 - 2·R1:
U = [ 2 1 ] L = [ 1 0 ] (slot m = 2 into position (2,1),
[ 0 1 ] [ 2 1 ] keep 1s on the diagonal)
The single multiplier was 4/2 = 2, so L has a 2 under its diagonal and U is the
echelon form. Verify L · U = A:
[ 1 0 ] [ 2 1 ] [ 1·2 + 0·0 1·1 + 0·1 ] [ 2 1 ]
[ 2 1 ] [ 0 1 ] = [ 2·2 + 1·0 2·1 + 1·1 ] = [ 4 3 ] = A ✓
So L = [[1, 0], [2, 1]] and U = [[2, 1], [0, 1]].
Quick check
Quick check
Practice this in an interview
All questionsGELU (Gaussian Error Linear Unit) multiplies the input by the probability that a standard Gaussian random variable is smaller than it, producing a smooth, non-monotonic curve that approximates ReLU but with a stochastic regularization flavor. Transformers favor GELU because the smooth gradient near zero improves optimization in deep attention-based architectures.
Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.
OLS minimizes the sum of squared residuals. Setting the gradient of the loss to zero yields the normal equations, whose unique solution is the projection of y onto the column space of X. The closed-form is the hat matrix formula β = (XᵀX)⁻¹Xᵀy.
L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights proportionally but rarely zeroes any out. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.