What is SwiGLU and why did modern LLMs replace the ReLU/GELU MLP with it?

SwiGLU is a gated feed-forward layer: it projects the input into two paths, passes one (the gate) through SiLU, multiplies the two elementwise, then projects down: (SiLU(x·W_gate) ⊙ (x·W_up)) · W_down. The elementwise gate acts as a smooth, learned, per-feature volume control, so the network can decide how much of each feature to pass instead of applying a hard ReLU on/off switch. It gives better quality per parameter, which is why LLaMA, Mistral, and Qwen use it. Because it has three weight matrices instead of two, it uses a smaller hidden width (about 2/3 of the usual one) to keep the parameter budget equal.
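A minimal sketch of that forward pass in plain Python may help make the data flow concrete. The toy weights and the dimensions (model width 2, hidden width 3) are assumptions chosen for illustration, and the two parameter-count helpers only demonstrate the 2/3 width argument; none of this reflects any particular model's implementation.

```python
import math

def silu(z):
    # SiLU (swish): z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    # x: length-d vector, W: d x h matrix (list of rows) -> length-h vector x·W
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(x, W_gate)]  # gate path goes through SiLU
    up = matvec(x, W_up)                         # second path stays linear
    hidden = [g * u for g, u in zip(gate, up)]   # elementwise product
    return matvec(hidden, W_down)                # project back to model width

def mlp_params(d, h):
    return 2 * d * h      # classic MLP: up and down matrices

def swiglu_params(d, h):
    return 3 * d * h      # SwiGLU: gate, up and down matrices

# Toy weights (illustrative assumptions): d_model = 2, hidden = 3
x = [1.0, -2.0]
W_gate = [[0.5, -1.0, 0.0], [0.25, 0.5, 1.0]]
W_up = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
W_down = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
y = swiglu_ffn(x, W_gate, W_up, W_down)
print(y)

# Same budget: hidden width 1024 is 2/3 of 1536
print(mlp_params(512, 1536), swiglu_params(512, 1024))
```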

What is GELU and why does it outperform ReLU in transformer models?

GELU (Gaussian Error Linear Unit) multiplies the input by the probability that a standard Gaussian random variable is smaller than it, producing a smooth, non-monotonic curve that approximates ReLU but with a stochastic regularization flavor. Transformers favor GELU because the smooth gradient near zero improves optimization in deep attention-based architectures.
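The definition above can be written directly as x·Φ(x), where Φ is the standard normal CDF. The sketch below uses the exact form via `math.erf` (many libraries also offer a tanh approximation, which is not shown here) and prints it next to ReLU so the small negative dip is visible.

```python
import math

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF expressed through erf
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")
```

Note that gelu(-1) lies below gelu(-3): the curve dips slightly for moderately negative inputs before returning towards zero, which is the non-monotonic part.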

What is GGUF, and what does a quantization tier like Q4_K_M mean?

GGUF is a single-file format for running LLMs locally, used by llama.cpp and Ollama. Unlike training-oriented formats, it packs weights, tokenizer, and metadata into one memory-mappable file optimized for inference on CPU or partial GPU. Q4_K_M describes the quantization: roughly 4 bits per weight (vs 16 for FP16), using the k-quant method, medium variant, which protects the most important tensors at higher precision. It is the community default because it keeps almost all of the model's quality at about a quarter of the FP16 size.
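The size claim is simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below assumes a hypothetical 7-billion-parameter model and a flat 4 bits per weight. Real Q4_K_M files come out somewhat larger, because some tensors are kept at higher precision and the quantized blocks also store scale values.

```python
def weight_size_gb(n_params, bits_per_weight):
    # bytes = params * bits / 8; reported in decimal gigabytes
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # hypothetical 7B-parameter model
fp16_gb = weight_size_gb(n_params, 16)
q4_gb = weight_size_gb(n_params, 4)   # idealised flat 4 bits per weight
print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB, ratio: {q4_gb / fp16_gb:.2f}")
```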

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage techniques are model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend by 60–90% without quality regression.
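To see how routing alone moves the bill, here is a rough cost model. Every number in it is hypothetical: the per-million-token prices, the model names "small" and "large", and the workload mix are made up for illustration and do not correspond to any provider.

```python
# Hypothetical prices in dollars per million tokens: (input, output).
PRICES = {"small": (0.5, 1.5), "large": (5.0, 15.0)}

def request_cost(model, input_tokens, output_tokens):
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1e6

def route(task_is_simple):
    # Model routing: send simple tasks to the small model.
    return "small" if task_is_simple else "large"

# Hypothetical workload: 1000 requests, 70% of them simple,
# each with 2000 input tokens and 500 output tokens.
requests = [True] * 700 + [False] * 300
baseline = sum(request_cost("large", 2000, 500) for _ in requests)
routed = sum(request_cost(route(simple), 2000, 500) for simple in requests)
print(f"all-large: ${baseline:.2f}   routed: ${routed:.2f}")
```

The same function also shows why output length control matters: the output term is multiplied by the higher of the two prices in this made-up table.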

LU Decomposition — GATE DA

Elimination does a pile of work to triangulate a matrix, then we throw it away. LU keeps the receipt: A = L·U splits into a lower- and an upper-triangular piece, so solving Ax = b for many right-hand sides becomes two cheap substitutions instead of repeating the whole elimination.

6 min read · Intermediate · GATE DA · Lesson 30 of 122

What you'll learn

LU factorisation: A = L·U with L lower-triangular (unit diagonal) and U upper-triangular

U is the row-echelon form; L stores the elimination multipliers

Why LU: solve Ax = b cheaply for many b via forward- then back-substitution

Pivoting (PA = LU) for numerical stability and when a zero pivot appears

The last lesson promised that the row-elimination from lesson four could be bottled and reused. Here is the small annoyance that makes that worth doing: Gaussian elimination does a pile of work to triangulate a matrix, and then we throw every bit of it away. LU decomposition is elimination that keeps the receipt. You split A into two triangular pieces, A = L · U. The U is the upper-triangular echelon form elimination already hands you; the L is a lower-triangular matrix (with 1s on its diagonal) that records the multipliers you used on the way.

The factorisation

L holds the elimination multipliers below its unit diagonal; U is the triangular result of elimination.

The construction adds nothing to the elimination you already run. When you clear an entry with R2 → R2 − m·R1, the number m is a multiplier; U is what A becomes once all such operations are done, and L is the identity with each multiplier m slotted into the exact position it cleared. To then solve Ax = b, rewrite it as L(Ux) = b and split into two triangular solves:

Forward-substitution on Ly = b (top row down) recovers y.
Back-substitution on Ux = y (bottom row up) recovers x.
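The two triangular solves can be sketched in a few lines of plain Python. The factors below are those of A = [[2, 1], [4, 3]], namely L = [[1, 0], [2, 1]] and U = [[2, 1], [0, 1]]; the right-hand sides are arbitrary values picked for illustration, and the function names are not from any library.

```python
def forward_sub(L, b):
    # Solve L y = b from the top row down; L has a unit diagonal,
    # so no division is needed.
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    return y

def back_sub(U, y):
    # Solve U x = y from the bottom row up, dividing by each pivot.
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x

L = [[1.0, 0.0], [2.0, 1.0]]
U = [[2.0, 1.0], [0.0, 1.0]]

# The same factors serve every right-hand side.
for b in ([3.0, 7.0], [1.0, 1.0]):
    x = back_sub(U, forward_sub(L, b))
    print(b, "->", x)
# [3.0, 7.0] -> [1.0, 1.0]
# [1.0, 1.0] -> [1.0, -1.0]
```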

Each substitution costs O(n²), while the factorisation costs O(n³) and is done once. So for k different right-hand sides the bill is one O(n³) factorisation plus k cheap O(n²) solves, far better than re-eliminating k times. That reuse is the whole reason LU exists. It is why scipy.linalg.solve relies on an LU factorisation under the hood, and why SciPy also provides scipy.linalg.lu_factor and scipy.linalg.lu_solve for factoring once and then solving against many right-hand sides.
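A back-of-the-envelope count shows how large the gap gets. The sketch below keeps only leading-order terms and drops constant factors, so the results are orders of magnitude, not exact operation counts; n = 1000 and k = 100 are arbitrary illustrative sizes.

```python
def reeliminate_cost(n, k):
    # k full eliminations, each of order n^3
    return k * n**3

def lu_cost(n, k):
    # one factorisation of order n^3, then two order-n^2 substitutions per b
    return n**3 + k * 2 * n**2

n, k = 1000, 100
print(reeliminate_cost(n, k))                  # 100000000000
print(lu_cost(n, k))                           # 1200000000
print(reeliminate_cost(n, k) / lu_cost(n, k))  # about 83x fewer operations
```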

A worked example

Factor A = [[2, 1], [4, 3]]:

A = [ 2  1 ]     clear (2,1): multiplier m = 4 / 2 = 2,  R2 → R2 − 2·R1
    [ 4  3 ]

U = [ 2  1 ]     L = [ 1  0 ]     (slot m = 2 into position (2,1); keep 1s on L's diagonal)
    [ 0  1 ]         [ 2  1 ]

verify L·U:  [ 1  0 ][ 2  1 ]   [ 1·2+0·0   1·1+0·1 ]   [ 2  1 ]
             [ 2  1 ][ 0  1 ] = [ 2·2+1·0   2·1+1·1 ] = [ 4  3 ] = A  ✓

So L = [[1, 0], [2, 1]] and U = [[2, 1], [0, 1]] — the single multiplier 4/2 = 2 lives in L, the pivots stay in U.
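The same bookkeeping can be automated. Below is a minimal sketch of LU without pivoting in plain Python; the function name `lu_no_pivot` is made up for this example, and the routine simply raises when it would have to divide by a zero pivot.

```python
def lu_no_pivot(A):
    n = len(A)
    U = [[float(v) for v in row] for row in A]   # working copy, becomes U
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for col in range(n - 1):
        pivot = U[col][col]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot: plain LU needs a row swap")
        for row in range(col + 1, n):
            m = U[row][col] / pivot              # elimination multiplier
            L[row][col] = m                      # slot it where it cleared
            for j in range(col, n):
                U[row][j] -= m * U[col][j]       # R_row -> R_row - m * R_col
    return L, U

L, U = lu_no_pivot([[2, 1], [4, 3]])
print(L)  # [[1.0, 0.0], [2.0, 1.0]]
print(U)  # [[2.0, 1.0], [0.0, 1.0]]
```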

A question to carry forward

LU broke a matrix into two triangular factors to tame it. Sometimes, though, it is the matrix itself we want to break — chop a big one into a grid of smaller rectangular blocks and treat each block as a single entry. Here is the thread onward: can you add and multiply matrices block by block as if the blocks were ordinary numbers, and when does that shortcut hold?

In one breath

LU: A = L·U — elimination that keeps the receipt; U = the upper-triangular echelon form, L = lower-triangular multipliers with a unit diagonal.
Build it free: the multiplier m from R2 → R2 − m·R1 slots into L at the position it cleared; U is the eliminated A.
The win is reuse: factor once at O(n³), then each b is a forward solve Ly=b + back solve Ux=y, both O(n²). Great for many right-hand sides.
Don’t put pivots in L (they live in U); L’s diagonal is all 1s.
Zero pivot ⇒ plain LU fails; pivot (swap rows) and factor PA = LU — also used for stability.
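For the pivoted case, here is one possible sketch of PA = LU with partial pivoting (choose the largest-magnitude entry in the column as pivot). The function name and the choice to return the permutation as a list of row indices instead of a matrix P are assumptions made for this example; the matrix is an arbitrary one with a zero in the (1,1) position.

```python
def lu_partial_pivot(A):
    n = len(A)
    U = [[float(v) for v in row] for row in A]
    L = [[0.0] * n for _ in range(n)]
    perm = list(range(n))              # perm[i] = original row now in position i
    for col in range(n - 1):
        # Pick the largest-magnitude entry at or below the diagonal.
        p = max(range(col, n), key=lambda r: abs(U[r][col]))
        if U[p][col] == 0:
            continue                   # nothing to clear in this column
        if p != col:
            U[col], U[p] = U[p], U[col]
            L[col], L[p] = L[p], L[col]          # carry earlier multipliers along
            perm[col], perm[p] = perm[p], perm[col]
        for row in range(col + 1, n):
            m = U[row][col] / U[col][col]
            L[row][col] = m
            for j in range(col, n):
                U[row][j] -= m * U[col][j]
    for i in range(n):
        L[i][i] = 1.0                  # unit diagonal
    return perm, L, U

A = [[0, 1], [2, 4]]                   # zero in the first pivot position
perm, L, U = lu_partial_pivot(A)
print(perm)  # [1, 0]  (the two rows were swapped)
print(L)     # [[1.0, 0.0], [0.0, 1.0]]
print(U)     # [[2.0, 4.0], [0.0, 1.0]]
```

Reading the result: row i of PA is row perm[i] of A, so here PA = [[2, 4], [0, 1]], which equals L·U.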

Practice

Quick check

Q1. Recall: in A = L·U, what does L store and what does U equal?

Q2. Trace: factor A = [[3, 6], [1, 5]] as L·U with unit diagonal on L. Enter the (2,1) entry of L. (numerical answer)

Q3. Trace: for A = [[2, 1], [4, 3]] with L = [[1, 0], [2, 1]], what is the (2,2) entry of U? (numerical answer)

Q4. Apply: which statements about A = L·U are correct? (select all that apply)

Q5. Apply: solving Ax = b for many different b with the same A. Why prefer LU over re-running elimination each time?

Q6. Create: A = [[0, 2], [1, 3]]. Explain why plain A = L·U fails here, and write the fix in one line.
