Why do neural networks need activation functions at all?
Without a non-linear activation, any stack of linear layers collapses to a single affine transformation, so the model is no more expressive than linear regression (or logistic regression, if a sigmoid is applied only at the output). Activation functions break linearity so the network can approximate far more complex functions.
How to think about it
Consider a network with two linear layers and no activation:
output = W2 · (W1 · x + b1) + b2
= (W2 W1) · x + (W2 b1 + b2)
= W_eff · x + b_eff
The composition of two affine maps (a linear map plus a bias) is still an affine map. You could repeat this for 100 layers and the result would still be a single matrix multiply plus a bias. Depth buys nothing without non-linearity.
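The collapse derived above can be checked numerically. The following is a minimal sketch using NumPy; the layer sizes (4 inputs, 5 hidden units, 3 outputs), the random weights, and the function names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 inputs -> 5 hidden units -> 3 outputs.
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)

def two_layer_linear(x):
    """Two stacked linear layers with no activation in between."""
    return W2 @ (W1 @ x + b1) + b2

# Collapse the stack into one effective weight matrix and bias.
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2

def single_layer(x):
    """The equivalent single affine layer."""
    return W_eff @ x + b_eff

x = rng.normal(size=4)
# The two computations agree up to floating-point rounding.
print(np.allclose(two_layer_linear(x), single_layer(x)))
```

The same check passes for any input x, since the identity is algebraic and does not depend on the particular weights.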
An activation function f breaks this: for a non-linear f, f(W2 · f(W1 · x + b1) + b2) cannot in general be reduced to a single affine form. The network gains the ability to carve non-linear decision boundaries, model interactions between features, and, per the Universal Approximation Theorem, approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units.
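One way to see the gain concretely is XOR, a standard example of a target that no affine map can fit. The sketch below is illustrative only: the weights are hand-picked rather than trained, ReLU is applied to the hidden layer only (the output is left linear), and all names are assumptions. It compares a tiny ReLU network against the best affine fit found by least squares.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# The four XOR inputs and their targets.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# Hand-picked (not trained) weights: 2 inputs -> 2 hidden units -> 1 output.
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

def relu_net(x):
    """One hidden ReLU layer followed by a linear output."""
    return w2 @ relu(W1 @ x + b1)

relu_preds = np.array([relu_net(x) for x in X])

# Best affine fit by least squares: append a column of ones for the bias.
X_aug = np.hstack([X, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
affine_preds = X_aug @ coef

print(relu_preds)    # reproduces the XOR targets
print(affine_preds)  # roughly 0.5 for every input
```

The affine model can do no better than predicting the mean of the targets, while two ReLU units are enough to reproduce them.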
What makes a good activation?
- Non-linear (obvious).
- Differentiable (or at least sub-differentiable) — needed for backpropagation.
- Computationally cheap.
- Does not saturate everywhere: in saturated regions the derivative is close to zero, so gradients flowing back through them vanish.
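The saturation point can be illustrated by comparing derivatives directly. This is a rough sketch, assuming NumPy and a handful of sample inputs picked for illustration; the function names are hypothetical, and the ReLU derivative at exactly 0 is set to 0 by convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Sub-derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at z == 0 by convention)."""
    return (z > 0).astype(float)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

# Sigmoid saturates in both tails: the derivative shrinks toward 0 as |z| grows.
print(sigmoid_grad(z))
# ReLU saturates only on the negative side: the derivative stays 1 for any positive input.
print(relu_grad(z))
```

Sigmoid's derivative never exceeds 0.25 and falls off quickly on both sides, whereas ReLU passes the gradient through unchanged for every active unit.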
The choice of activation (ReLU, GELU, sigmoid) matters a great deal for training dynamics, but any reasonable non-linearity is far better than none.