datarekha
Deep Learning · Easy · Asked at Google · Asked at OpenAI · Asked at Microsoft

Why do neural networks need activation functions at all?

The short answer

Without a non-linear activation, any stack of linear layers collapses to a single affine transformation, so the model is no more expressive than linear regression (or logistic regression, if the output passes through a sigmoid). Activation functions break linearity so the network can approximate arbitrarily complex functions.

How to think about it

Consider a network with two linear layers and no activation:

output = W2 · (W1 · x + b1) + b2
       = (W2 W1) · x + (W2 b1 + b2)
       = W_eff · x + b_eff

The composition of two affine maps (a matrix multiply plus a bias) is still an affine map. You could repeat this for 100 layers and the result would still be a single matrix multiply plus a single bias. Depth buys nothing without non-linearity.
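The collapse can be checked numerically. The sketch below uses plain Python lists and made-up weights (the helper names `affine` and `matmul` and every number are illustrative assumptions, not part of the original answer). It compares the two-layer output with the single-layer form built from W_eff and b_eff.

```python
def affine(W, b, x):
    """Return W·x + b for a matrix W (list of rows), bias b, and vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def matmul(A, B):
    """Return the matrix product A·B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * c for a, c in zip(row, col)) for col in cols] for row in A]

# Illustrative weights (made up): 2 inputs -> 3 hidden units -> 2 outputs
W1 = [[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0]]
b1 = [0.5, -1.0, 2.0]
W2 = [[2.0, 0.0, -1.0], [1.0, 1.0, 1.0]]
b2 = [0.0, -0.5]
x = [3.0, -1.0]

# Two stacked linear layers with no activation in between
two_layer = affine(W2, b2, affine(W1, b1, x))

# The equivalent single layer: W_eff = W2·W1, b_eff = W2·b1 + b2
W_eff = matmul(W2, W1)
b_eff = affine(W2, b2, b1)
one_layer = affine(W_eff, b_eff, x)

print(two_layer, one_layer)
```

Both lists hold the same values for any input x, because the equality is algebraic and does not depend on the particular weights chosen here.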

An activation function f breaks this: f(W2 · f(W1 · x + b1) + b2) cannot, in general, be reduced to a single affine form. The network gains the ability to carve non-linear decision boundaries, model interactions between features, and, per the Universal Approximation Theorem, approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units.
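XOR is a small example of a function that needs the non-linearity. Any affine function g satisfies g(0,0) + g(1,1) = g(0,1) + g(1,0), while XOR would require 0 + 0 = 1 + 1, so no purely linear stack can represent it. The sketch below shows two ReLU hidden units doing so. The weights are hand-picked for illustration rather than learned, and `xor_net` is a hypothetical name.

```python
def relu(z):
    """Rectified linear unit: pass positive values, clamp the rest to zero."""
    return max(z, 0.0)

def xor_net(x1, x2):
    """Tiny network: 2 inputs -> 2 ReLU hidden units -> 1 linear output."""
    h1 = relu(x1 + x2)        # grows with the number of active inputs
    h2 = relu(x1 + x2 - 1.0)  # non-zero only when both inputs are active
    return h1 - 2.0 * h2      # linear read-out of the hidden units

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_net(a, b) for a, b in inputs]

for (a, b), y in zip(inputs, outputs):
    print(a, b, "->", y)
```

Removing the two `relu` calls turns `xor_net` back into an affine function of its inputs, which can no longer produce the XOR pattern.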

What makes a good activation?

  • Non-linear (obvious).
  • Differentiable (or at least sub-differentiable) — needed for backpropagation.
  • Computationally cheap.
  • Does not saturate everywhere — saturated regions kill gradients.
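The saturation point can be made concrete by comparing derivatives. This is a minimal sketch using only the standard library. The function names are illustrative, and the ReLU derivative at exactly zero is set to 0 here by convention, since ReLU is only sub-differentiable at that point.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)); its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Sub-gradient of ReLU: 1 for positive inputs, 0 otherwise (0 chosen at z = 0)."""
    return 1.0 if z > 0 else 0.0

for z in (-10.0, 0.0, 10.0):
    print(f"z={z:+.1f}  sigmoid'={sigmoid_grad(z):.6f}  relu'={relu_grad(z):.1f}")

# Best-case gradient factor through 10 sigmoid layers: every factor is at most 0.25
best_case_chain = 0.25 ** 10
print(f"upper bound on the sigmoid chain over 10 layers: {best_case_chain:.2e}")
```

The sigmoid saturates on both sides, so its gradient is tiny for large |z|, and even in the best case the chained factor shrinks geometrically with depth. ReLU saturates only for negative inputs and passes a gradient of 1 through the positive side.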

The choice of activation (ReLU, GELU, sigmoid) matters a great deal for training dynamics, but any reasonable non-linearity is far better than none.
