Deep Learning | Medium | Asked at Google, Meta, NVIDIA, Microsoft, Amazon

What are skip connections in ResNet and why were they necessary?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

Skip connections add the input of a block directly to its output — bypassing the conv-BN-ReLU stack — so gradients can flow straight back to early layers without passing through every weight matrix. They solved the degradation problem that made deeper plain networks perform worse than shallower ones, not because of overfitting but because of optimisation difficulty.

How to think about it

Most candidates can define a skip connection; the ones who stand out explain why adding depth hurt before ResNet and what the residual formulation actually changes mathematically.

The degradation problem

Before ResNet (He et al., 2015), stacking more layers onto an already deep plain network made training error go up, not just test error. That rules out overfitting, which would lower training error while raising test error. A deeper plain network should be at least as good as a shallower one, since the extra layers could simply learn identity functions, yet in practice optimisers couldn’t find that solution.
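The "at least as good" argument can be made concrete with a toy sketch. The tiny fully connected network below is an illustrative assumption (plain Python, made-up weights, hypothetical helper names), not the architecture from the paper. It only shows that a deeper network built by appending identity layers reproduces the shallower network's output exactly, so a solution with equal training error exists.

```python
# Toy illustration of the identity-construction argument.
# All names and weights here are hypothetical, chosen for illustration only.

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def relu(x):
    return [max(0.0, v) for v in x]

def forward(layers, x):
    """Plain network: each layer is a weight matrix followed by ReLU."""
    for w in layers:
        x = relu(matvec(w, x))
    return x

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

shallow = [
    [[0.5, -1.0], [1.5, 0.25]],
    [[1.0, 2.0], [-0.5, 1.0]],
]
# Deeper network by construction: the same layers plus two identity layers.
# Activations after a ReLU are non-negative, so ReLU(I @ h) == h.
deep = shallow + [identity(2), identity(2)]

x = [1.0, 2.0]
shallow_out = forward(shallow, x)
deep_out = forward(deep, x)
print(shallow_out, deep_out)
```

The degradation result says that gradient-based training of a plain deep network tends not to land on this solution, even though the sketch shows it exists.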

The residual formulation

A residual block computes:

y = F(x, {W_i}) + x

where F is the residual mapping (two or three conv layers) and x is the block input, carried unchanged by the shortcut. Instead of learning the desired mapping H(x) directly, the stack only needs to learn the residual F(x) = H(x) - x. If the identity is optimal, F can be driven to zero easily, which is a much simpler target for gradient descent than learning the full identity transform through a stack of nonlinear layers.

The shortcut adds x directly to F(x). If the optimal function is near-identity, F only needs to learn a small residual correction.
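As a minimal sketch of why zero is an easy target, the snippet below compares a plain block with a residual block when every weight is zero. Two small linear layers stand in for the conv stack, and the ReLU that ResNet applies after the addition is left out for clarity. Both are simplifying assumptions, and the function names are hypothetical.

```python
# Minimal sketch: residual block vs plain block with all-zero weights.
# Linear layers stand in for the conv-BN-ReLU stack (an assumption).

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_branch(x, w1, w2):
    """F(x): two linear layers with a ReLU in between."""
    return matvec(w2, relu(matvec(w1, x)))

def plain_block(x, w1, w2):
    """Plain block: the output is F(x) alone."""
    return residual_branch(x, w1, w2)

def residual_block(x, w1, w2):
    """Residual block: y = F(x) + x."""
    f = residual_branch(x, w1, w2)
    return [fi + xi for fi, xi in zip(f, x)]

zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [3.0, -1.5]

plain_out = plain_block(x, zeros, zeros)        # zero weights erase the input
residual_out = residual_block(x, zeros, zeros)  # zero weights give the identity
print(plain_out, residual_out)
```

With the shortcut, the "do nothing" solution sits at F = 0, whereas the plain block would have to find specific non-zero weights to pass x through.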

Gradient highway

During backprop, the Jacobian of the block output with respect to x is I + ∂F/∂x, so the gradient reaching x always includes the upstream gradient passed through unchanged by the identity shortcut, regardless of the conv weights. This mitigates the multiplicative vanishing that plagues deep plain networks and allows training 100+ layer models stably.
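The effect of the identity term can be seen in a scalar toy model. The chain of tanh layers below, with a shared weight w = 0.5 and depth 50, is an assumption chosen for illustration, not a real ResNet. The code accumulates the derivative of the final activation with respect to the input via the chain rule, once for plain layers and once for residual layers.

```python
import math

def input_gradient(depth, w, x, residual):
    """Return d(output)/d(input) for a chain of scalar tanh layers.

    Plain layer:    h_next = tanh(w * h)
    Residual layer: h_next = h + tanh(w * h)
    """
    h, grad = x, 1.0
    for _ in range(depth):
        t = math.tanh(w * h)
        local = w * (1.0 - t * t)   # derivative of tanh(w * h) w.r.t. h
        if residual:
            grad *= 1.0 + local     # the "+ 1" comes from the identity shortcut
            h = h + t
        else:
            grad *= local           # plain layers multiply small factors together
            h = t
    return grad

plain_grad = input_gradient(depth=50, w=0.5, x=1.0, residual=False)
residual_grad = input_gradient(depth=50, w=0.5, x=1.0, residual=True)
print(f"plain: {plain_grad:.3e}  residual: {residual_grad:.3e}")
```

In this toy setup every plain factor is at most 0.5, so the product collapses towards zero, while every residual factor is at least 1. Real networks have matrix-valued Jacobians and normalisation layers, so this is a qualitative picture only.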

Projection shortcuts

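When the shortcut and the block output have different shapes (typically at a stage boundary, where a stride-2 layer halves the spatial resolution and the channel count increases), a 1×1 conv with stride 2 is applied to x before the addition so that the dimensions match.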
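A small shape walk-through may help here. The sketch below implements a 1×1 convolution on nested lists in plain Python. The tensor sizes (2 channels at 4×4 going to 4 channels at 2×2) and the projection weights are made up for illustration, and the main-branch output is a zero placeholder rather than a real conv stack.

```python
def conv1x1(x, weight, stride=1):
    """1x1 convolution on a C x H x W nested list.

    weight has shape C_out x C_in; the stride subsamples rows and columns.
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [
                sum(w_row[c] * x[c][i][j] for c in range(c_in))
                for j in range(0, width, stride)
            ]
            for i in range(0, height, stride)
        ]
        for w_row in weight
    ]

def shape(t):
    return (len(t), len(t[0]), len(t[0][0]))

# Block input: 2 channels, 4 x 4 spatial.
x = [[[float(c + i + j) for j in range(4)] for i in range(4)] for c in range(2)]

# Hypothetical projection weights mapping 2 channels to 4.
w_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]

shortcut = conv1x1(x, w_proj, stride=2)

# Placeholder for the main-branch output F(x): 4 channels, 2 x 2 spatial.
f_out = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(4)]

# Element-wise addition is only possible because the shapes now agree.
y = [
    [[f + s for f, s in zip(f_row, s_row)] for f_row, s_row in zip(f_ch, s_ch)]
    for f_ch, s_ch in zip(f_out, shortcut)
]
print(shape(x), shape(shortcut), shape(y))
```

The 1×1 kernel changes only the channel count, and the stride handles the spatial downsampling, which is why the two are combined in the projection.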
