What are skip connections in ResNet and why were they necessary?
Skip connections add the input of a block directly to its output — bypassing the conv-BN-ReLU stack — so gradients can flow straight back to early layers without passing through every weight matrix. They solved the degradation problem that made deeper plain networks perform worse than shallower ones, not because of overfitting but because of optimisation difficulty.
How to think about it
Most candidates can define a skip connection; the ones who stand out explain why adding depth hurt before ResNet and what the residual formulation actually changes mathematically.
The degradation problem
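Before ResNet (He et al., 2015), stacking more layers onto an already deep plain network made it perform worse. In the paper's CIFAR-10 experiment, a 56-layer plain net had higher training error and higher test error than a 20-layer one. Because the training error itself rose, this could not be overfitting. In principle a deeper plain network should be at least as good as a shallower one, since the extra layers could simply learn identity functions. In practice the optimiser could not find that solution.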
The residual formulation
A residual block computes:
y = F(x, {W_i}) + x
where F is the residual mapping (two or three conv layers with weights W_i) and x is carried unchanged by the shortcut. If H(x) is the mapping the block should ideally represent, the layers only need to learn the residual F(x) = H(x) - x. If the identity is optimal, F can be driven to zero by pushing the weights towards zero, which is a much simpler target for gradient descent than fitting an exact identity transform through a stack of nonlinear layers.
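A minimal sketch of that last point, in plain Python. It is a toy, not the paper's block: it assumes fully connected layers instead of convolutions, no batch norm, and no ReLU after the addition, and the helper names (linear, relu, residual_block, plain_block) are made up for illustration. With all weights at zero, the residual block already computes the identity, while the plain block collapses to zero.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def plain_block(x, W1, W2):
    # Plain stack: the layers must represent the whole mapping themselves
    return linear(W2, relu(linear(W1, x)))

def residual_block(x, W1, W2):
    # Same layers compute F(x); the shortcut adds x back: y = F(x) + x
    f = plain_block(x, W1, W2)
    return [fi + xi for fi, xi in zip(f, x)]

x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

print(residual_block(x, zeros, zeros))  # [1.0, -2.0, 3.0]  (identity)
print(plain_block(x, zeros, zeros))     # [0.0, 0.0, 0.0]   (input is lost)
```

In the plain block, reproducing x exactly would need specific non-zero weights that also survive the ReLU; in the residual block, "do nothing" is the zero-weight solution.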
Gradient highway
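During backprop, y = F(x) + x gives dy/dx = I + dF/dx, so dL/dx = dL/dy · (I + dF/dx). The identity term comes from the shortcut and passes the upstream gradient back unchanged, regardless of the conv weights. Across many stacked blocks, early layers therefore always receive a directly propagated gradient component instead of only a long product of per-layer Jacobians. This counters the multiplicative vanishing that plagues deep plain networks and allows training 100+ layer models stably.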
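A scalar back-of-the-envelope version of this argument, as a hedged sketch: assume every layer's branch has the same small local derivative (0.01, an arbitrary illustrative value) and ignore everything else. The function name grad_through_stack is hypothetical. A plain stack multiplies the branch derivatives together; a residual stack multiplies (1 + derivative) terms.

```python
def grad_through_stack(branch_derivs, residual):
    # Chain rule through a stack of scalar layers.
    # Plain layer:    y = f(x)      -> local derivative d
    # Residual layer: y = f(x) + x  -> local derivative 1 + d
    grad = 1.0
    for d in branch_derivs:
        grad *= (1.0 + d) if residual else d
    return grad

depth = 50
branch_derivs = [0.01] * depth  # assumed small per-layer derivative

plain_grad = grad_through_stack(branch_derivs, residual=False)
residual_grad = grad_through_stack(branch_derivs, residual=True)

print(plain_grad)     # on the order of 1e-100: effectively vanished
print(residual_grad)  # roughly 1.64: still a usable signal
```

The same formula shows the flip side: if the branch derivatives were large, the residual product could grow instead, so shortcuts complement careful initialisation and normalisation and do not replace them.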
Projection shortcuts
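When the shortcut and the block output have different shapes, the identity cannot be added directly. This typically happens at the start of a new stage, where a stride-2 layer halves the spatial resolution and the channel count increases. In that case a 1×1 conv with the same stride (2 here) is applied to x before the addition, so that its channel count and spatial size both match the block output.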
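To make the shape bookkeeping concrete, here is a shape-level sketch in plain Python using nested lists laid out as [channels][height][width]. The helpers conv1x1, shape and add, the weight values, and the stand-in block output are all assumptions for illustration; a real model would use a framework's convolution layer with learned weights.

```python
def conv1x1(x, weight, stride):
    # x: [C_in][H][W], weight: [C_out][C_in].
    # A 1x1 conv mixes channels at each sampled pixel; the stride skips pixels.
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [
        [[sum(weight[o][c] * x[c][i][j] for c in range(c_in))
          for j in range(0, w, stride)]
         for i in range(0, h, stride)]
        for o in range(len(weight))
    ]

def shape(t):
    return (len(t), len(t[0]), len(t[0][0]))

def add(a, b):
    # Element-wise sum of two tensors with identical shapes
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

# Block input: 2 channels, 4x4 spatial
x = [[[float(c * 16 + i * 4 + j) for j in range(4)] for i in range(4)]
     for c in range(2)]

# Stand-in for F(x) after a stride-2 block that doubles the channels
block_out = [[[1.0] * 2 for _ in range(2)] for _ in range(4)]

# Projection weights are learned in practice; fixed here for illustration
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
shortcut = conv1x1(x, weight, stride=2)

print(shape(x), shape(shortcut), shape(block_out))  # (2, 4, 4) (4, 2, 2) (4, 2, 2)
y = add(block_out, shortcut)  # y = F(x) + W_s x, now shape-compatible
```

Without the projection, adding a (2, 4, 4) input to a (4, 2, 2) output is undefined; the 1×1 conv fixes the channel count and its stride fixes the spatial size.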