Describe the ResNet architecture and explain the key design choices that made it work.
ResNet (He et al., 2015) stacks residual blocks that add the input directly to the block's output, enabling stable training of 50–152+ layer networks. The key design choices are the residual shortcut for gradient flow, batch normalisation after each convolution, and bottleneck blocks (1x1 → 3x3 → 1x1) that control parameter count in deeper variants.
How to think about it
Know the variants by parameter count, understand the bottleneck block precisely, and be able to explain what changed compared to VGG — that progression shows architectural intuition.
Context: the problem ResNet solved
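Plain networks (VGG-style) hit a performance ceiling around 20–30 layers. Adding more layers made training error go up, which rules out overfitting as the cause: the optimiser struggled to find identity-like mappings through deep stacks of non-linearities. ResNet reframes the target: instead of learning the full mapping H(x), each block learns a residual F(x) = H(x) - x and outputs F(x) + x. If the best a block can do is pass its input through, the optimiser only has to drive F(x) towards zero, which is easier than fitting an identity with stacked non-linear layers.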
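The effect of the reformulation can be seen in a toy setting. The sketch below is an illustration only: it swaps the convolutions for small fully connected layers, leaves out batch normalisation, and uses hypothetical helper names (`plain_block`, `residual_block`). With all weights at zero, the plain block wipes out its input, while the residual block passes it through unchanged.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def plain_block(x, w1, w2):
    """Plain block: the layers must produce the whole mapping H(x)."""
    return relu(linear(w2, relu(linear(w1, x))))


def residual_block(x, w1, w2):
    """Residual block: the layers produce F(x); the shortcut adds x back."""
    f = linear(w2, relu(linear(w1, x)))
    return relu([fi + xi for fi, xi in zip(f, x)])


x = [0.5, 1.0, 2.0]                      # non-negative, as after a ReLU
zeros = [[0.0] * 3 for _ in range(3)]    # all-zero weights, i.e. F(x) = 0

print(plain_block(x, zeros, zeros))      # [0.0, 0.0, 0.0]: the signal is lost
print(residual_block(x, zeros, zeros))   # [0.5, 1.0, 2.0]: identity for free
```

In the residual form, "do nothing" corresponds to the easy-to-reach point F(x) = 0, so extra blocks should not make the training error worse than that of the shallower network.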
Residual block (ResNet-18 / ResNet-34)
    Input x --------------------------------+
        |                                   |
    Conv 3x3 → BN → ReLU                    |  shortcut: identity or 1x1 proj
        |                                   |
    Conv 3x3 → BN                           |
        |                                   |
       (+) <--------------------------------+
        |
       ReLU
        |
    Output y = ReLU(F(x) + x)
Two 3×3 convolutions, each followed by batch normalisation. The second ReLU is applied after the addition, not before it.
Bottleneck block (ResNet-50 / 101 / 152)
    Input x (256 ch) -----------------------------------+
        |                                               |
    Conv 1x1 → BN → ReLU   (256 → 64 ch)                |
        |                                               |
    Conv 3x3 → BN → ReLU   (64 → 64 ch, spatial work)   |  shortcut
        |                                               |
    Conv 1x1 → BN          (64 → 256 ch)                |
        |                                               |
       (+) <--------------------------------------------+
        |
       ReLU
Parameter count for one bottleneck (bias=False):
(1*1*256*64) + (3*3*64*64) + (1*1*64*256) = 16384 + 36864 + 16384 = 69,632
Compare with two 3×3 convs on 256 channels: 2 * (9*256*256) = 1,179,648, about 17× as many parameters (1,179,648 / 69,632 ≈ 16.9).
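The arithmetic above can be reproduced in a few lines. This is a minimal sketch that counts convolution weights only (bias=False, batch-norm parameters ignored); the function names are made up for illustration.

```python
def conv_params(kernel, c_in, c_out):
    """Weights of a kernel x kernel convolution with bias=False."""
    return kernel * kernel * c_in * c_out


def bottleneck_params(channels, width):
    """1x1 reduce -> 3x3 -> 1x1 expand, convolution weights only."""
    return (conv_params(1, channels, width)
            + conv_params(3, width, width)
            + conv_params(1, width, channels))


def two_conv3x3_params(channels):
    """Two 3x3 convolutions at full width, convolution weights only."""
    return 2 * conv_params(3, channels, channels)


bottleneck = bottleneck_params(256, 64)
full_width = two_conv3x3_params(256)
print(bottleneck)                          # 69632
print(full_width)                          # 1179648
print(round(full_width / bottleneck, 2))   # 16.94
```

The saving comes from running the expensive 3×3 convolution on 64 channels instead of 256; the two 1×1 convolutions that reduce and restore the width are cheap by comparison.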
Key architectural constants
| Variant | Blocks per stage | Params | ImageNet top-1 accuracy |
|---|---|---|---|
| ResNet-18 | [2,2,2,2] basic | 11.7 M | 69.8% |
| ResNet-50 | [3,4,6,3] bottleneck | 25.6 M | 76.1% |
| ResNet-101 | [3,4,23,3] bottleneck | 44.5 M | 77.4% |
| ResNet-152 | [3,8,36,3] bottleneck | 60.2 M | 78.3% |
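The parameter column can be rebuilt from the block layouts. The sketch below makes several assumptions that are not stated in the table: a 7×7, 64-channel stem on RGB input, stage widths of 64/128/256/512, a 1×1 projection shortcut only where the shape changes, two learnable parameters per batch-norm channel, and a 1000-class linear classifier with bias. All function names are hypothetical.

```python
def conv(k, c_in, c_out):
    """Weights of a k x k convolution with bias=False."""
    return k * k * c_in * c_out


def bn(c):
    """Learnable scale and shift of a batch-norm layer."""
    return 2 * c


def basic_block(c_in, width, stride):
    """Two 3x3 convs; returns (parameter count, output channels)."""
    c_out = width
    n = conv(3, c_in, width) + bn(width) + conv(3, width, c_out) + bn(c_out)
    if stride != 1 or c_in != c_out:          # projection shortcut
        n += conv(1, c_in, c_out) + bn(c_out)
    return n, c_out


def bottleneck_block(c_in, width, stride):
    """1x1 -> 3x3 -> 1x1 with 4x expansion; returns (count, output channels)."""
    c_out = 4 * width
    n = (conv(1, c_in, width) + bn(width)
         + conv(3, width, width) + bn(width)
         + conv(1, width, c_out) + bn(c_out))
    if stride != 1 or c_in != c_out:          # projection shortcut
        n += conv(1, c_in, c_out) + bn(c_out)
    return n, c_out


def resnet_params(block, blocks_per_stage, num_classes=1000):
    """Total learnable parameters under the assumptions listed above."""
    total = conv(7, 3, 64) + bn(64)           # stem
    c_in = 64
    for stage, (width, n_blocks) in enumerate(zip((64, 128, 256, 512),
                                                  blocks_per_stage)):
        for i in range(n_blocks):
            # Only the first block of stages 2-4 downsamples.
            stride = 2 if (i == 0 and stage > 0) else 1
            n, c_in = block(c_in, width, stride)
            total += n
    return total + c_in * num_classes + num_classes   # classifier


variants = {
    "ResNet-18": (basic_block, [2, 2, 2, 2]),
    "ResNet-50": (bottleneck_block, [3, 4, 6, 3]),
    "ResNet-101": (bottleneck_block, [3, 4, 23, 3]),
    "ResNet-152": (bottleneck_block, [3, 8, 36, 3]),
}
for name, (block, layout) in variants.items():
    print(f"{name}: {resnet_params(block, layout) / 1e6:.1f} M")
# ResNet-18: 11.7 M
# ResNet-50: 25.6 M
# ResNet-101: 44.5 M
# ResNet-152: 60.2 M
```

Under these assumptions the counts agree with the table to the precision shown.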
Downsampling strategy
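At the start of each stage after the first, the first block uses stride=2 to halve the spatial dimensions while the channel count doubles; the first stage keeps the resolution produced by the stem's max-pool. In basic blocks the stride sits in the first 3×3 conv. In bottleneck blocks the original paper puts it in the first 1×1 conv, whereas common implementations such as torchvision put it in the 3×3 conv. The shortcut uses a 1×1 conv with stride=2 (projection shortcut) to match the new shape.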
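The shape bookkeeping can be checked with the standard convolution output-size formula. The sketch below assumes a 224×224 input, a 7×7 stride-2 stem with padding 3, a 3×3 stride-2 max-pool with padding 1, and padding 1 on every 3×3 conv; none of these values are given in the text above, and `conv_out_size` is a hypothetical helper.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output height/width of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# Main path (3x3, stride 2, padding 1) and projection shortcut
# (1x1, stride 2, no padding) must agree, or the addition would fail.
main_path = conv_out_size(56, kernel=3, stride=2, padding=1)
shortcut = conv_out_size(56, kernel=1, stride=2, padding=0)
print(main_path, shortcut)    # 28 28

# Feature-map size entering each of the four stages.
size = conv_out_size(224, kernel=7, stride=2, padding=3)    # stem conv
size = conv_out_size(size, kernel=3, stride=2, padding=1)   # max-pool
sizes = [size]
for _ in range(3):            # three stride-2 stage transitions
    size = conv_out_size(size, kernel=3, stride=2, padding=1)
    sizes.append(size)
print(sizes)                  # [56, 28, 14, 7]
```

Both branches land on the same spatial size, which is what lets the element-wise addition at the end of a downsampling block go through.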