datarekha
Deep Learning · Medium · Asked at Google, Meta, NVIDIA, Microsoft, Amazon, Tesla

Describe the ResNet architecture and explain the key design choices that made it work.

The short answer

ResNet (He et al., 2015) stacks residual blocks that add the input directly to the block's output, enabling stable training of 50–152+ layer networks. The key design choices are the residual shortcut for gradient flow, batch normalisation after each convolution, and bottleneck blocks (1x1 → 3x3 → 1x1) that control parameter count in deeper variants.

How to think about it

Know the variants by parameter count, understand the bottleneck block precisely, and be able to explain what changed compared to VGG — that progression shows architectural intuition.

Context: the problem ResNet solved

Plain (VGG-style) networks hit a performance ceiling around 20–30 layers. Adding more layers made training error go up, not because of overfitting but because the optimiser struggled to find identity-like mappings through deep stacks of non-linearities. ResNet reframes the target: instead of learning the full mapping H(x), each block learns a residual F(x) = H(x) - x and outputs F(x) + x. If the best an extra block can do is pass its input through, it only has to drive F(x) towards zero.
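The effect on gradient flow follows from the chain rule. A short derivation, ignoring the ReLU after the addition for simplicity and writing \mathcal{L} for the loss (a symbol introduced here, with gradients treated as row vectors):

```latex
y = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}
  \left( I + \frac{\partial F}{\partial x} \right)
= \frac{\partial \mathcal{L}}{\partial y}
  + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
```

The first term carries the upstream gradient through the shortcut unchanged, so the signal reaching earlier layers does not depend solely on the product of many small Jacobians.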

Residual block (ResNet-18 / ResNet-34)

Input x
  |
Conv 3x3 → BN → ReLU
  |
Conv 3x3 → BN
  |        ↑
  +--------+ (shortcut: identity or 1x1 proj)
  |
ReLU
→ output y = ReLU(F(x) + x)

Two 3×3 convolutions, each followed by batch normalisation. The final ReLU is applied after the addition, not before.
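The order of operations can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not a real convolutional block: each "Conv 3x3 → BN" pair is replaced by a small matrix-vector product (the hypothetical helper `linear`), and the input is a flat list of floats. It only illustrates where the addition and the final ReLU sit.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def linear(weights: List[Vector]) -> Layer:
    """Stand-in for Conv 3x3 -> BN (assumption): a matrix-vector product."""
    def apply(v: Vector) -> Vector:
        return [sum(w * a for w, a in zip(row, v)) for row in weights]
    return apply


def basic_block(x: Vector, layer1: Layer, layer2: Layer) -> Vector:
    out = relu(layer1(x))                   # Conv 3x3 -> BN -> ReLU
    out = layer2(out)                       # Conv 3x3 -> BN (no ReLU yet)
    out = [f + s for f, s in zip(out, x)]   # F(x) + x, identity shortcut
    return relu(out)                        # ReLU after the addition


# With all-zero weights F(x) = 0, so the block passes a non-negative input through.
zeros = [[0.0, 0.0, 0.0] for _ in range(3)]
x = [1.0, 2.0, 3.0]
y = basic_block(x, linear(zeros), linear(zeros))
print(y)  # [1.0, 2.0, 3.0]
```

The zero-weight case shows why extra residual blocks are easy to "switch off": the identity behaviour is the default rather than something the optimiser has to discover.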

Bottleneck block (ResNet-50 / 101 / 152)

Input x (256 ch)
  |
Conv 1x1 → BN → ReLU  (256 → 64 ch)
  |
Conv 3x3 → BN → ReLU  (64 → 64 ch, spatial work)
  |
Conv 1x1 → BN          (64 → 256 ch)
  |        ↑
  +--------+ shortcut
  |
ReLU

Parameter count for one bottleneck (bias=False):

(1*1*256*64) + (3*3*64*64) + (1*1*64*256) = 16384 + 36864 + 16384 = 69,632

Compare with two 3×3 convs on 256 channels: 2 * (9*256*256) = 1,179,648, about 17× as many parameters.
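The arithmetic above is easy to reproduce. A minimal sketch, assuming bias-free convolutions and ignoring batch-norm parameters as the figures above do; `conv_params` is a helper name introduced here for illustration:

```python
def conv_params(kernel: int, in_ch: int, out_ch: int) -> int:
    """Number of weights in a kernel x kernel convolution with no bias."""
    return kernel * kernel * in_ch * out_ch


# Bottleneck: 1x1 reduce, 3x3 at reduced width, 1x1 expand.
bottleneck = (
    conv_params(1, 256, 64)
    + conv_params(3, 64, 64)
    + conv_params(1, 64, 256)
)

# Alternative: two 3x3 convolutions at the full 256 channels.
two_3x3 = 2 * conv_params(3, 256, 256)

ratio = two_3x3 / bottleneck
print(bottleneck, two_3x3, round(ratio, 1))  # 69632 1179648 16.9
```

Changing the channel arguments makes it quick to check the same trade-off at the wider stages of the network.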

Key architectural constants

Variant      Blocks per stage         Params    Top-1 ImageNet
ResNet-18    [2,2,2,2] basic          11.7 M    69.8%
ResNet-50    [3,4,6,3] bottleneck     25.6 M    76.1%
ResNet-101   [3,4,23,3] bottleneck    44.5 M    77.4%
ResNet-152   [3,8,36,3] bottleneck    60.2 M    78.3%
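The variant names follow from the block counts. A small sketch, assuming the usual counting convention: one stem convolution, two convolutions per basic block or three per bottleneck block, and one final fully connected layer, with projection shortcuts not counted. `CONFIGS` and `depth` are names introduced here for illustration.

```python
# (blocks per stage, weighted layers per block) for each variant in the table.
CONFIGS = {
    "ResNet-18": ([2, 2, 2, 2], 2),     # basic block: two 3x3 convs
    "ResNet-50": ([3, 4, 6, 3], 3),     # bottleneck: 1x1, 3x3, 1x1
    "ResNet-101": ([3, 4, 23, 3], 3),
    "ResNet-152": ([3, 8, 36, 3], 3),
}


def depth(blocks: list, convs_per_block: int) -> int:
    """Stem conv + convs inside the residual blocks + final FC layer."""
    return 1 + convs_per_block * sum(blocks) + 1


for name, (blocks, convs) in CONFIGS.items():
    print(name, depth(blocks, convs))
# ResNet-18 18
# ResNet-50 50
# ResNet-101 101
# ResNet-152 152
```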

Downsampling strategy

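At the start of each stage after the first, the first block uses stride=2 in a 3×3 conv to halve the spatial dimensions; the first stage keeps the resolution it receives from the stem's max pool. In basic blocks the strided layer is the first 3×3 conv. In bottleneck blocks it is the 3×3 conv in the widely used v1.5 variant (for example torchvision), while the original paper strides the first 1×1 conv. The shortcut uses a 1×1 conv with stride=2 (projection shortcut) to match the new shape.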
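The main path and the shortcut must agree on the output size for the addition to work. A minimal check using the standard convolution output-size formula, assuming padding=1 for the 3×3 conv, padding=0 for the 1×1 projection, and a 56×56 feature map entering the residual stages (the usual size for a 224×224 ImageNet input), followed by three stride-2 transitions:

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


size = 56            # assumed feature-map size entering the first stage
sizes = [size]
for _ in range(3):   # three stride-2 stage transitions
    main = conv_out(size, kernel=3, stride=2, padding=1)      # strided 3x3 conv
    shortcut = conv_out(size, kernel=1, stride=2, padding=0)  # 1x1 projection
    assert main == shortcut, "shapes must match before the addition"
    size = main
    sizes.append(size)

print(sizes)  # [56, 28, 14, 7]
```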
