What is pooling, and when would you choose max pooling over average pooling?
Pooling downsamples a feature map by aggregating values in a local window, reducing spatial dimensions and building position tolerance. Max pooling takes the strongest activation in each window; average pooling takes the mean. Max pooling dominates in classification backbones; average pooling is preferred for global summarisation and wherever a smooth feature map is wanted.
How to think about it
Don’t just define them — give the intuition for when each wins, then mention global average pooling as a modern replacement for the fully-connected head.
What pooling does
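After a conv layer produces a feature map, pooling applies a reduction over local windows. The windows are usually non-overlapping (stride equal to the window size), but they overlap when the stride is smaller than the window. For a 2×2 pool with stride 2 on an H x W map: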
- Output size: H/2 x W/2
- No learned parameters
Pooling achieves two things: it compresses the representation (fewer computations downstream) and introduces a degree of local translation tolerance (the feature still fires if the pattern shifts by one pixel inside the window).
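Both effects can be checked in a few lines of NumPy. This is a minimal sketch, not a framework implementation: the helper name `pool2x2` is hypothetical, and it assumes a single 2-D map with even height and width.

```python
import numpy as np

def pool2x2(x, reduce=np.max):
    """2x2, stride-2 pooling of a 2-D map whose H and W are even."""
    h, w = x.shape
    # View the map as (H/2, 2, W/2, 2) blocks, then reduce each 2x2 block.
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return reduce(blocks, axis=(1, 3))

feature_map = np.zeros((4, 4))
feature_map[0, 0] = 1.0   # a single detection in the top-left window

shifted = np.zeros((4, 4))
shifted[0, 1] = 1.0       # the same detection, moved one pixel to the right

pooled = pool2x2(feature_map)
pooled_shifted = pool2x2(shifted)

print(pooled.shape)                             # (2, 2)
print(np.array_equal(pooled, pooled_shifted))   # True
```

The shift stays inside the same 2×2 window, so the pooled output is unchanged. A shift that crosses a window boundary would move the activation to a neighbouring output cell, which is why the tolerance is only local.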
Max pooling
output = max(values in window)
Preserves the strongest signal. If any position in the window detected the feature, the map says “yes, feature present here.” This is ideal for classification tasks where you care whether a pattern exists, not its exact sub-window location.
Average pooling
output = mean(values in window)
Spreads the signal uniformly. Useful when you want a smooth, holistic summary of a region rather than the sharpest activation.
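A small numeric contrast between the two reductions may help. The window values below are made up for illustration: a single strong activation surrounded by zeros.

```python
import numpy as np

# One strong activation in an otherwise silent 2x2 window (illustrative values).
window = np.array([[0.0, 0.0],
                   [0.0, 8.0]])

max_out = window.max()    # the peak survives intact
avg_out = window.mean()   # the peak is diluted by the three zeros

print(max_out)  # 8.0
print(avg_out)  # 2.0
```

Max pooling reports the peak regardless of how quiet the rest of the window is, while average pooling reports the overall activation level of the region.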
Global average pooling (GAP) collapses an entire H x W feature map to a single scalar per channel: output[c] = mean over all H*W positions for channel c. Modern architectures (ResNet, MobileNet) use GAP instead of flattening into a dense layer, drastically cutting parameters and regularising against overfitting.
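The parameter saving can be estimated with simple arithmetic. The sketch below assumes a final feature map of 512 channels at 7 x 7 and a 1000-class linear classifier in both cases; these numbers are illustrative assumptions, not taken from any specific model, and biases are ignored.

```python
import numpy as np

channels, height, width = 512, 7, 7   # assumed final feature-map shape
num_classes = 1000                    # assumed number of output classes

rng = np.random.default_rng(0)
features = rng.standard_normal((channels, height, width))

# Global average pooling: one scalar per channel.
gap = features.mean(axis=(1, 2))
print(gap.shape)  # (512,)

# Classifier weights (biases ignored) for each kind of head.
flatten_head_weights = channels * height * width * num_classes
gap_head_weights = channels * num_classes

print(flatten_head_weights)  # 25088000
print(gap_head_weights)      # 512000
```

Under these assumptions the GAP head needs H*W = 49 times fewer classifier weights than the flatten-then-dense head.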
Comparison
| | Max pooling | Average pooling |
|---|---|---|
| Preserves | Strongest activations | Overall activation level |
| Good for | Detecting whether a feature is present | Smooth features, GAP |
| Gradients | Only winner gets gradient | Spread across all positions |
| Common use | VGG, early ResNets | GAP head, inception modules |
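The gradient row can be illustrated by hand for a single 2×2 window. This is a sketch of the backward pass written out manually rather than taken from any framework; the window values and the upstream gradient of 1.0 are assumed for illustration.

```python
import numpy as np

window = np.array([[1.0, 3.0],
                   [2.0, 0.5]])
upstream = 1.0  # gradient arriving at the pooled output (assumed value)

# Max pooling: the entire gradient is routed to the argmax position.
max_grad = np.zeros_like(window)
winner = np.unravel_index(window.argmax(), window.shape)
max_grad[winner] = upstream

# Average pooling: every position receives an equal share.
avg_grad = np.full_like(window, upstream / window.size)

print(max_grad)  # 1.0 at position (0, 1), zeros elsewhere
print(avg_grad)  # 0.25 at every position
```

With max pooling, positions that lose the comparison receive no learning signal on that step; with average pooling, every position is updated, but each by a smaller amount.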