Why do CNNs outperform fully-connected networks on image data?
CNNs exploit three structural properties of images — local correlation, translation invariance, and compositional hierarchy — through parameter sharing and local receptive fields. A dense network treats every pixel as independent, ignoring spatial structure and requiring orders of magnitude more parameters.
How to think about it
Three concrete arguments win this question: parameter count, inductive bias, and generalisation. Walk through all three.
1. Parameter sharing
In a dense layer connecting a 224 x 224 x 3 image to 1 000 hidden units, you need 224*224*3*1000 = 150,528,000 weights — just for the first layer. A conv layer with 64 filters of size 3 x 3 x 3 needs only (3*3*3 + 1)*64 = 1,792 parameters, and those same weights detect the same feature everywhere in the image.
2. Local receptive fields
Adjacent pixels are strongly correlated; distant ones often aren’t. Dense layers must learn this correlation from scratch, spending capacity on useless long-range connections. A 3×3 kernel only connects to a 3×3 neighborhood, hardwiring the right inductive bias into the architecture.
3. Translation invariance (approximate)
Because the same kernel is applied at every position, a detector that fires on a horizontal edge fires regardless of whether the edge sits at the top or bottom of the image. Spatial pooling after each conv block makes this invariance even stronger. Dense networks have no such built-in invariance — they must learn it, requiring far more data.
4. Hierarchical feature learning
Stacked conv layers build a feature hierarchy: edges → textures → parts → objects. Each layer sees the composed features from the layer below. Dense networks have no spatial compositionality; every layer must re-encode spatial structure from scratch.
| Property | Dense net | CNN |
|---|---|---|
| Parameters (first layer, 224×224×3 input) | ~150 M | ~1 800 |
| Spatial locality | None (fully connected) | Built in via kernel size |
| Translation invariance | Learned, data-hungry | Structural |
| Depth efficiency | Low | High (hierarchy) |