datarekha
Deep Learning · Easy · Asked at Google, Meta, Amazon, Microsoft

Why do CNNs outperform fully-connected networks on image data?

The short answer

CNNs exploit three structural properties of images (local correlation, translation invariance, and compositional hierarchy) through parameter sharing and local receptive fields. A dense network treats every pixel as an unrelated input feature, so it ignores spatial structure and needs orders of magnitude more parameters.

How to think about it

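Four concrete arguments win this question: parameter sharing, local receptive fields, translation invariance, and hierarchical feature learning. The first keeps the parameter count down; the other three are inductive biases that improve generalisation. Walk through all four.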

1. Parameter sharing

In a dense layer connecting a 224×224×3 image to 1,000 hidden units, you need 224 × 224 × 3 × 1,000 = 150,528,000 weights (plus 1,000 biases) for the first layer alone. A conv layer with 64 filters of size 3×3×3 needs only (3 × 3 × 3 + 1) × 64 = 1,792 parameters, biases included, whatever the image size, and those same weights detect the same feature everywhere in the image.
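The arithmetic above can be checked in a few lines of plain Python. This is a minimal sketch: the helper names `dense_params` and `conv2d_params` are illustrative and do not come from any library.

```python
def dense_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weights (plus optional biases) of a fully-connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def conv2d_params(in_channels: int, out_channels: int, kernel_size: int, bias: bool = True) -> int:
    """Weights (plus optional biases) of a square-kernel 2D conv layer."""
    per_filter = kernel_size * kernel_size * in_channels + (1 if bias else 0)
    return per_filter * out_channels


height, width, channels = 224, 224, 3

dense_weights = dense_params(height * width * channels, 1000, bias=False)
conv_total = conv2d_params(channels, 64, kernel_size=3)

print(dense_weights)                # 150528000
print(conv_total)                   # 1792
print(dense_weights // conv_total)  # 84000
```

Note that `conv2d_params` never sees `height` or `width`: the conv layer's size is independent of the image resolution, while the dense layer grows linearly with the pixel count.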

2. Local receptive fields

Adjacent pixels are strongly correlated; distant ones often aren't. A dense layer has to discover this from data, and it spends most of its capacity on long-range connections that carry little signal in the first layers. A 3×3 kernel connects only to a 3×3 neighbourhood, hardwiring the right inductive bias into the architecture.
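One way to see the locality constraint is to write a convolution as a dense weight matrix. The sketch below does this in 1D with plain Python, for a hypothetical length-8 signal and a size-3 kernel (both chosen only for illustration). Each output row has just three non-zero weights, and every row reuses the same three values.

```python
def conv1d_as_matrix(kernel, input_len):
    """Dense weight matrix equivalent to a 'valid' 1D cross-correlation
    (the operation conv layers compute)."""
    k = len(kernel)
    out_len = input_len - k + 1
    matrix = [[0.0] * input_len for _ in range(out_len)]
    for row in range(out_len):
        for offset, w in enumerate(kernel):
            matrix[row][row + offset] = w  # same weights, shifted one step per row
    return matrix


kernel = [1.0, -2.0, 1.0]  # illustrative values
W = conv1d_as_matrix(kernel, input_len=8)

total_entries = sum(len(row) for row in W)               # what a dense layer would learn
nonzero_entries = sum(1 for row in W for v in row if v)  # connections the conv actually uses
free_parameters = len(kernel)                            # distinct learnable values

print(total_entries, nonzero_entries, free_parameters)  # 48 18 3
```

The zeros are the local receptive field; the repeated values are the parameter sharing. A dense layer of the same shape would have to learn all 48 entries independently.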

3. Translation invariance (approximate)

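Because the same kernel is applied at every position, a detector that fires on a horizontal edge fires whether the edge sits at the top or the bottom of the image; the response simply moves with the edge. Strictly speaking, this is translation equivariance. Spatial pooling then discards some of that position information, which is what makes the output approximately invariant to small shifts. Dense networks have no such built-in property: they must learn it from examples of the pattern at every position, which requires far more data.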
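The difference between "the response moves with the edge" and "the pooled output stays the same" can be shown with a small pure-Python sketch. The 8×8 image size, the edge positions, and the helper names are assumptions made for illustration, and global max pooling stands in for the smaller pooling windows a real network would typically use.

```python
def correlate2d(image, kernel):
    """'Valid' 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def place_edge(size, row):
    """Image with a horizontal dark-to-bright edge: rows >= `row` are 1."""
    return [[1.0 if r >= row else 0.0 for _ in range(size)] for r in range(size)]


def active_rows(fmap):
    """Rows of the feature map where the detector fires."""
    return [i for i, row in enumerate(fmap) if max(row) > 0]


def global_max_pool(fmap):
    return max(max(row) for row in fmap)


# Horizontal-edge detector: bright below, dark above.
kernel = [[-1, -1, -1],
          [ 0,  0,  0],
          [ 1,  1,  1]]

top = correlate2d(place_edge(8, row=2), kernel)     # edge near the top
bottom = correlate2d(place_edge(8, row=5), kernel)  # same edge, 3 rows lower

print(active_rows(top))     # [0, 1]
print(active_rows(bottom))  # [3, 4]
print(global_max_pool(top) == global_max_pool(bottom))  # True
```

The feature map shifts by the same three rows as the input (equivariance), while the pooled value is identical for both images (invariance).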

4. Hierarchical feature learning

Stacked conv layers build a feature hierarchy: edges → textures → parts → objects. Each layer combines the features from the layer below over a progressively larger receptive field. A dense network can stack layers too, but nothing in its architecture ties a unit to a region of the image, so any spatial hierarchy has to be learned from data.
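The hierarchy is possible because each stacked layer sees a larger patch of the input. A minimal sketch of the standard receptive-field recurrence follows; the six-layer stack is a hypothetical example, not a specific published architecture, and dilation is assumed to be 1.

```python
def receptive_fields(layers):
    """Receptive-field size (input pixels along one axis) after each layer.

    `layers` is a list of (kernel_size, stride) pairs. Padding is ignored
    because it does not change the receptive-field size.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


# Hypothetical stack: two 3x3 convs, 2x2 pool, two 3x3 convs, 2x2 pool.
stack = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]

print(receptive_fields(stack))  # [3, 5, 6, 10, 14, 16]
```

Early units see only a few pixels, enough for an edge, while later units cover a 16-pixel span built entirely from 3×3 and 2×2 operations.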

| Property | Dense net | CNN |
|---|---|---|
| Parameters (first layer, 224×224×3 input) | ~150 M | ~1,800 |
| Spatial locality | None (fully connected) | Built in via kernel size |
| Translation invariance | Learned, data-hungry | Structural |
| Depth efficiency | Low | High (hierarchy) |