Deep Learning · Easy · Asked at Google, Meta, Amazon, Microsoft

Why do CNNs outperform fully-connected networks on image data?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

CNNs exploit three structural properties of images (local correlation, translation invariance, and compositional hierarchy) through parameter sharing and local receptive fields. A dense network gives every pixel its own separate weights and has no notion of which pixels are adjacent, so it ignores spatial structure and needs orders of magnitude more parameters.

How to think about it

Three themes win this question: parameter count, inductive bias, and generalisation. The four points below cover all three.

1. Parameter sharing

In a dense layer connecting a 224 × 224 × 3 image to 1,000 hidden units, you need 224 × 224 × 3 × 1,000 = 150,528,000 weights (biases excluded) for the first layer alone. A conv layer with 64 filters of size 3 × 3 × 3 needs only (3 × 3 × 3 + 1) × 64 = 1,792 parameters, biases included, and those same weights detect the same feature everywhere in the image.
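The arithmetic above can be checked in a few lines of Python. This is a minimal sketch: the helper names `dense_params` and `conv_params` are illustrative, not taken from any library, and the dense count follows the text in leaving out biases.

```python
def dense_params(in_features, out_features, bias=False):
    """Weights (and optionally biases) in a fully-connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def conv_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Weights (and optionally biases) in a 2-D conv layer.

    Note that the image height and width do not appear anywhere:
    the same filters are reused at every spatial position.
    """
    per_filter = kernel_h * kernel_w * in_channels + (1 if bias else 0)
    return per_filter * out_channels


dense = dense_params(224 * 224 * 3, 1000)  # first dense layer from the text
conv = conv_params(3, 3, 3, 64)            # 64 filters of size 3 x 3 x 3

print(f"dense: {dense:,}")           # dense: 150,528,000
print(f"conv:  {conv:,}")            # conv:  1,792
print(f"ratio: {dense // conv:,}x")  # ratio: 84,000x
```

Doubling the image resolution would quadruple the dense count but leave the conv count unchanged, which is the practical meaning of parameter sharing.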

2. Local receptive fields

Adjacent pixels are strongly correlated; distant ones often aren’t. Dense layers must learn this correlation from scratch, spending capacity on long-range connections that mostly carry little signal. Each output of a 3×3 kernel connects only to a 3×3 neighborhood, hardwiring the right inductive bias into the architecture.
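To make the locality argument concrete, the sketch below lists which input pixels each output unit of a 3×3 convolution reads, using a small single-channel 6×6 image. The function name `conv_connectivity` and the image size are assumptions chosen for illustration; no padding and stride 1 are assumed.

```python
def conv_connectivity(height, width, k=3):
    """Map each output position of an unpadded k x k convolution (stride 1)
    to the set of flattened input-pixel indices it reads."""
    conn = {}
    for i in range(height - k + 1):
        for j in range(width - k + 1):
            conn[(i, j)] = {
                (i + di) * width + (j + dj)
                for di in range(k)
                for dj in range(k)
            }
    return conn


H, W = 6, 6
conn = conv_connectivity(H, W)

# Every conv output unit reads exactly 9 pixels, all within a 3 x 3 patch.
inputs_per_conv_unit = {len(pixels) for pixels in conn.values()}
# A dense unit on the same image reads every pixel.
inputs_per_dense_unit = H * W

print(inputs_per_conv_unit)    # {9}
print(inputs_per_dense_unit)   # 36
print(sorted(conn[(0, 0)]))    # [0, 1, 2, 6, 7, 8, 12, 13, 14]

# On a 224 x 224 channel, one conv unit touches a tiny fraction of the pixels.
print(f"{9 / (224 * 224):.4%}")  # 0.0179%
```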

3. Translation invariance (approximate)

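Because the same kernel is applied at every position, a detector that fires on a horizontal edge fires wherever the edge appears, whether at the top or the bottom of the image; the response simply moves with the edge (strictly speaking, this is translation equivariance). Spatial pooling after conv blocks then discards some of that position information, which is what makes the output approximately invariant. Dense networks have no such built-in property; they must learn it from examples of the feature at each position, which requires far more data.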
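The edge example above can be reproduced with a hand-written convolution. This is a minimal sketch in pure Python under stated assumptions: an 8×8 binary image, a hand-picked 3×3 horizontal-edge kernel, no padding, stride 1, and global max pooling standing in for the pooling stage. All function names are illustrative.

```python
def conv2d_valid(image, kernel):
    """Cross-correlation (what deep-learning conv layers compute),
    no padding, stride 1, on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def image_with_edge(row, size=8):
    """Square image that is 0 above `row` and 1 from `row` downwards."""
    return [[1 if r >= row else 0 for _ in range(size)] for r in range(size)]


def active_rows(fmap):
    """Rows of the feature map where the detector fires."""
    return [i for i, values in enumerate(fmap) if max(values) > 0]


def global_max_pool(fmap):
    """Collapse the whole feature map to its strongest response."""
    return max(max(values) for values in fmap)


# Horizontal-edge detector: bottom row minus top row.
kernel = [[-1, -1, -1],
          [0, 0, 0],
          [1, 1, 1]]

top = conv2d_valid(image_with_edge(2), kernel)     # edge near the top
bottom = conv2d_valid(image_with_edge(5), kernel)  # same edge, 3 rows lower

print(active_rows(top))      # [0, 1]
print(active_rows(bottom))   # [3, 4]
print(global_max_pool(top), global_max_pool(bottom))  # 3 3
```

The feature map shifts by the same three rows as the edge (equivariance), while the pooled value is identical in both cases (invariance).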

4. Hierarchical feature learning

Stacked conv layers build a feature hierarchy: edges → textures → parts → objects. Each layer sees the composed features from the layer below. Dense networks have no spatial compositionality; every layer must re-encode spatial structure from scratch.
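One way to see why deeper layers can represent larger structures is to track how the receptive field grows with depth. The sketch below uses the standard one-dimensional recursion for receptive-field size; the function name `receptive_field` and the example layer stacks are assumptions for illustration, not a specific published architecture.

```python
def receptive_field(layers):
    """Receptive-field size along one axis, in input pixels, after each layer.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf = 1    # a raw input pixel sees only itself
    jump = 1  # distance in input pixels between neighbouring units
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


# Three stacked 3x3 convolutions, stride 1.
plain = receptive_field([(3, 1)] * 3)

# Three blocks of (3x3 conv, stride 1) followed by (2x2 pool, stride 2).
blocks = receptive_field([(3, 1), (2, 2)] * 3)

print(plain)   # [3, 5, 7]
print(blocks)  # [3, 4, 8, 10, 18, 22]
```

Early units see only a few pixels, enough for an edge, while later units cover patches large enough to hold textures and parts, which matches the edges → textures → parts → objects progression.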

Property	Dense net	CNN
Parameters (first layer, 224×224×3 input)	~150 M	~1,800
Spatial locality	None (fully connected)	Built in via kernel size
Translation invariance	Learned, data-hungry	Structural
Depth efficiency	Low	High (hierarchy)
