What is a 1x1 convolution and why is it useful?
A 1x1 convolution applies a learned linear combination across channels at each spatial position, without looking at any spatial neighbourhood. It is used to change the number of channels cheaply, to add a cheap non-linear layer when followed by an activation, and to build the bottleneck blocks at the core of Inception and the deeper ResNets (ResNet-50 and up).
How to think about it
The spatial-vs-channel distinction is the key insight. Once you explain it correctly, show the parameter saving in a bottleneck — that’s what interviewers remember.
What it computes
At every spatial position (i, j), a 1×1 conv computes:
output[i, j, f] = (sum over c of input[i, j, c] * W[f, c]) + b[f]
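It mixes information across the channel dimension only: no neighbourhood, no spatial learning. The same weights W and bias b are shared by every position, so the result is a learned projection of the channel vector at each pixel, equivalent to applying one small fully connected layer per pixel.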
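As a minimal sketch of the formula above, the following pure-Python function applies a 1×1 conv to an input stored as nested lists in height × width × channel order. The name conv1x1, the nested-list layout, and the example weights are illustrative assumptions, not any particular library's API.

```python
def conv1x1(x, W, b):
    """Apply a 1x1 convolution.

    x: nested list of shape [H][W][C_in]
    W: nested list of shape [C_out][C_in]
    b: list of length C_out
    Returns a nested list of shape [H][W][C_out].
    """
    return [
        [
            # Dot product of this pixel's channel vector with each filter, plus bias.
            [sum(xc * wc for xc, wc in zip(pixel, W[f])) + b[f] for f in range(len(W))]
            for pixel in row
        ]
        for row in x
    ]


# 2x2 image with 3 channels, projected down to 2 channels.
x = [
    [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
    [[1.0, 1.0, 1.0], [2.0, 0.0, -1.0]],
]
W = [
    [1.0, 0.0, 0.0],  # output channel 0 copies input channel 0
    [0.5, 0.5, 0.5],  # output channel 1 is half the channel sum
]
b = [0.0, 1.0]

y = conv1x1(x, W, b)
print(y[0][0])  # [1.0, 4.0]
```

Each output pixel depends only on the input pixel at the same position, which is why the spatial size is unchanged while the channel count goes from 3 to 2.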
Parameter count
params = (1 * 1 * C_in + 1) * C_out = (C_in + 1) * C_out
This is dramatically cheaper than a 3×3 conv: for C_in=256, C_out=256, a 3×3 costs (9*256+1)*256 ≈ 590 K parameters; a 1×1 costs only (256+1)*256 ≈ 66 K, about 9× fewer.
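The arithmetic can be checked with a small helper. This is a sketch: conv_params is a hypothetical name, and it assumes one bias per output channel, as in the formula above.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Parameter count of a k x k convolution layer."""
    return (k * k * c_in + (1 if bias else 0)) * c_out


p3 = conv_params(3, 256, 256)  # 3x3 conv: 590080
p1 = conv_params(1, 256, 256)  # 1x1 conv: 65792
print(p3, p1)
```

The ratio p3 / p1 is just under 9, because the kernel area drops from 9 weights per input channel to 1.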
Use cases
Channel bottleneck (ResNet-50, Inception)
Compress channels before an expensive 3×3 conv, then expand again:
256ch → 1x1 → 64ch → 3x3 → 64ch → 1x1 → 256ch
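Parameters for the 3×3 drop from (9*256+1)*256 ≈ 590 K to (9*64+1)*64 ≈ 37 K, roughly 16× cheaper. Counting the two 1×1 convs as well ((256+1)*64 ≈ 16 K and (64+1)*256 ≈ 17 K), the whole block costs about 70 K parameters, still roughly 8× fewer than a single 256-channel 3×3.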
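To tally the whole block, including the two 1×1 convs, a short sketch follows. It assumes every conv carries a bias so the numbers match the formula in this article; published ResNet implementations typically drop the bias because batch normalisation follows each conv. The helper name conv_params is hypothetical.

```python
def conv_params(k, c_in, c_out):
    """Parameter count of a k x k convolution with bias."""
    return (k * k * c_in + 1) * c_out


bottleneck = [
    conv_params(1, 256, 64),  # 1x1 reduce: 16448
    conv_params(3, 64, 64),   # 3x3 at reduced width: 36928
    conv_params(1, 64, 256),  # 1x1 expand: 16640
]
bottleneck_total = sum(bottleneck)    # 70016
plain_3x3 = conv_params(3, 256, 256)  # 590080
print(bottleneck_total, plain_3x3)
```

So the three-layer bottleneck uses about 70 K parameters against about 590 K for one full-width 3×3, a saving of roughly 8×.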
Projection shortcuts
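In ResNet, when a block changes the channel count or the spatial resolution, a 1×1 conv on the skip path (using the same stride as the block) reshapes the shortcut so that it has the same dimensions as the block output before the two are added.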
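The shape alignment can be sketched in pure Python. Everything here is illustrative: the function names, the tiny tensor sizes, and the projection weights W_proj are assumptions chosen to keep the example readable, and bias is omitted.

```python
def conv1x1_strided(x, W, stride=1):
    """1x1 conv without bias over x[H][W][C_in] with weights W[C_out][C_in],
    sampling every `stride`-th pixel in both spatial directions."""
    return [
        [[sum(xc * wc for xc, wc in zip(pixel, w_f)) for w_f in W]
         for pixel in row[::stride]]
        for row in x[::stride]
    ]


def add(a, b):
    """Element-wise sum of two [H][W][C] tensors of identical shape."""
    return [
        [[u + v for u, v in zip(pa, pb)] for pa, pb in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


# Skip input: 4x4 spatial, 2 channels.
skip = [[[1.0, 2.0] for _ in range(4)] for _ in range(4)]
# Output of the residual branch: 2x2 spatial, 3 channels (downsampled and widened).
branch = [[[0.5, 0.5, 0.5] for _ in range(2)] for _ in range(2)]

# Hypothetical 2 -> 3 channel projection weights.
W_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

shortcut = conv1x1_strided(skip, W_proj, stride=2)  # now 2x2x3, same shape as branch
out = add(branch, shortcut)
print(out[0][0])  # [1.5, 2.5, 3.5]
```

Without the projection, the 4×4×2 skip tensor could not be added to the 2×2×3 branch output; the strided 1×1 conv fixes both the channel count and the spatial size in one step.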
Pointwise mixing in depthwise-separable convolutions
MobileNet splits a standard conv into: (1) a depthwise conv that processes each channel independently, then (2) a 1×1 pointwise conv that mixes channels. The 1×1 step recovers cross-channel expressiveness at minimal cost.
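A rough parameter comparison shows where the saving comes from. This is a sketch under stated assumptions: a 3×3 kernel, 256 input and output channels (reusing the sizes from the parameter-count section), a bias on every layer, and hypothetical function names.

```python
def standard_conv_params(k, c_in, c_out):
    """k x k convolution that mixes space and channels in one step."""
    return (k * k * c_in + 1) * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Return (depthwise, pointwise) parameter counts."""
    depthwise = (k * k + 1) * c_in  # one k x k filter plus bias per input channel
    pointwise = (c_in + 1) * c_out  # 1x1 conv that mixes channels
    return depthwise, pointwise


dw, pw = depthwise_separable_params(3, 256, 256)  # 2560, 65792
standard = standard_conv_params(3, 256, 256)      # 590080
print(dw, pw, dw + pw, standard)
```

Almost all of the separable layer's parameters sit in the 1×1 pointwise step, yet the total (68352) is still well under an eighth of the standard 3×3.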
Non-linearity injection
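A 1×1 conv followed by ReLU adds a cheap non-linear transformation of the channel vector at each pixel, increasing the network’s representational power without a larger kernel or a wider receptive field. The activation matters: two stacked 1×1 convs with nothing between them collapse into a single linear map.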
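A small per-pixel example illustrates why the ReLU is needed. The weights W1 and W2 and the helper names are assumptions picked so the result is easy to verify by hand; bias is omitted for brevity.

```python
def linear(W, v):
    """Per-pixel 1x1 conv without bias: a matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(v):
    return [max(0.0, x) for x in v]


def matmul(A, B):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


W1 = [[1.0, -1.0], [-1.0, 1.0]]  # first 1x1 conv: 2 -> 2 channels
W2 = [[1.0, 1.0]]                # second 1x1 conv: 2 -> 1 channel
v = [3.0, 1.0]                   # channel vector at one pixel

# Without an activation, the stack equals one 1x1 conv with weights W2 @ W1.
stacked_linear = linear(W2, linear(W1, v))
collapsed = linear(matmul(W2, W1), v)

# With ReLU in between, the stack computes |v[0] - v[1]|, which is not linear in v.
stacked_relu = linear(W2, relu(linear(W1, v)))
print(stacked_linear, collapsed, stacked_relu)  # [0.0] [0.0] [2.0]
```

Here the purely linear stack collapses to the zero map, while the version with ReLU computes an absolute difference between channels, something no single 1×1 conv can represent.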