How do you count the number of trainable parameters in a convolutional layer?
Each filter has k*k*C_in weights plus one bias, so a layer with C_out filters has (k*k*C_in + 1)*C_out parameters. The count is independent of the input's spatial dimensions H and W because the same filter weights are reused at every position, which is why a conv layer needs far fewer parameters than a dense layer on the same input.
How to think about it
Interviewers typically give a concrete layer spec and ask you to compute the count on the spot. Know the formula cold and practise a few examples.
The formula
A standard 2D conv layer (square k x k kernel, no channel grouping) is defined by kernel size k, input channels C_in, and output channels C_out:
params = (k * k * C_in + 1) * C_out
The +1 is the per-filter bias. With bias=False (common when BatchNorm follows, since its shift β makes a separate conv bias redundant), drop it:
params = k * k * C_in * C_out
Spatial dimensions H and W do not appear, because the same kernel weights are shared across every position of the feature map.
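The formula above can be sketched as a small Python helper. The function name `conv2d_param_count` and its argument names are made up for illustration, and the sketch assumes a square k x k kernel with no channel grouping, as in the formula.

```python
def conv2d_param_count(k: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Trainable parameters of a standard 2D conv layer with a square k x k kernel.

    Hypothetical helper: H and W are deliberately not arguments,
    because they never enter the count.
    """
    weights_per_filter = k * k * c_in       # one k x k slice per input channel
    bias_per_filter = 1 if bias else 0      # the "+1" in the formula
    return (weights_per_filter + bias_per_filter) * c_out


with_bias = conv2d_param_count(k=3, c_in=16, c_out=32)                 # (3*3*16 + 1) * 32
without_bias = conv2d_param_count(k=3, c_in=16, c_out=32, bias=False)  # 3*3*16 * 32
print(with_bias, without_bias)
print(with_bias - without_bias)  # the difference is exactly one bias per filter, i.e. c_out
```

Turning the bias off always removes exactly C_out parameters, which is a handy sanity check when doing the arithmetic by hand.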
Worked examples
Example 1 — typical first conv layer
Input 224 x 224 x 3, kernel 7 x 7, 64 filters, bias=True:
(7 * 7 * 3 + 1) * 64 = (147 + 1) * 64 = 9,472
Example 2 — deeper layer
Input 28 x 28 x 128, kernel 3 x 3, 256 filters, bias=False:
3 * 3 * 128 * 256 = 294,912
Example 3 — 1×1 convolution
Input 14 x 14 x 512, kernel 1 x 1, 128 filters, bias=False:
1 * 1 * 512 * 128 = 65,536
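One way to double-check the three worked examples is to count the entries of the weight tensor directly. The sketch below assumes a weight layout of (C_out, C_in, k, k) purely for illustration, since the count is the same for any ordering of those axes. The helper name `count_from_shapes` is hypothetical.

```python
import math


def count_from_shapes(k: int, c_in: int, c_out: int, bias: bool) -> int:
    """Count parameters by multiplying out the weight tensor's shape."""
    weight_shape = (c_out, c_in, k, k)   # assumed layout, for illustration only
    total = math.prod(weight_shape)      # number of kernel weights
    if bias:
        total += c_out                   # one bias per filter
    return total


# (k, C_in, C_out, bias) for the three examples above; H and W are not needed.
examples = {
    "Example 1 (7x7, 3 -> 64, bias=True)": (7, 3, 64, True),
    "Example 2 (3x3, 128 -> 256, bias=False)": (3, 128, 256, False),
    "Example 3 (1x1, 512 -> 128, bias=False)": (1, 512, 128, False),
}

results = {name: count_from_shapes(*spec) for name, spec in examples.items()}
for name, n in results.items():
    print(f"{name}: {n:,}")
```

The input sizes 224 x 224, 28 x 28, and 14 x 14 never enter the computation. They matter for activation memory and FLOPs, not for the parameter count.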
Contrast with a dense layer
A fully-connected layer mapping 512 units to 256 units needs 512 * 256 + 256 = 131,328 parameters, and the count scales with the product of input and output size. A conv layer's count scales only with kernel area and channel counts, not with spatial resolution.
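To make the contrast concrete, the sketch below compares a dense layer applied to a flattened feature map with a 3 x 3 conv applied to the same map, at several resolutions. The scenario (128 channels, 256 outputs, sides of 7, 14, and 28) and both helper names are assumptions chosen for illustration.

```python
def dense_param_count(n_in: int, n_out: int, bias: bool = True) -> int:
    """Fully-connected layer: one weight per (input, output) pair, plus optional biases."""
    return n_in * n_out + (n_out if bias else 0)


def conv2d_param_count(k: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Standard 2D conv with a square k x k kernel and no grouping."""
    return (k * k * c_in + (1 if bias else 0)) * c_out


# The dense layer from the text: 512 units -> 256 units.
fc_512_to_256 = dense_param_count(512, 256)
print(f"dense 512 -> 256: {fc_512_to_256:,}")

# Hypothetical comparison: a side x side x 128 feature map, 256 outputs/filters.
rows = []
for side in (7, 14, 28):
    flattened = side * side * 128                    # dense input grows with H * W
    dense = dense_param_count(flattened, 256)
    conv = conv2d_param_count(k=3, c_in=128, c_out=256)
    rows.append((side, dense, conv))
    print(f"{side:>2} x {side:<2}  dense: {dense:>12,}  conv: {conv:>8,}")
```

Doubling the side length quadruples the dense layer's weight count, while the conv column stays constant.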
BatchNorm parameters
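If a BatchNorm layer follows the conv, it adds 2 * C_out learnable parameters: a scale γ and a shift β per channel. It also keeps a running mean and a running variance per channel (another 2 * C_out values), but these are updated from batch statistics during the forward pass rather than by backprop, so they do not count as trainable parameters.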
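As a rough sketch, the BatchNorm bookkeeping can be added to the conv count as follows. The helper names are hypothetical, and the conv is assumed to use bias=False because BatchNorm follows it.

```python
def batchnorm_counts(c_out: int) -> tuple[int, int]:
    """Return (trainable, non_trainable) value counts for per-channel BatchNorm."""
    trainable = 2 * c_out       # scale gamma and shift beta, one pair per channel
    non_trainable = 2 * c_out   # running mean and running variance per channel
    return trainable, non_trainable


def conv_bn_trainable(k: int, c_in: int, c_out: int) -> int:
    """Trainable parameters of conv (bias=False) followed by BatchNorm."""
    conv = k * k * c_in * c_out             # no "+1": the conv bias is dropped
    bn_trainable, _ = batchnorm_counts(c_out)
    return conv + bn_trainable


# Example 2 from above (3 x 3 kernel, 128 -> 256 channels) with BatchNorm appended.
bn_trainable, bn_buffers = batchnorm_counts(256)
total = conv_bn_trainable(k=3, c_in=128, c_out=256)
print(bn_trainable, bn_buffers, total)
```

Some frameworks store extra bookkeeping values alongside the running statistics, so a framework's reported buffer count may differ slightly from 2 * C_out. The trainable count is unaffected.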