Question 1

What is data augmentation in computer vision and which techniques are most effective?

Accepted Answer

Data augmentation artificially expands the training set by applying label-preserving transformations to existing images, improving generalisation and regularisation without collecting more data. Geometric transforms (flip, crop, rotation) and colour jitter are universally effective; stronger methods like CutMix, MixUp, and RandAugment consistently improve accuracy on top of basic augmentation.

Question 2

When should you use deep learning vs classical machine learning?

Accepted Answer

Deep learning wins when data is abundant, inputs are unstructured (images, text, audio), and features are hard to engineer by hand. Classical ML wins on structured tabular data, small datasets, and when interpretability or training speed matter.

Question 3

How does dropout work, and why must it behave differently during training and inference?

Accepted Answer

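Dropout randomly zeroes each neuron's output with probability p during training, forcing the network to learn redundant representations and preventing co-adaptation of neurons. At inference, dropout is disabled and all neurons are active. To keep expected activations consistent between the two modes, the standard inverted-dropout implementation scales the surviving outputs by 1/(1−p) during training, so no rescaling is needed at inference (the original formulation instead multiplies outputs by 1−p at inference). Forgetting to switch modes leaves dropout active at prediction time and produces noisy, degraded predictions.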
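A minimal sketch of the inverted-dropout convention in plain Python, where the 1/(1−p) scaling is applied to surviving activations during training and inference is the identity. The function name `dropout` and its `training` flag are illustrative, not any framework's API.

```python
import random

def dropout(values, p, training, rng=random):
    """Inverted dropout on a list of activations (assumes 0 <= p < 1)."""
    if not training or p == 0.0:
        return list(values)  # inference: all units active, no scaling
    keep_scale = 1.0 / (1.0 - p)
    # Each unit is zeroed with probability p; survivors are scaled up
    # so the expected activation matches the inference-time value.
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, p=0.5, training=True, rng=random.Random(0))
eval_out = dropout(activations, p=0.5, training=False)
```

In training mode every output is either zero or twice its input (for p = 0.5); in inference mode the input passes through unchanged.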

Question 4

What is early stopping, and how does it prevent overfitting?

Accepted Answer

Early stopping monitors validation loss after each epoch and halts training when it has not improved for a set number of epochs (the patience). It prevents the model from memorising training data past the point of best generalisation, acting as a free regulariser that requires no change to the model or loss function.
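The patience rule can be sketched as a small helper, assuming validation loss is reported once per epoch. The class name `EarlyStopping` and the loss values are made up for illustration.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def should_stop(self, epoch, val_loss):
        if val_loss < self.best_loss:
            # New best: remember it and reset the counter.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

val_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.71, 0.75]  # hypothetical run
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(epoch, loss):
        stopped_at = epoch
        break
```

Here training halts at epoch 4, and the weights worth keeping are those saved at `stopper.best_epoch` (epoch 2), which is why early stopping is usually paired with checkpointing the best model.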

Question 5

What is the difference between an epoch, an iteration, and a step in deep learning training?

Accepted Answer

An epoch is one complete pass through the entire training dataset. An iteration (or step) is one forward-backward pass on a single mini-batch. The number of iterations per epoch equals the dataset size divided by the batch size, rounded up when the final partial batch is kept. These distinctions matter when comparing runs with different batch sizes, reporting training progress, and configuring learning rate schedules.
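The arithmetic as a short sketch. The dataset size, batch size, and epoch count are arbitrary example values, and the `drop_last` flag borrows its name from common data-loader conventions.

```python
import math

def steps_per_epoch(dataset_size, batch_size, drop_last=False):
    """Iterations needed for one pass over the data."""
    if drop_last:
        return dataset_size // batch_size       # discard the partial batch
    return math.ceil(dataset_size / batch_size)  # keep the partial batch

dataset_size, batch_size, epochs = 50_000, 128, 10
per_epoch = steps_per_epoch(dataset_size, batch_size)
total_steps = per_epoch * epochs
```

With these numbers one epoch is 391 iterations (390 if the last partial batch is dropped), so a 10-epoch run is 3,910 steps, which is the figure a step-based learning rate schedule needs.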

Question 6

Explain the bias-variance tradeoff and how it relates to overfitting.

Accepted Answer

Bias is error from overly simple assumptions (underfitting) and variance is error from sensitivity to training-data noise (overfitting); reducing one often increases the other. An overfit model has low bias but high variance, so techniques like regularization, more data, and simpler models trade a little bias for a large reduction in variance.

Question 7

What are filters and feature maps in a CNN, and what do they represent?

Accepted Answer

A filter (kernel) is the set of learned weights that the network applies at each spatial position; a feature map is the spatial grid of responses produced when one filter slides over the input. Each filter detects one type of pattern, and the full stack of feature maps across all filters constitutes the layer's output representation.

Question 8

Walk me through the forward pass of a neural network end-to-end.

Accepted Answer

The forward pass feeds an input through every layer in sequence: each layer computes a linear transform followed by an activation, caching the intermediate values needed later for backpropagation. The final layer produces a prediction, which is compared to the label via a loss function.

Question 9

What is gradient clipping, and when is it necessary?

Accepted Answer

Gradient clipping caps the norm (or per-element value) of gradients before the optimiser step, preventing any single update from being so large that it destabilises training. It is especially important in recurrent networks and transformers, where gradients can explode across many time steps or layers, and in any network trained with a high learning rate on noisy data.
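A minimal sketch of clipping by global norm on a flat list of gradient values. The helper `clip_by_global_norm` is illustrative rather than a library function.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm  # already small enough
    scale = max_norm / total_norm
    # Every element shrinks by the same factor, so direction is preserved.
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]  # norm 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

The clipped gradient is approximately [0.6, 0.8]: same direction, norm reduced from 5 to 1. Gradients already under the threshold pass through untouched.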

Question 10

What is L2 regularisation (weight decay), and how does it reduce overfitting?

Accepted Answer

L2 regularisation adds a penalty equal to the sum of squared weights multiplied by a coefficient λ to the loss function, which encourages the optimiser to keep weights small. This penalises large, specialised weights and pushes the model toward simpler solutions that generalise better. In SGD it is equivalent to shrinking weights by a constant factor each step (hence weight decay), though in Adam the two diverge — requiring AdamW for correct decoupled decay.
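The SGD equivalence can be checked numerically on a single weight. This sketch assumes the penalty is written as (λ/2)·w², a common convention that makes its gradient exactly λ·w; with a penalty of λ·w² the decay factor would use 2λ instead. The function names are illustrative.

```python
def sgd_step_l2(w, grad, lr, lam):
    """SGD on loss + (lam / 2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)

def sgd_step_decay(w, grad, lr, lam):
    """Shrink the weight by a constant factor, then apply the plain gradient."""
    return (1.0 - lr * lam) * w - lr * grad

w, grad, lr, lam = 2.0, 0.5, 0.1, 0.01
via_penalty = sgd_step_l2(w, grad, lr, lam)
via_decay = sgd_step_decay(w, grad, lr, lam)
```

Both routes give the same updated weight. Under Adam the penalty gradient would be divided by the adaptive scale before being applied, which is where the two formulations stop matching.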

Question 11

What is pooling, and when would you choose max pooling over average pooling?

Accepted Answer

Pooling downsamples a feature map by aggregating values in a local window, reducing spatial dimensions and building position tolerance. Max pooling takes the strongest activation in each window; average pooling takes the mean. Max pooling dominates in classification backbones; average pooling is preferred in global summarisation and smooth feature maps.
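A small sketch of non-overlapping 2x2 pooling on a nested list, assuming stride equal to the window size. The `pool2d` helper and the example feature map are illustrative.

```python
def pool2d(feature_map, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(reduce_fn(window))
        out.append(out_row)
    return out

def mean(values):
    return sum(values) / len(values)

fmap = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 9, 4],
    [2, 0, 3, 8],
]
max_pooled = pool2d(fmap, 2, max)   # [[7, 2], [2, 9]]
avg_pooled = pool2d(fmap, 2, mean)  # [[4.0, 1.0], [1.0, 6.0]]
```

Max pooling keeps the single strongest response per window, while average pooling blends all four values, which is why the former suits "is the feature present?" and the latter suits summarisation.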

Question 12

Why does training loss keep falling while validation loss rises?

Accepted Answer

This divergence is the signature of overfitting: the model has enough capacity to memorise training-set specifics — noise, label errors, dataset-specific patterns — that do not generalise. Training loss measures fit to what has already been seen; validation loss measures generalisation to held-out data. As the model memorises rather than learns structure, it scores better on training data and worse on everything else.

Question 13

In a PyTorch training loop, why do you need to call optimizer.zero_grad() before backpropagation?

Accepted Answer

PyTorch accumulates gradients by default, adding new gradients to whatever is already stored in each parameter's .grad. If you do not clear them each iteration, gradients from previous batches mix with the current batch and corrupt the weight updates. zero_grad() clears the stored gradients (recent PyTorch versions set them to None by default rather than filling them with zeros) so each step uses only the current batch's signal.

Question 14

What do the query, key, and value vectors represent in attention?

Accepted Answer

The query represents what a token is looking for, the key represents what a token is advertising about itself, and the value is the content it contributes if selected. Attention scores measure query-key compatibility, and the output is a soft retrieval: a weighted sum of values where the weights come from those compatibility scores.
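The soft-retrieval view can be sketched for a single query in plain Python. This assumes scaled dot-product scoring, and the toy keys and values are made up so that the first key matches the query.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return attention weights and the weighted sum of values for one query."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, output = attend([4.0, 0.0], keys, values)
```

The query aligns with the first key, so most of the weight lands there and the output is dominated by the first value, with a small contribution from the second.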

Question 15

Why do we scale by sqrt(d_k) in scaled dot-product attention?

Accepted Answer

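For large key dimensions, the dot products between query and key vectors grow in magnitude: if the components are independent with zero mean and unit variance, the dot product has variance d_k, so its typical size grows like sqrt(d_k). Large scores push the softmax into saturated regions with very small gradients. Dividing by sqrt(d_k) keeps the pre-softmax scores at roughly unit variance regardless of dimension, stabilising training.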
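The variance argument can be written out as a short derivation, under the simplifying assumption that the components of q and k are independent with zero mean and unit variance:

```latex
\begin{aligned}
q \cdot k &= \sum_{i=1}^{d_k} q_i k_i, \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \left(\mathbb{E}[q_i]\,\mathbb{E}[k_i]\right)^2 = 1, \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k, \\
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1.
\end{aligned}
```

Trained projections will not satisfy these assumptions exactly, so the scaled scores have only approximately unit variance in practice.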

Question 16

What does softmax do, and why is it used in the output layer?

Accepted Answer

Softmax converts a vector of raw scores (logits) into a valid probability distribution — all values positive and summing to one — by exponentiating each score and normalising by the total. It is used in classification output layers because the resulting probabilities pair naturally with cross-entropy loss and allow confident predictions to dominate while preserving the relative ordering of logits.
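A minimal softmax sketch. Subtracting the maximum logit before exponentiating is a standard stability trick that leaves the result unchanged; the example logits are arbitrary.

```python
import math

def softmax(logits):
    """Exponentiate and normalise; shifting by the max avoids overflow."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# Adding the same constant to every logit does not change the output,
# and the shifted version does not overflow on large values.
large = softmax([1000.0, 999.0, 998.1])
```

The outputs are positive, sum to one, and keep the ordering of the logits. Without the shift, `math.exp(1000.0)` would raise an overflow error.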

Question 17

What do stride and padding control in a convolutional layer?

Accepted Answer

Stride sets how many positions the kernel jumps between applications, controlling output resolution — stride 2 roughly halves spatial dimensions. Padding adds values (usually zeros) around the border so the kernel can be applied to edge pixels, letting you choose whether the output shrinks, stays the same, or (rarely) grows relative to the input.
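The effect on output size follows the usual formula floor((n + 2·padding − kernel) / stride) + 1 along each spatial dimension. A small sketch, assuming no dilation; the function name is illustrative.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one spatial dimension (no dilation)."""
    return (n + 2 * padding - kernel) // stride + 1

same = conv_output_size(32, kernel=3, stride=1, padding=1)    # 32: size preserved
halved = conv_output_size(32, kernel=3, stride=2, padding=1)  # 16: stride 2 halves
shrunk = conv_output_size(32, kernel=3, stride=1, padding=0)  # 30: no padding
grown = conv_output_size(32, kernel=3, stride=1, padding=2)   # 34: over-padding
```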

Question 18

What are embeddings and why are they central to modern deep learning?

Accepted Answer

An embedding is a dense, learned vector representation of a discrete or high-dimensional object — a word, image, user, product — in a continuous low-dimensional space. Proximity in embedding space reflects semantic or behavioural similarity, making embeddings a universal interface between raw data and neural networks.

Question 19

What does a single artificial neuron (perceptron) actually compute?

Accepted Answer

A neuron takes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. The weights encode learned feature importance, the bias shifts the decision boundary, and the activation introduces the non-linearity needed for complex mappings.
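As a sketch, with sigmoid chosen arbitrarily as the activation and made-up weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

out = neuron([1.0, 2.0], [0.5, -0.25], bias=0.0)  # z = 0.5 - 0.5 = 0 -> 0.5
shifted = neuron([1.0, 2.0], [0.5, -0.25], bias=2.0)
```

Raising the bias moves the same input from the decision boundary (output 0.5) to a confidently positive output, which is the "shifts the decision boundary" effect.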

Question 20

What does a convolution operation do in a CNN?

Accepted Answer

A convolution slides a small learned weight matrix (kernel) across the input, computing a dot product at each position to produce a feature map. Each kernel learns to detect one spatial pattern — an edge, a corner, a texture — regardless of where it appears in the image.
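A minimal sketch of the sliding dot product on nested lists, with no padding and stride 1. As in most deep learning frameworks, the kernel is not flipped (strictly a cross-correlation). The vertical-edge kernel here is hand-written for illustration rather than learned.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image, taking a dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright vertical edges
fmap = conv2d_valid(image, kernel)  # [[0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only in the column where the edge sits, and the same kernel would fire wherever that edge appeared in the image.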

Question 21

Why do neural networks need activation functions at all?

Accepted Answer

Without a non-linear activation, any stack of linear layers collapses to a single linear transformation, giving a model no more expressive than linear regression (or logistic regression, if a sigmoid or softmax is applied to the output). Activation functions break linearity so the network can approximate arbitrarily complex functions.
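The collapse is easy to verify in one dimension. In this sketch the layer weights are arbitrary and `make_linear` is an illustrative helper.

```python
def make_linear(w, b):
    """Return a 1-D linear layer x -> w * x + b."""
    def layer(x):
        return w * x + b
    return layer

def relu(z):
    return max(0.0, z)

layer1 = make_linear(2.0, 1.0)
layer2 = make_linear(-3.0, 0.5)

# Two stacked linear layers equal one layer with w = w2*w1, b = w2*b1 + b2.
collapsed = make_linear(-3.0 * 2.0, -3.0 * 1.0 + 0.5)

x = -2.0
stacked_out = layer2(layer1(x))          # no activation in between
nonlinear_out = layer2(relu(layer1(x)))  # ReLU in between
```

Without the activation, the two-layer stack and the single collapsed layer agree for every input. With ReLU in between they differ, so the extra layer adds expressive power.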

Question 22

Why do CNNs outperform fully-connected networks on image data?

Accepted Answer

CNNs exploit three structural properties of images (local correlation, translation invariance of patterns, and compositional hierarchy) through parameter sharing and local receptive fields. A dense network has no built-in notion of spatial structure: it learns a separate weight for every pixel-unit pair, so it must rediscover the same pattern at every location and requires orders of magnitude more parameters.

Question 23

Why are GPUs used for deep learning instead of CPUs?

Accepted Answer

Neural network training is dominated by large matrix multiplications that are embarrassingly parallel. GPUs have thousands of small cores optimised for this exact operation, whereas CPUs have tens of powerful cores optimised for low-latency sequential logic. The throughput difference is 10–100x for typical DL workloads.

Question 24

Why does a transformer need positional encoding?

Accepted Answer

Self-attention computes a weighted sum over value vectors where the weights depend only on dot products between queries and keys — there is no notion of position in this operation. Without an explicit positional signal injected into the token embeddings, the model cannot distinguish 'the dog bit the man' from 'the man bit the dog'.

Question 25

What is an autoencoder and what is it used for?

Accepted Answer

An autoencoder is a neural network trained to compress input into a low-dimensional bottleneck (encoder) and then reconstruct the original input from that bottleneck (decoder). It learns a compact representation without labels, making it useful for dimensionality reduction, anomaly detection, and as a component of generative models.

Question 26

What is batch normalisation, and why does it help training?

Accepted Answer

Batch normalisation normalises each feature across the mini-batch to zero mean and unit variance, then applies learnable scale and shift parameters. It stabilises internal activation distributions (the original motivation was reducing internal covariate shift, although later work attributes much of the benefit to a smoother optimisation landscape), which allows higher learning rates, reduces dependence on careful weight initialisation, and provides mild regularisation through the noise in batch statistics.
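A sketch of the training-mode computation for a single feature, using plain Python lists. The `gamma`, `beta`, and `eps` names follow common convention; at inference a real layer would use running averages of the mean and variance instead of the current batch's statistics.

```python
import math

def batch_norm_feature(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across the batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [2.0, 4.0, 6.0, 8.0]  # one feature, four samples
normed = batch_norm_feature(batch)
normed_mean = sum(normed) / len(normed)
normed_var = sum((v - normed_mean) ** 2 for v in normed) / len(normed)
```

After normalisation the feature has mean approximately 0 and variance approximately 1, whatever the scale of the raw values.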

Question 27

How does batch size affect training — speed, convergence, and generalisation?

Accepted Answer

Larger batches give more accurate gradient estimates and enable higher GPU utilisation, but they tend to converge to sharper minima that generalise worse. Smaller batches introduce gradient noise that acts as implicit regularisation, helping the optimiser escape sharp minima and often finding flatter, better-generalising solutions — at the cost of slower wall-clock training per epoch.

Question 28

How are batch size and learning rate related, and what is learning-rate warmup?

Accepted Answer

Larger batches give lower-variance gradient estimates, so they typically allow and often need a proportionally larger learning rate, while very high learning rates early in training can destabilize it. Warmup ramps the learning rate up from a small value over the first steps to avoid early divergence, then follows a decay schedule.

Question 29

What is the difference between activation checkpointing and gradient accumulation?

Accepted Answer

Activation checkpointing reduces saved activations inside one micro-batch by recomputing forward regions during backward. Gradient accumulation reduces the micro-batch processed at once and sums gradients across several micro-steps before one optimizer update, preserving a larger effective batch. They target different memory pressure and can be combined.

Question 30

How do you handle severe class imbalance when training a deep learning model?

Accepted Answer

Class imbalance causes the model to favour the majority class and ignore the minority. The main levers are loss reweighting, oversampling or undersampling, focal loss, and using the right evaluation metric: accuracy is misleading because always predicting the majority class already scores highly, so use F1, precision-recall AUC, or MCC instead.

Question 31

How do you count the number of trainable parameters in a convolutional layer?

Accepted Answer

Each filter has k*k*C_in weights plus one bias, and a layer with C_out filters therefore has (k*k*C_in + 1)*C_out parameters. This count is independent of the input's spatial dimensions H and W, which is what makes CNNs so parameter-efficient.
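The formula as a one-line helper, assuming square k x k kernels and no grouping; the layer shapes are example values.

```python
def conv_param_count(kernel_size, in_channels, out_channels, bias=True):
    """(k*k*C_in + 1) * C_out, or k*k*C_in * C_out without biases."""
    per_filter = kernel_size * kernel_size * in_channels + (1 if bias else 0)
    return per_filter * out_channels

first_layer = conv_param_count(3, in_channels=3, out_channels=64)    # 1792
deeper_layer = conv_param_count(3, in_channels=64, out_channels=128)  # 73856
```

A 3x3 layer mapping RGB to 64 channels has 1,792 parameters whether the image is 32x32 or 1024x1024.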

Question 32

Why use cross-entropy loss instead of MSE for classification?

Accepted Answer

MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
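The gradient difference can be seen on a single sigmoid output p = σ(z) with target y. For MSE, dL/dz = 2(p − y)·p(1 − p); for binary cross-entropy, dL/dz = p − y. A sketch with an arbitrary, confidently wrong logit:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad(z, y):
    """d/dz of (sigmoid(z) - y)**2."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def bce_grad(z, y):
    """d/dz of binary cross-entropy with a sigmoid output."""
    return sigmoid(z) - y

z, y = -8.0, 1.0  # network is confidently wrong
g_mse = mse_grad(z, y)
g_bce = bce_grad(z, y)
```

The MSE gradient is crushed by the p(1 − p) factor to well under 0.001 in magnitude, while the cross-entropy gradient stays close to −1, so the badly wrong prediction still gets a strong corrective signal.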

Question 33

What is the dying ReLU problem and how do you prevent it?

Accepted Answer

A ReLU neuron dies when its pre-activation is permanently negative for every training example, making its gradient exactly zero and freezing the neuron forever. Large learning rates or poorly initialized weights are the usual causes; leaky ReLU, parametric ReLU, or ELU provide sub-zero gradients that keep neurons recoverable.

Question 34

What is the difference between Adam and AdamW?

Accepted Answer

Adam combines momentum and per-parameter adaptive learning rates, but its L2 regularization gets entangled with the adaptive scaling. AdamW decouples weight decay from the gradient-based update, applying decay directly to the weights, which yields better generalization and is the standard optimizer for training transformers.

Question 35

What is the difference between batch normalization and layer normalization, and why do transformers use layer norm?

Accepted Answer

Batch norm normalizes each feature across the samples in a batch, so it depends on batch statistics and behaves differently in training versus inference; layer norm normalizes across the features of a single example, independent of batch size. Transformers use layer norm because sequence models have variable lengths and small or varying batches, where per-example normalization is more stable.

Question 36

What is the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures?

Accepted Answer

Encoder-only models like BERT use bidirectional attention and are best for understanding and classification; decoder-only models like GPT use causal masked attention for autoregressive generation; encoder-decoder models like T5 encode an input then attend to it while decoding, suiting sequence-to-sequence tasks like translation.

Question 37

What is the difference between Xavier (Glorot) and He initialization, and when do you use each?

Accepted Answer

Both scale the variance of the initial weights by the layer's size (Xavier uses fan-in and fan-out, He typically uses fan-in alone) to keep activation and gradient variance stable across layers. Xavier (Glorot) assumes a symmetric activation like tanh or sigmoid, while He initialization uses a larger variance tuned for ReLU-family activations, which zero out half their inputs. Use Xavier with tanh or sigmoid and He with ReLU or LeakyReLU.
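The standard deviations for the normal-distribution variants are sqrt(2 / (fan_in + fan_out)) for Xavier and sqrt(2 / fan_in) for He. A sketch with an arbitrary 512-to-512 layer; the function names are illustrative.

```python
import math

def xavier_std(fan_in, fan_out):
    """Glorot/Xavier normal: Var(w) = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """He/Kaiming normal for ReLU: Var(w) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

xavier = xavier_std(512, 512)
he = he_std(512)
```

For a square layer the He standard deviation is sqrt(2) times the Xavier one, compensating for ReLU discarding roughly half of the signal.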

Question 38

What is the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures?

Accepted Answer

Encoder-only models build bidirectional context representations suited for classification and embedding tasks. Decoder-only models generate text autoregressively using causal (masked) self-attention and dominate language modelling. Encoder-decoder models use a full encoder to encode the source and a decoder with cross-attention to generate the target, fitting sequence-to-sequence tasks like translation and summarisation.

Question 39

Explain self-attention and the roles of the Query, Key, and Value vectors.

Accepted Answer

Self-attention lets each token build a representation by attending to every other token: it scores its Query against all Keys, normalizes the scores with softmax, and takes a weighted sum of the Values. Q, K, and V are learned linear projections of the input that respectively represent what a token is looking for, what it offers as a match key, and the content it contributes.

Question 40

What causes exploding gradients and how is gradient clipping a fix?

Accepted Answer

Exploding gradients happen when the product of layer Jacobians has spectral norm greater than 1, causing gradients to grow exponentially with depth. Gradient clipping rescales the gradient whenever its norm exceeds a threshold so that the norm equals that threshold, preventing divergence without changing the gradient direction.

Question 41

What is GELU and why does it outperform ReLU in transformer models?

Accepted Answer

GELU (Gaussian Error Linear Unit) multiplies the input by the probability that a standard Gaussian random variable is smaller than it, producing a smooth, non-monotonic curve that approximates ReLU but with a stochastic regularization flavor. Transformers favor GELU because the smooth gradient near zero improves optimization in deep attention-based architectures.
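The exact form is GELU(x) = x·Φ(x), where Φ is the standard normal CDF, which can be written with the error function. A sketch comparing it with ReLU at a few arbitrary points:

```python
import math

def gelu(x):
    """x times the standard normal CDF evaluated at x."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

points = [-3.0, -1.0, 0.0, 1.0, 5.0]
gelu_values = [gelu(x) for x in points]
relu_values = [relu(x) for x in points]
```

For large positive inputs GELU is almost identical to ReLU. For negative inputs it dips slightly below zero and then returns towards zero (gelu(-1) is lower than gelu(-3)), which is the non-monotonic region.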

Question 42

What is gradient accumulation and when do you need it?

Accepted Answer

Gradient accumulation sums gradients over multiple small forward-backward passes before calling the optimizer, simulating a larger effective batch size without requiring the memory to hold it all at once. It is the standard workaround when the desired batch size does not fit in GPU memory.

Question 43

What is gradient accumulation and why is it useful?

Accepted Answer

Gradient accumulation runs several forward and backward passes without zeroing gradients, sums them, and only steps the optimizer after N micro-batches, simulating a larger effective batch size than fits in memory. It lets you train with large effective batches on limited GPU memory at the cost of more compute per update.
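A sketch showing that accumulated micro-batch gradients match the full-batch gradient, using a one-parameter model y_hat = w·x with mean squared error. The data are made up, and the micro-batches are assumed to be equal in size; each micro-batch gradient is divided by the number of micro-batches so the sum is an average.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for y_hat = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # (x, y) pairs
w = 0.5

# Gradient if the whole batch fit in memory at once.
full_batch_grad = grad_mse(w, data)

# Same gradient built from two micro-batches of two samples each.
micro_batches = [data[0:2], data[2:4]]
accumulated = 0.0
for micro in micro_batches:
    accumulated += grad_mse(w, micro) / len(micro_batches)
# Only now would the optimizer step and the accumulator be reset.
```

Forgetting the division leaves the accumulated gradient N times too large, which behaves like an unintended N-fold increase in learning rate.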

Question 44

What is gradient clipping and when would you use it?

Accepted Answer

Gradient clipping caps the magnitude of gradients (by value or by global norm) before the optimizer step, preventing exploding gradients that cause unstable or diverging training. It is especially useful in RNNs and transformers, where a single large update can destabilize learning.

Question 45

How do you train a deep learning model when you have very little labelled data?

Accepted Answer

Small labelled datasets call for a layered strategy: transfer learning from a pretrained backbone, heavy data augmentation, self-supervised pretraining on unlabelled data, and regularisation to prevent the model memorising the few examples it sees.

Question 46

How does gradient checkpointing reduce GPU memory, and what is the trade-off?

Accepted Answer

Gradient checkpointing—more precisely activation checkpointing—keeps selected forward boundaries instead of every intermediate activation. During backward it re-runs checkpointed forward regions to reconstruct discarded tensors. It reduces activation memory but leaves parameters, gradients and optimizer state unchanged, trading extra computation for lower peak VRAM.

Question 47

What is a learning rate schedule, and why is warmup important?

Accepted Answer

A learning rate schedule changes the learning rate during training rather than keeping it fixed. Warmup starts with a very small LR and ramps it up over the first few hundred or thousand steps, preventing early large gradient updates from destabilising freshly initialised weights. After warmup, the LR is typically decayed (via cosine annealing, step decay, or linear decay) so the optimiser can take progressively smaller steps and settle into a minimum instead of oscillating around it.
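One common combination, linear warmup followed by cosine decay to zero, can be sketched as a function of the step index. The function name `lr_at` and the step counts are illustrative choices.

```python
import math

def lr_at(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

base_lr, warmup_steps, total_steps = 1e-3, 100, 1000
start = lr_at(0, base_lr, warmup_steps, total_steps)       # 1% of base_lr
peak = lr_at(100, base_lr, warmup_steps, total_steps)      # base_lr
midpoint = lr_at(550, base_lr, warmup_steps, total_steps)  # about half
end = lr_at(1000, base_lr, warmup_steps, total_steps)      # zero
```

The schedule is keyed on optimizer steps, not epochs, so changing the batch size changes how many epochs the warmup spans.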

Question 48

How do LSTM gates solve the vanishing gradient problem?

Accepted Answer

An LSTM maintains a cell state that flows through time via additive updates controlled by learned gates, giving gradients a near-linear path across many steps. The forget, input, and output gates let the network selectively retain, write, and expose information rather than crushing every signal through a squashing non-linearity at every step.

Question 49

What is mixed precision training and why does it matter?

Accepted Answer

Mixed precision training runs the forward and backward passes with weights and activations in float16 (or bfloat16) while keeping a float32 master copy of the weights for the update step. This roughly halves activation memory and delivers 2–4x throughput on modern tensor cores, with negligible accuracy loss when float16 is paired with loss scaling (bfloat16 has the same exponent range as float32 and usually does not need it).

Question 50

Why use multiple attention heads instead of one large attention operation?

Accepted Answer

Multiple heads let the model simultaneously attend to different types of relationships — syntactic, semantic, coreference, positional — within the same layer. A single head produces a single weighted mixture and can only represent one relational pattern per layer; splitting into h heads and projecting to lower dimensions gives h independent subspaces for pattern capture at the same total parameter cost.