Deep Learning | Medium | Asked at Google, Meta and Apple

How do you train a deep learning model when you have very little labelled data?

For: ML Engineer, AI / LLM Engineer, Data Scientist

The short answer

Small labelled datasets call for a layered strategy: transfer learning from a pretrained backbone, heavy data augmentation, self-supervised pretraining on unlabelled data, and regularisation to prevent the model memorising the few examples it sees.

How to think about it

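“Small” is relative to task complexity. A 500-image binary classifier is usually feasible with a pretrained backbone; a 500-example multilabel medical segmentation task is far harder and may not be workable without more labels or domain-specific pretraining. Know which regime you’re in before choosing a strategy.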

Strategy stack — apply roughly in order of impact

1. Transfer learning (highest leverage)

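Start from a pretrained backbone (ResNet, ViT, BERT, Whisper), freeze it, and train only a new task-specific head at first. The backbone already encodes rich representations learned from millions of examples, so the few labels you have are spent on a small number of parameters. If the head alone plateaus, unfreeze the top few backbone layers and continue with a much lower learning rate. This is usually the single most effective intervention.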

2. Data augmentation

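Expand the effective dataset without new labels. For vision: random crops, flips, colour jitter, and the batch-level mixing methods Mixup and CutMix, which blend pairs of images and their labels. For text: back-translation and EDA (Easy Data Augmentation: synonym replacement, random insertion, random swap, and random deletion). For audio: time-stretch, pitch shift, additive noise. Only use transformations that leave the label unchanged; a horizontal flip is harmless for cats versus dogs but not for reading digits.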

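from torchvision import transforms

# Training-time augmentation only; use a deterministic resize/crop for validation.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                    # flips with probability 0.5
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),  # random crop, resized to 224x224
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.ToTensor(),
    # Assumes an ImageNet-pretrained backbone, which expects these channel statistics.
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])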
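The same idea applies to text. The sketch below implements random swap and random deletion, the two EDA operations that need no synonym dictionary, using only the standard library. The function names, the `p=0.1` deletion rate, and the `augment` wrapper are illustrative assumptions; it also assumes that small word-level perturbations do not change the label.

```python
import random


def random_swap(words: list[str], n_swaps: int, rng: random.Random) -> list[str]:
    """Swap two randomly chosen positions, n_swaps times."""
    words = words.copy()
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words: list[str], p: float, rng: random.Random) -> list[str]:
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return words.copy()
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def augment(sentence: str, n_copies: int = 4, seed: int = 0) -> list[str]:
    """Generate n_copies noisy variants of a sentence; the label is reused as-is."""
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for _ in range(n_copies):
        if rng.choice(["swap", "delete"]) == "swap":
            new_words = random_swap(words, n_swaps=1, rng=rng)
        else:
            new_words = random_deletion(words, p=0.1, rng=rng)
        variants.append(" ".join(new_words))
    return variants


for variant in augment("the delivery arrived two days late and the box was damaged"):
    print(variant)
```

Each variant would be added to the training set with the original sentence's label; the validation set should stay unaugmented.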

3. Self-supervised or semi-supervised learning

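If you have unlabelled data in the same domain, pretrain or fine-tune the backbone on it using contrastive learning (SimCLR, MoCo) or masked prediction (MAE for images, BERT-style masked language modelling for text). Semi-supervised methods such as pseudo-labelling instead train on the labelled and unlabelled sets jointly. Even 10k unlabelled images can meaningfully improve a model trained on 500 labelled ones.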
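To make the contrastive option concrete, here is a sketch of the NT-Xent loss that SimCLR optimises, written in PyTorch. It assumes `z1` and `z2` are the projected embeddings of two augmented views of the same batch (row `i` of each comes from the same image); the function name and the temperature of 0.5 are placeholders, not tuned values.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss for paired embeddings z1 and z2, each of shape (n, d)."""
    n = z1.shape[0]

    # Stack both views and L2-normalise so dot products are cosine similarities.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2n, d)
    sim = (z @ z.T) / temperature                       # (2n, 2n)

    # A sample must not be its own positive or negative: mask the diagonal.
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))

    # The positive for row i is the other view of the same image:
    # i + n for the first half, i - n for the second half.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)

    # Cross-entropy over each row treats all other 2n - 2 samples as negatives.
    return F.cross_entropy(sim, targets)


# Random embeddings standing in for the output of a projection head.
z1 = torch.randn(8, 32)
z2 = z1 + 0.05 * torch.randn(8, 32)  # second "view": a slightly perturbed copy
loss = nt_xent_loss(z1, z2)
```

In a real pipeline the two views come from applying the augmentation transform twice to each unlabelled image, and the loss is backpropagated through the backbone and projection head.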

4. Strong regularisation

With few examples, every clean training signal is precious and every noisy label is dangerous. Use dropout, weight decay, early stopping, and label smoothing together, and hold out a validation set, however small, so that early stopping has something to monitor.
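In PyTorch, three of these are one-line settings: `nn.Dropout` in the model, the `weight_decay` argument of the optimiser, and the `label_smoothing` argument of `nn.CrossEntropyLoss`. Early stopping has no built-in equivalent, so a small framework-agnostic helper is sketched below. The class name, `patience`, `min_delta`, and the validation losses are all made up for illustration.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the patience counter.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses standing in for a real training loop.
val_losses = [0.90, 0.70, 0.62, 0.63, 0.65, 0.64, 0.66]

stopper = EarlyStopping(patience=3)
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(epoch, val_loss):
        break

print(f"stopped at epoch {epoch}, best epoch was {stopper.best_epoch}")
```

In a real loop you would also save a checkpoint whenever `best_epoch` changes and restore it after stopping, so the final model is the best one rather than the last one.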

5. Reconsider classical ML

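If you have structured (tabular) features and fewer than ~2k labelled examples, XGBoost or a regularised logistic regression will often match or beat a neural net, and it trains in seconds. Do not fight the data regime; match the model to it, and treat the classical model as the baseline any deep model has to beat.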
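A cross-validated baseline takes only a few lines with scikit-learn. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for real tabular data; the dataset shape, `C=0.5`, and the five-fold split are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small tabular dataset: 500 rows, 20 features.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# L2-regularised logistic regression; a smaller C means stronger regularisation.
# Scaling sits inside the pipeline so it is fitted on each training fold only.
baseline = make_pipeline(StandardScaler(), LogisticRegression(C=0.5, max_iter=1000))

# Stratified folds keep the class balance stable, which matters with few examples.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(baseline, X, y, cv=cv, scoring="accuracy")

print(f"baseline accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With this little data, the spread across folds is as informative as the mean: a neural net that beats the baseline by less than that spread has not clearly beaten it.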

