datarekha
NLP & LLMs · Easy · Asked at Google, Amazon, Meta

What is sequence padding and why is it necessary for batch training?

The short answer

Padding adds dummy tokens to shorter sequences so that all examples in a batch share the same length, which batched tensor operations require. An attention mask tells the model to ignore the padded positions so they do not affect attention scores; padded positions should likewise be excluded from the loss.

How to think about it

Neural network frameworks process batches as rectangular tensors. If sentence A has 12 tokens and sentence B has 7, you cannot stack them into a single 2 x N matrix, because the two rows have different lengths. Padding resolves this by filling the shorter sequences up to a common length with a special [PAD] token.

Two common padding strategies

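  • Post-padding: pad at the end. [5, 22, 3, 0, 0]. This is the usual default for Transformer encoders such as BERT, because the real tokens keep their normal positions. With RNNs it needs masking or sequence packing; otherwise the final hidden state is computed after the pad tokens.
  • Pre-padding: pad at the start. [0, 0, 5, 22, 3]. Often used with RNNs, because the last time step is then a real token, and with decoder-only language models during batched generation, so that new tokens directly follow the prompt.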
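
The two strategies can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library API: `pad_batch`, `pad_id`, and `side` are hypothetical names, and the pad ID of 0 simply mirrors the examples above.

```python
def pad_batch(sequences, pad_id=0, side="post"):
    """Pad every sequence to the length of the longest one in the batch."""
    if side not in ("post", "pre"):
        raise ValueError("side must be 'post' or 'pre'")
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (max_len - len(seq))
        # Post-padding appends the filler; pre-padding prepends it.
        padded.append(list(seq) + filler if side == "post" else filler + list(seq))
    return padded


batch = [[5, 22, 3], [7, 8, 9, 10, 11]]
post = pad_batch(batch, side="post")
pre = pad_batch(batch, side="pre")
print(post)  # [[5, 22, 3, 0, 0], [7, 8, 9, 10, 11]]
print(pre)   # [[0, 0, 5, 22, 3], [7, 8, 9, 10, 11]]
```

Either way the result is rectangular, so it can be converted into a single tensor.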

Attention masks

Padding tokens must not influence model outputs. Transformers accept an attention mask, a binary tensor with the same shape as the input IDs, where 1 means “attend” and 0 means “ignore”.
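
To make the effect of the mask concrete, here is a small pure-Python sketch of a masked softmax over one row of attention scores. It is an illustration under stated assumptions, not how any particular library implements it: `attention_mask` and `masked_softmax` are hypothetical helpers, the scores are made up, and the pad ID is assumed to be 0. Real tokenizers return the mask directly, which avoids confusing a genuine token with the pad ID.

```python
import math


def attention_mask(input_ids, pad_id=0):
    """Return 1 for real tokens and 0 for padding positions."""
    return [1 if tok != pad_id else 0 for tok in input_ids]


def masked_softmax(scores, mask):
    """Softmax over positions where mask == 1; masked positions get weight 0.

    Assumes at least one position is unmasked.
    """
    # Replacing masked scores with -inf makes exp() return 0.0 for them.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


input_ids = [5, 22, 3, 0, 0]          # post-padded sequence, pad ID assumed to be 0
mask = attention_mask(input_ids)
scores = [2.0, 1.0, 0.5, 3.0, 3.0]    # made-up raw attention scores
weights = masked_softmax(scores, mask)

print(mask)                           # [1, 1, 1, 0, 0]
print([round(w, 3) for w in weights])
```

Although the padded positions have the highest raw scores here, they receive zero weight, and the remaining weights still sum to 1.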

from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentences = [
    "Transformers use self-attention",
    "Short sentence",
]

batch = tokenizer(
    sentences,
    padding=True,       # pad to longest in batch
    truncation=True,
    max_length=16,
    return_tensors="pt",
)

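print(batch["input_ids"].shape)      # torch.Size([2, N]): N is the longest tokenized length in the batch; max_length=16 only caps it
print(batch["attention_mask"])
# tensor([[1, 1, ..., 1, 1],      longest sentence: no padding
#         [1, 1, ..., 0, 0]])     shorter sentence: trailing 0s mark [PAD]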

Dynamic vs. static padding

  • Static padding: pad everything to a global max_length (e.g. 512). Simple but wastes compute on short batches.
  • Dynamic padding (per-batch): pad only to the longest sequence in the current batch. This can cut GPU memory use and speed up training (on the order of 20-40% on variable-length corpora, though the actual gain depends on how widely the lengths vary).
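
A rough way to see the difference is to count how many token slots (real plus padding) each strategy processes. The sketch below uses made-up sequence lengths and a hypothetical helper, `padded_slots`; it assumes every sequence already fits within the static limit, and it measures slots only, not wall-clock time.

```python
def padded_slots(lengths, batch_size, static_max_length=None):
    """Count token slots (real + pad) processed across all batches.

    If static_max_length is given, every sequence is padded to that length.
    Otherwise each batch is padded to its own longest sequence.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = static_max_length if static_max_length is not None else max(batch)
        total += target * len(batch)
    return total


# Made-up token counts for eight sequences, batched four at a time.
lengths = [12, 7, 30, 9, 128, 40, 15, 22]
static_total = padded_slots(lengths, batch_size=4, static_max_length=512)
dynamic_total = padded_slots(lengths, batch_size=4)
real_tokens = sum(lengths)

print(static_total)   # 4096
print(dynamic_total)  # 632
print(real_tokens)    # 263
```

In the Hugging Face ecosystem, per-batch padding is typically done by tokenizing without padding and letting a collator such as `DataCollatorWithPadding` pad each batch when it is assembled.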

Truncation: when a sequence exceeds max_length, tokens beyond the limit are dropped. Decide whether to truncate from the start or end depending on where the important content lives.
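
The choice of side can be illustrated with a short sketch. `truncate` and its `side` parameter are hypothetical names used only for this example, and special tokens such as [CLS] and [SEP], which real tokenizers add back after truncating, are ignored. Hugging Face tokenizers expose a comparable choice through the `truncation_side` attribute.

```python
def truncate(token_ids, max_length, side="end"):
    """Keep at most max_length tokens, dropping the excess from the given side."""
    if side not in ("start", "end"):
        raise ValueError("side must be 'start' or 'end'")
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == "end":
        # Drop tokens from the end: keep the first max_length tokens.
        return list(token_ids[:max_length])
    # Drop tokens from the start: keep the last max_length tokens.
    return list(token_ids[len(token_ids) - max_length:])


ids = [1, 2, 3, 4, 5, 6, 7, 8]
keep_head = truncate(ids, max_length=5, side="end")
keep_tail = truncate(ids, max_length=5, side="start")
print(keep_head)  # [1, 2, 3, 4, 5]
print(keep_tail)  # [4, 5, 6, 7, 8]
```

Dropping from the end suits documents whose key content comes first; dropping from the start suits cases such as chat histories, where the most recent tokens matter most.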
