What is sequence padding and why is it necessary for batch training?
Padding adds dummy tokens to shorter sequences so all examples in a batch share the same length, which is required for tensor operations. Attention masks tell the model to ignore padded positions so they do not affect attention scores; padded positions should likewise be excluded from the loss.
How to think about it
Neural network frameworks process batches as rectangular tensors. If sentence A has 12 tokens and sentence B has 7, you cannot stack them into a 2x? matrix — the shape is undefined. Padding resolves this by filling shorter sequences to a fixed length using a special [PAD] token.
Two common padding strategies
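- Post-padding: pad at the end, e.g. [5, 22, 3, 0, 0]. The meaningful tokens come first, which is the default for BERT-style tokenizers. With RNNs it is typically paired with masking or packing so the final hidden state is not computed over padding.
- Pre-padding: pad at the start, e.g. [0, 0, 5, 22, 3]. Sometimes used when the model reads tokens in reverse, or when its output is taken from the last position, so that real tokens sit at the end.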
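The two strategies can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: `pad_batch` and `PAD_ID` are hypothetical names, and the pad id of 0 is assumed to match the examples above.

```python
from typing import List

PAD_ID = 0  # assumed pad id, matching the 0s in the examples above


def pad_batch(seqs: List[List[int]], side: str = "post",
              pad_id: int = PAD_ID) -> List[List[int]]:
    """Pad every sequence to the length of the longest one in the batch."""
    if side not in ("post", "pre"):
        raise ValueError("side must be 'post' or 'pre'")
    target = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [pad_id] * (target - len(s))
        # Post-padding appends the fill; pre-padding prepends it.
        padded.append(s + fill if side == "post" else fill + s)
    return padded


batch = [[5, 22, 3], [7, 8, 9, 10, 11]]
post = pad_batch(batch, side="post")
pre = pad_batch(batch, side="pre")
print(post)  # [[5, 22, 3, 0, 0], [7, 8, 9, 10, 11]]
print(pre)   # [[0, 0, 5, 22, 3], [7, 8, 9, 10, 11]]
```

Either way the result is rectangular, so it can be converted to a tensor; only the position of the real tokens differs.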
Attention masks
Padding tokens must not influence model outputs. Transformers accept an attention mask, a binary tensor with the same shape as input_ids, where 1 means “attend” and 0 means “ignore”.
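from transformers import AutoTokenizer
import torch  # backend required for return_tensors="pt"

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentences = [
    "Transformers use self-attention",
    "Short sentence",
]

batch = tokenizer(
    sentences,
    padding=True,         # pad to the longest sequence in the batch
    truncation=True,      # drop tokens beyond max_length
    max_length=16,
    return_tensors="pt",
)

# With padding=True the width is the longest tokenized sequence in the batch,
# not max_length; use padding="max_length" to always get (2, 16).
print(batch["input_ids"].shape)
print(batch["attention_mask"])
# The longer sentence's row is all 1s; the shorter row ends in 0s over the [PAD] slots.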
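To see what “ignore” means inside attention, here is a minimal sketch of a masked softmax in plain Python. It is an illustration of the general idea, not the code any particular library runs: `masked_softmax` is a hypothetical helper, the scores are made up, and it assumes at least one position is unmasked.

```python
import math
from typing import List


def masked_softmax(scores: List[float], mask: List[int]) -> List[float]:
    """Softmax over `scores`, giving zero weight wherever mask == 0."""
    # Replace masked scores with -inf so exp() turns them into exactly 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 3.0, 3.0]  # hypothetical raw attention scores
mask = [1, 1, 1, 0, 0]              # last two positions are padding
weights = masked_softmax(scores, mask)
print(weights)
```

The padded positions have the highest raw scores here, yet they receive zero weight, and the remaining weights still sum to 1.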
Dynamic vs. static padding
- Static padding: pad everything to a global max_length (e.g. 512). Simple but wastes compute on short batches.
- Dynamic padding (per-batch): pad only to the longest sequence in the current batch. Cuts GPU memory and can speed up training by 20-40% on variable-length corpora.
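The difference can be made concrete by counting how many token slots each strategy processes. The sketch below is a rough proxy for compute only: `padded_slots` is a hypothetical helper, the lengths are made-up token counts, and it assumes no sequence is longer than the static length.

```python
from typing import List


def padded_slots(lengths: List[int], batch_size: int, static_len: int = 0) -> int:
    """Total token slots processed when `lengths` are batched in order.

    static_len > 0 pads every batch to that fixed length (static padding);
    static_len == 0 pads each batch to its own longest sequence (dynamic).
    """
    total = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        width = static_len if static_len > 0 else max(chunk)
        total += width * len(chunk)
    return total


lengths = [12, 7, 30, 9, 110, 95, 18, 22]  # made-up token counts
static = padded_slots(lengths, batch_size=2, static_len=512)
dynamic = padded_slots(lengths, batch_size=2)
print(static, dynamic)  # 4096 348
```

Only 303 of those slots hold real tokens in either case; the rest is padding. Actual speedups depend on the model, hardware, and length distribution.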
Truncation: when a sequence exceeds max_length, tokens beyond the limit are dropped. Decide whether to truncate from the start or end depending on where the important content lives.
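A minimal sketch of the two truncation directions, on a plain list of token ids. `truncate` is a hypothetical helper that ignores special tokens such as [CLS] and [SEP], which a real tokenizer would preserve. Hugging Face tokenizers expose a similar choice through a `truncation_side` setting; check the documentation for your version.

```python
from typing import List


def truncate(ids: List[int], max_length: int, side: str = "right") -> List[int]:
    """Drop tokens beyond max_length from the chosen side."""
    if side not in ("right", "left"):
        raise ValueError("side must be 'right' or 'left'")
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if len(ids) <= max_length:
        return list(ids)
    # "right" keeps the start of the sequence; "left" keeps the end.
    return ids[:max_length] if side == "right" else ids[-max_length:]


ids = list(range(10))
keep_start = truncate(ids, 4, side="right")
keep_end = truncate(ids, 4, side="left")
print(keep_start)  # [0, 1, 2, 3]
print(keep_end)    # [6, 7, 8, 9]
```

Keeping the end can suit inputs where the most recent content matters most, such as dialogue history; keeping the start suits documents whose key information comes first.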