What are tokens in an LLM and why is API pricing per token rather than per word or character?
A token is the smallest unit a language model processes: typically a whole word, a sub-word fragment, or a punctuation mark, as produced by byte-pair encoding (BPE) or a similar algorithm. Pricing is per token because each token occupies one position in the model’s input or output sequence, and compute and memory cost scale with the number of positions, regardless of whether a token maps to a full word or a single letter.
How to think about it
Text is continuous; neural networks need discrete, fixed-size inputs. Tokenization bridges the gap by splitting text into a vocabulary of sub-word units and mapping each to an integer index.
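As a minimal sketch of the text-to-integer mapping described above, the snippet below uses a tiny hand-written vocabulary and a greedy longest-match split. Both the `VOCAB` table and the `encode`/`decode` helpers are hypothetical and exist only for illustration; production tokenizers learn vocabularies with tens of thousands of entries and apply learned merge rules rather than this simple lookup.

```python
# Hypothetical toy vocabulary: sub-word string -> integer ID.
VOCAB = {"un": 0, "believ": 1, "able": 2, "the": 3, "cat": 4, " ": 5}


def encode(text, vocab=VOCAB):
    """Split `text` into vocabulary IDs using greedy longest-match."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids


def decode(ids, vocab=VOCAB):
    """Map IDs back to their strings and concatenate them."""
    inverse = {i: s for s, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


print(encode("unbelievable"))      # [0, 1, 2]
print(decode(encode("the cat")))   # the cat
```

The model itself only ever sees the integer IDs; the strings matter only at the encode and decode boundaries.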
How tokenization works (BPE)
Byte-pair encoding (BPE) starts from individual characters or bytes and iteratively merges the most frequent adjacent pair into a single new token. Related schemes include WordPiece and the unigram model, and libraries such as SentencePiece and tiktoken implement these algorithms. The merge rules and the resulting vocabulary are learned from a large corpus. Common English words like “the” or “cat” become single tokens, while rarer or composite words are split: “unbelievable” might encode as [“un”, “believ”, “able”].
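The merge loop can be sketched in a few lines. This is a toy, character-level version trained on a six-word list, assuming whole words as the unit of counting; the function name `train_bpe` and the sample corpus are illustrative only, and real tokenizers add byte-level handling, pre-tokenization rules, and much larger corpora.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; on ties the first-seen pair wins.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


merges = train_bpe(["low", "low", "lower", "lowest", "newest", "newest"], num_merges=3)
print(merges)
```

Each learned pair becomes a new vocabulary entry, which is why frequent strings end up as single tokens while rare ones stay split into several pieces.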
Rules of thumb
- 1 token ≈ 4 characters of English prose.
- 1000 tokens ≈ 750 words.
- Code and non-English text are generally less efficient (more tokens per word) because they are typically underrepresented in the corpus used to build the vocabulary.
- Numbers are often split into single digits or short digit groups, depending on the tokenizer, so number-heavy text costs more tokens than its length suggests.
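
The first two rules of thumb above can be turned into a quick estimator for English prose. This is a rough sketch, not a substitute for running the real tokenizer: the helper name `estimate_tokens` is hypothetical, and averaging the two heuristics is an arbitrary choice made for illustration.

```python
def estimate_tokens(text):
    """Rough token estimate for English prose from two rules of thumb."""
    by_chars = len(text) / 4               # about 4 characters per token
    by_words = len(text.split()) / 0.75    # about 0.75 words per token
    return round((by_chars + by_words) / 2)


sample = "Tokenization splits text into sub-word units before the model sees it."
print(estimate_tokens(sample))
```

For billing or context-window checks, count with the provider's actual tokenizer; the estimate is only useful for back-of-the-envelope budgeting, and it will undercount for code and non-English text.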
Why pricing is per token, not per word
Every token occupies one position in the attention computation. With standard (full) attention, a sequence of N tokens requires on the order of N² pairwise attention scores per layer and N entries in the KV cache. The model’s cost, measured in FLOPs and memory bandwidth, therefore scales with token count, not with word or character count. Charging per token aligns the price with the actual resource usage on the provider’s hardware.
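```python
import tiktoken

# Load the cl100k_base BPE vocabulary.
enc = tiktoken.get_encoding("cl100k_base")

# encode() returns a list of integer token IDs.
tokens = enc.encode("Retrieval-augmented generation reduces hallucination.")
print(len(tokens))  # number of tokens this string is billed as
print(tokens)       # the integer token IDs
print([enc.decode([t]) for t in tokens])  # the text piece behind each ID
```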
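To make the scaling argument concrete, the sketch below counts attention score entries and KV-cache entries for a given sequence length. It assumes standard full attention and reports counts per layer and per head; the function name `attention_footprint` is hypothetical, and real deployments differ in constants and optimizations.

```python
def attention_footprint(n_tokens):
    """Per-layer, per-head counts for standard (full) attention."""
    return {
        "score_entries": n_tokens * n_tokens,  # one score per (query, key) pair
        "kv_cache_entries": n_tokens,          # one cached key/value pair per token
    }


for n in (1_000, 2_000, 4_000):
    print(n, attention_footprint(n))
```

Doubling the token count doubles the KV-cache entries and quadruples the score entries, while the word count of the underlying text plays no role in either number.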
Practical implications
| Action | Token impact |
|---|---|
| Long system prompts | Charged on every API call, not once |
| Verbose few-shot examples | Each example adds its full token count to every request |
| Whitespace padding | Runs of spaces, indentation, and blank lines can add tokens |
| JSON output | Brackets, quotes, and repeated keys add tokens; compact (minified) output costs less |
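
The first row of the table is easy to quantify. The sketch below totals input cost across many calls; the helper name `total_input_cost` is hypothetical, and the price of 1.00 per million input tokens is a made-up placeholder, not any provider's actual rate.

```python
def total_input_cost(system_tokens, user_tokens_per_call, calls, price_per_million):
    """Input-side cost when the system prompt is resent on every call."""
    tokens_per_call = system_tokens + user_tokens_per_call
    return tokens_per_call * calls * price_per_million / 1_000_000


# Hypothetical workload: 1,500-token system prompt, 200-token user message, 10,000 calls.
with_prompt = total_input_cost(1_500, 200, 10_000, price_per_million=1.00)
without_prompt = total_input_cost(0, 200, 10_000, price_per_million=1.00)
print(f"with system prompt:    {with_prompt:.2f}")     # 17.00
print(f"without system prompt: {without_prompt:.2f}")  # 2.00
```

Under these assumed numbers the system prompt accounts for most of the input bill, which is why trimming it usually pays off more than trimming individual user messages.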