Multimodal (vision & audio) LLMs
How a model 'sees': a vision encoder turns an image into patch tokens that flow into the LLM alongside text. Why images cost so many tokens, and the production patterns for documents, video, and audio.
What you'll learn
- How a vision-language model turns an image into tokens the LLM can read
- Why image resolution drives token count, cost, and latency
- Production patterns for documents, video, and audio inputs
Before you start
In 2026, “LLM” is a misnomer — the frontier models are multimodal: they take text, images, audio, and video, often in a single million-token context. If your mental model is text-only, you’re missing half of what these models do and most of where new products are being built. The good news: the mechanism is a small, elegant extension of everything you already know.
How a model “sees”
A vision-language model (VLM) has three parts bolted together:
- Vision encoder — usually a Vision Transformer. It chops the image into fixed-size patches (e.g. 14×14 pixels) and turns each patch into a vector, much as tokenization splits text into subwords, but for pixels.
- Projection layer — a small adapter that maps those patch vectors into the same embedding space the LLM uses for text tokens.
- The LLM — the usual transformer, now reading a stream of image tokens and text tokens side by side. To the LLM, an image is just “more tokens.”
That’s the whole trick: turn the image into tokens that live in the text token space, and the language model handles the rest. Audio works the same way (encode short audio frames into tokens); video is images-over-time.
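The patch-and-project pipeline described above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the names `patchify` and `project`, the 28×28 grayscale "image", the random weights, and the 8-wide embedding are all assumptions chosen for readability. A real vision encoder also runs transformer layers over the patch vectors before projection, which this sketch skips.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one patch x patch block, row by row.
            vec = [image[top + r][left + c]
                   for r in range(patch) for c in range(patch)]
            patches.append(vec)
    return patches

def project(vectors, weights):
    """Linear projection: weights has shape d_in x d_model."""
    d_model = len(weights[0])
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(d_model)]
        for v in vectors
    ]

random.seed(0)
PATCH, D_MODEL = 14, 8  # toy sizes (assumed); real embeddings are far wider

# Stand-in image and randomly initialised adapter weights (both hypothetical).
image = [[random.random() for _ in range(28)] for _ in range(28)]
weights = [[random.gauss(0, 0.02) for _ in range(D_MODEL)]
           for _ in range(PATCH * PATCH)]

image_tokens = project(patchify(image, PATCH), weights)
text_tokens = [[0.0] * D_MODEL for _ in range(3)]  # placeholder text embeddings
sequence = image_tokens + text_tokens              # one stream for the LLM

print(len(image_tokens), len(image_tokens[0]), len(sequence))
```

A 28×28 image with 14×14 patches yields a 2×2 grid, so the LLM receives four image vectors followed by the three text vectors, all with the same width.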
Why images are expensive
Because an image becomes tokens, resolution drives cost. More pixels → more patches → more tokens → more money and latency. Token count scales with image area, so doubling both width and height roughly quadruples the tokens. High-resolution inputs are often tiled into multiple crops, each adding its own block of tokens.
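To make the arithmetic concrete, here is a rough estimator, assuming one token per 14×14 patch and optional square tiling. The function name `image_tokens`, the tile size, and the price are hypothetical. Real providers use their own formulas, and many models pool or merge patches, so treat the numbers as an upper-level intuition rather than a billing calculator.

```python
import math

def image_tokens(width, height, patch=14, tile=None):
    """Estimate patch tokens for an image.

    Without tiling: one token per patch over the whole image.
    With tiling: the image is cut into tile x tile crops, and each
    crop contributes its own full grid of patches.
    """
    if tile is None:
        return math.ceil(width / patch) * math.ceil(height / patch)
    crops = math.ceil(width / tile) * math.ceil(height / tile)
    per_crop = math.ceil(tile / patch) ** 2
    return crops * per_crop

PRICE_PER_MILLION = 3.00  # hypothetical USD per 1M input tokens

for side in (224, 448, 896):
    n = image_tokens(side, side)
    cost = n * PRICE_PER_MILLION / 1_000_000
    print(f"{side}x{side}: {n} tokens, about ${cost:.6f}")
```

Under these assumptions, square images of side 224, 448, and 896 come to 256, 1024, and 4096 tokens: each doubling of the side length quadruples the count.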
Production patterns
- Documents — for PDFs and scans, VLMs now often beat traditional OCR because they read layout, tables, and handwriting in context. (Agentic document-parsing tools such as LlamaParse build on this.)
- Video — sample frames (you rarely feed every frame) and feed them as a sequence of images; pair with the transcript for audio content.
- Audio — either a speech-to-text front-end feeding a text LLM, or a natively multimodal model that ingests audio tokens directly for tone and non-speech cues.
- Grounding outputs — many VLMs can return bounding boxes or point to regions, not just describe — useful for UI automation and document extraction.
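For the video pattern above, frame sampling can be sketched as picking one frame per fixed time interval and then thinning to a cap. The helper `sample_frame_indices` and its parameters are hypothetical, and uniform sampling is only one possible strategy (scene-change detection is another); decoding the actual frames is left to whatever video library you use.

```python
def sample_frame_indices(total_frames, fps, every_seconds=1.0, max_frames=None):
    """Return indices of frames to keep, roughly one per `every_seconds`."""
    if total_frames <= 0 or fps <= 0 or every_seconds <= 0:
        raise ValueError("total_frames, fps and every_seconds must be positive")
    # Number of source frames between two kept frames (at least 1).
    step = max(1, round(fps * every_seconds))
    indices = list(range(0, total_frames, step))
    if max_frames is not None and len(indices) > max_frames:
        # Thin evenly so the kept frames still span the whole clip.
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices

# A 10-second clip at 30 fps, one frame per second, capped at 5 frames.
kept = sample_frame_indices(total_frames=300, fps=30, max_frames=5)
print(kept)
```

Each kept frame is then billed like any other image, so the cap on frames is effectively a cap on tokens.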
Next
Images are tokens, so they obey the same economics as everything else — cost & latency engineering and model routing apply directly. For reliable structured output from any model, see constrained decoding.
Practice this in an interview
Vision-language models encode images with a vision encoder and project those features into the language model's token space so it can reason over images and text jointly, often via a connector or projection layer. CLIP is a contrastively trained image-text model that aligns image and text embeddings in a shared space, widely used as the vision backbone or for zero-shot retrieval and grounding.
A token is the smallest unit a language model processes, typically a word, sub-word fragment, or punctuation mark produced by byte-pair encoding (BPE) or a similar algorithm. Pricing is per token because every token occupies one position in the sequence the model must process, so compute and memory cost scale with token count whether a token maps to a full word or a single letter.
An LLM generates text one token at a time by computing a probability distribution over its entire vocabulary for the next token, sampling from that distribution, appending the result, and repeating — a process called autoregression. Each new token is conditioned on all previously generated tokens, so the output at step N is only as good as the choices made at steps 1 through N-1.
Hallucinations occur because an LLM is trained to produce plausible next tokens, not verified facts — it has no internal truth-checking mechanism, only statistical patterns. Common causes include rare or conflicting training data, overconfident decoding, and prompts that lead the model to extrapolate beyond what it learned. Mitigation strategies include retrieval-augmented generation, grounding responses to retrieved sources, lowering temperature, and calibrated refusal training.