datarekha

Multimodal (vision & audio) LLMs

How a model 'sees': a vision encoder turns an image into patch tokens that flow into the LLM alongside text. Why images cost so many tokens, and the production patterns for documents, video, and audio.

8 min read Intermediate Generative AI Lesson 8 of 33

What you'll learn

  • How a vision-language model turns an image into tokens the LLM can read
  • Why image resolution drives token count, cost, and latency
  • Production patterns for documents, video, and audio inputs

Before you start

In 2026, “LLM” is a misnomer — the frontier models are multimodal: they take text, images, audio, and video, often in a single million-token context. If your mental model is text-only, you’re missing half of what these models do and most of where new products are being built. The good news: the mechanism is a small, elegant extension of everything you already know.

How a model “sees”

A vision-language model (VLM) has three parts bolted together:

  1. Vision encoder: usually a Vision Transformer (ViT). It cuts the image into fixed-size patches (e.g. 14×14 pixels) and turns each patch into a vector. This is analogous to tokenization splitting text into subwords, except that each patch becomes a continuous vector rather than a discrete token ID.
  2. Projection layer: a small adapter that maps those patch vectors into the same embedding space the LLM uses for text tokens.
  3. The LLM: the usual transformer, now reading a stream of image tokens and text tokens side by side. To the LLM, an image is just “more tokens.”

That’s the whole trick: turn the image into vectors in the same embedding space as text tokens, and the language model handles the rest. Audio works the same way (short audio frames are encoded into tokens), and video is handled as a sequence of images over time.
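
The three parts above can be sketched in a few lines of plain Python. This is a toy illustration, not a real model: the sizes (`PATCH`, `VISION_DIM`, `LLM_DIM`), the random weights, and the helper names are all assumptions made for the example, and a single linear layer stands in for a full Vision Transformer. What it shows is the shape of the data at each step.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches

def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(vectors, weights):
    """Apply one weight matrix (inputs x outputs) to every vector."""
    return [[sum(x * w for x, w in zip(vec, column)) for column in zip(*weights)]
            for vec in vectors]

rng = random.Random(0)
PATCH, VISION_DIM, LLM_DIM = 14, 8, 16  # hypothetical toy sizes

image = [[rng.random() for _ in range(28)] for _ in range(28)]  # 28 x 28 pixels
patches = patchify(image, PATCH)  # 4 patches, 196 pixel values each

# 1. Vision encoder: one linear layer stands in for a full Vision Transformer.
patch_features = linear(patches, random_matrix(PATCH * PATCH, VISION_DIM, rng))
# 2. Projection layer: map vision features to the LLM embedding width.
image_tokens = linear(patch_features, random_matrix(VISION_DIM, LLM_DIM, rng))
# 3. LLM input: image tokens and text token embeddings in one sequence.
text_tokens = [[rng.uniform(-0.1, 0.1) for _ in range(LLM_DIM)] for _ in range(3)]
sequence = image_tokens + text_tokens

print(len(patches), len(image_tokens[0]), len(sequence))  # 4 16 7
```

After projection, the four image vectors have the same width as the three text embeddings, which is why the LLM can consume them as one sequence of seven positions.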

Why images are expensive

Because an image becomes tokens, resolution drives cost. More pixels → more patches → more tokens → more money and latency. High-resolution inputs are often tiled into multiple crops, each adding its own block of tokens.
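
A rough estimator makes the scaling concrete. The sketch below assumes one token per 14×14 patch and, for tiling, 336-pixel crops plus one downscaled overview crop. These numbers and the function names are illustrative assumptions; each provider uses its own formula, and many models merge or compress patches, so treat the results as orders of magnitude rather than a bill.

```python
def patch_tokens(width, height, patch=14):
    """Tokens for one image if every patch becomes one token (assumption)."""
    across = (width + patch - 1) // patch   # ceiling division
    down = (height + patch - 1) // patch
    return across * down

def tiled_tokens(width, height, tile=336, patch=14):
    """Tokens when the image is cut into tile x tile crops plus one overview crop."""
    tiles_across = (width + tile - 1) // tile
    tiles_down = (height + tile - 1) // tile
    per_tile = patch_tokens(tile, tile, patch)
    return (tiles_across * tiles_down + 1) * per_tile

def cost_usd(tokens, usd_per_million_tokens):
    """Input cost at a hypothetical per-million-token price."""
    return tokens * usd_per_million_tokens / 1_000_000

for side in (224, 448, 896):
    print(side, patch_tokens(side, side))
# 224 256
# 448 1024
# 896 4096

print(tiled_tokens(1000, 700))  # 5760
```

Doubling each side quadruples the token count, which is why downscaling an image to the smallest size that still answers the question is usually the first cost lever to try.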

Production patterns

  • Documents — for PDFs and scans, VLMs now often beat traditional OCR because they read layout, tables, and handwriting in context. (The agentic-document parsing world, e.g. LlamaParse, is built on this.)
  • Video — sample frames (you rarely feed every frame) and feed them as a sequence of images; pair with the transcript for audio content.
  • Audio — either a speech-to-text front-end feeding a text LLM, or a natively multimodal model that ingests audio tokens directly for tone and non-speech cues.
  • Grounding outputs — many VLMs can return bounding boxes or point to regions, not just describe — useful for UI automation and document extraction.
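
The video pattern above (sample frames instead of sending every one) can be sketched as a small index selector. The function name, the one-frame-per-second default, the 32-frame cap, and the 256-tokens-per-frame figure are all assumptions for illustration; decoding the frames themselves is left to whatever video library you already use.

```python
def sample_frame_indices(total_frames, fps, seconds_between=1.0, max_frames=32):
    """Pick frame indices at a fixed time interval, capped at max_frames."""
    step = max(1, round(fps * seconds_between))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:
        # Thin evenly so the kept frames still span the whole clip.
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices

# A 10-second clip at 30 fps: one frame per second gives 10 frames.
frames = sample_frame_indices(total_frames=300, fps=30.0)
print(frames)  # [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]

# With a tighter budget, the same clip is thinned to 5 frames.
print(sample_frame_indices(300, 30.0, max_frames=5))  # [0, 60, 120, 180, 240]

TOKENS_PER_FRAME = 256  # hypothetical, e.g. a 224 x 224 frame with 14 x 14 patches
print(len(frames) * TOKENS_PER_FRAME)  # 2560
```

Capping the frame count keeps the token budget predictable for long clips, at the price of coarser temporal coverage.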

Quick check

Q1. How does a vision-language model let an LLM 'read' an image?
Q2. Why does sending a higher-resolution image cost more?
Q3. What's a good default habit for multimodal cost control?

Next

Images are tokens, so they obey the same economics as everything else — cost & latency engineering and model routing apply directly. For reliable structured output from any model, see constrained decoding.

Practice this in an interview

How do multimodal vision-language models combine images and text, and what role does CLIP play?

Vision-language models encode images with a vision encoder and project those features into the language model's token space so it can reason over images and text jointly, often via a connector or projection layer. CLIP is a contrastively trained image-text model that aligns image and text embeddings in a shared space, widely used as the vision backbone or for zero-shot retrieval and grounding.
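
The shared embedding space is what makes zero-shot matching possible: embed the image, embed each candidate caption, and pick the caption with the highest cosine similarity. The sketch below uses invented three-dimensional vectors in place of real encoder outputs (real CLIP embeddings have hundreds of dimensions and come from trained image and text encoders), so only the comparison step is shown.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot(image_vec, caption_vecs):
    """Return the best-matching caption and all similarity scores."""
    scores = {caption: cosine(image_vec, vec) for caption, vec in caption_vecs.items()}
    return max(scores, key=scores.get), scores

# Made-up embeddings, for illustration only.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.1, 1.0, 0.0],
    "a chart": [0.0, 0.2, 1.0],
}

best, scores = zero_shot(image_embedding, caption_embeddings)
print(best)  # a photo of a cat
```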

What are tokens in an LLM and why is API pricing per token rather than per word or character?

A token is the smallest unit a language model processes: typically a word, sub-word fragment, or punctuation mark produced by byte-pair encoding (BPE) or a similar algorithm. Pricing is per token because each token occupies one position in the sequence the model processes, so compute and memory scale with the token count regardless of whether a token maps to a full word or a single letter.

How does an LLM generate text — what is next-token prediction and autoregression?

An LLM generates text one token at a time by computing a probability distribution over its entire vocabulary for the next token, sampling from that distribution, appending the result, and repeating — a process called autoregression. Each new token is conditioned on all previously generated tokens, so the output at step N is only as good as the choices made at steps 1 through N-1.
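
The loop itself is short. In the sketch below, a hand-written lookup table stands in for the model: the vocabulary, the probabilities, and the function names are invented for the example, and the toy `next_token_distribution` looks only at the last token, whereas a real LLM conditions on the whole prefix. The generate-sample-append-repeat structure is the part that carries over.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the last token.
TABLE = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def next_token_distribution(prefix):
    """Stand-in for a model forward pass; a real LLM uses the entire prefix."""
    return TABLE[prefix[-1]]

def generate(rng, max_tokens=10):
    """Sample one token at a time, appending each to the context."""
    tokens = ["<s>"]
    while len(tokens) < max_tokens:
        dist = next_token_distribution(tokens)
        choice = rng.choices(list(dist), weights=list(dist.values()))[0]
        if choice == "</s>":  # end-of-sequence token stops generation
            break
        tokens.append(choice)
    return tokens[1:]  # drop the start marker

output = generate(random.Random(0))
print(" ".join(output))
```

Depending on the sampled choice at the second step, the output is either "the cat sat" or "the dog sat", and everything after that choice is conditioned on it.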

What causes LLM hallucinations and how can they be reduced?

Hallucinations occur because an LLM is trained to produce plausible next tokens, not verified facts — it has no internal truth-checking mechanism, only statistical patterns. Common causes include rare or conflicting training data, overconfident decoding, and prompts that lead the model to extrapolate beyond what it learned. Mitigation strategies include retrieval-augmented generation, grounding responses to retrieved sources, lowering temperature, and calibrated refusal training.
