How do multimodal vision-language models combine images and text, and what role does CLIP play?

Vision-language models encode images with a vision encoder and project those features into the language model's token space so it can reason over images and text jointly, often via a connector or projection layer. CLIP is a contrastively trained image-text model that aligns image and text embeddings in a shared space, widely used as the vision backbone or for zero-shot retrieval and grounding.
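To make the "shared space" idea concrete, here is a minimal sketch of CLIP-style zero-shot matching: score one image embedding against several caption embeddings by cosine similarity, then softmax the scores. The 3-dimensional embeddings, the captions, and the temperature value are all made up for illustration; a real CLIP model produces the embeddings with its trained encoders.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors divided by their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_probs(image_emb, text_embs, temperature=0.07):
    # Similarity to every caption, scaled by an assumed temperature, then softmax.
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings (hypothetical values, not output from any real encoder).
image_emb = [0.9, 0.1, 0.0]
captions = ["a photo of a cat", "a photo of a dog", "a diagram"]
text_embs = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0],
             [0.0, 0.0, 1.0]]

probs = zero_shot_probs(image_emb, text_embs)
best = captions[probs.index(max(probs))]
print(best)
```

The caption whose embedding points in nearly the same direction as the image embedding wins, which is all "zero-shot retrieval" needs once both encoders share a space.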

What are tokens in an LLM and why is API pricing per token rather than per word or character?

A token is the smallest unit a language model processes: typically a word, sub-word fragment, or punctuation mark produced by byte-pair encoding (BPE) or a similar algorithm. Pricing is per token because each token occupies one position in the sequence the model processes, and compute and memory scale with the number of positions, regardless of whether a token maps to a full word or a single letter.
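A rough sketch of why word counts and token counts diverge. The tokenizer below is a greedy longest-match splitter over a tiny hypothetical vocabulary, a simplified stand-in rather than real BPE, and the price passed to `cost_usd` is an illustrative number, not any provider's rate.

```python
# Hypothetical sub-word vocabulary; real tokenizers learn tens of thousands of entries.
VOCAB = {"token", "ization", "un", "believ", "able", " ", "s"}

def tokenize(text):
    # Greedy longest-match split; anything not in VOCAB falls back to one character.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def cost_usd(n_tokens, usd_per_million):
    # Billing is linear in the token count.
    return n_tokens * usd_per_million / 1_000_000

pieces = tokenize("unbelievable tokenizations")
print(pieces, len(pieces))
print(cost_usd(len(pieces), usd_per_million=2.0))  # 2.0 is an assumed price
```

Two words become seven tokens here, so a per-word price would misstate the work the model actually does.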

How does an LLM generate text — what is next-token prediction and autoregression?

An LLM generates text one token at a time by computing a probability distribution over its entire vocabulary for the next token, sampling from that distribution, appending the result, and repeating — a process called autoregression. Each new token is conditioned on all previously generated tokens, so the output at step N is only as good as the choices made at steps 1 through N-1.
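The loop described above can be sketched with a toy model. The probability table here is hypothetical and conditions only on the previous token; a real LLM conditions on the whole prefix and has a vocabulary of many thousands of tokens.

```python
import random

# Hypothetical next-token probabilities keyed by the previous token only.
NEXT = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def generate(rng, max_tokens=10):
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = NEXT[tokens[-1]]                 # distribution over the next token
        choices, weights = zip(*dist.items())
        nxt = rng.choices(choices, weights=weights, k=1)[0]  # sample one token
        if nxt == "</s>":                       # stop at the end-of-sequence token
            break
        tokens.append(nxt)                      # append, then repeat
    return tokens[1:]                           # drop the start marker

out = generate(random.Random(0))
print(" ".join(out))
```

Once the sampler picks "cat" or "dog" at step two, every later step is conditioned on that choice, which is the dependency the paragraph above describes.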

What causes LLM hallucinations and how can they be reduced?

Hallucinations occur because an LLM is trained to produce plausible next tokens, not verified facts — it has no internal truth-checking mechanism, only statistical patterns. Common causes include rare or conflicting training data, overconfident decoding, and prompts that lead the model to extrapolate beyond what it learned. Mitigation strategies include retrieval-augmented generation, grounding responses to retrieved sources, lowering temperature, and calibrated refusal training.
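Of the mitigations listed, temperature is the easiest to illustrate. The sketch below uses made-up logits to show how dividing by a lower temperature concentrates probability on the top-scoring token. It only reduces sampling randomness; it does not make the top token correct.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature, then apply a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                         # hypothetical scores for three tokens
cool = softmax_with_temperature(logits, 0.5)     # lower temperature: sharper
warm = softmax_with_temperature(logits, 1.5)     # higher temperature: flatter

print([round(p, 3) for p in cool])
print([round(p, 3) for p in warm])
```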

Multimodal (vision & audio) LLMs — Generative AI

In 2026, “LLM” is a misnomer — the frontier models are multimodal: they take text, images, audio, and video, often in a single million-token context. If your mental model is text-only, you’re missing half of what these models do and most of where new products are being built. The good news: the mechanism is a small, elegant extension of everything you already know.

How a model “sees”

A vision-language model (VLM) has three parts bolted together:

Vision encoder — usually a Vision Transformer. It chops the image into fixed patches (e.g. 14×14 pixels), and turns each patch into a vector — exactly like tokenization splits text into subwords, but for pixels.
Projection layer — a small adapter that maps those patch vectors into the same embedding space the LLM uses for text tokens.
The LLM — the usual transformer, now reading a stream of image tokens and text tokens side by side. To the LLM, an image is just “more tokens.”
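The three parts can be sketched in a few lines, under heavy simplification: the "encoder" below just flattens each patch (a real Vision Transformer also runs attention layers over them), and the projection is a fixed hypothetical matrix rather than learned weights. The 4×4 image, the 2×2 patch size, and the 3-dimensional embedding width are chosen only to keep the example readable.

```python
def patchify(image, patch):
    # Split a square grayscale image (list of rows) into flattened patch vectors.
    size = len(image)
    patches = []
    for top in range(0, size, patch):
        for left in range(0, size, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def project(vec, weights):
    # Linear map: `weights` holds one row per output dimension.
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

image = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4x4 image
patches = patchify(image, patch=2)                          # 4 patches of 4 pixels

W = [[0.25, 0.25, 0.25, 0.25],   # hypothetical projection: 4 pixel values -> 3 dims
     [1.0, 0.0, 0.0, -1.0],
     [0.0, 1.0, -1.0, 0.0]]
image_embs = [project(p, W) for p in patches]

text_embs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # stand-ins for two text token embeddings
sequence = image_embs + text_embs               # what the LLM reads: one mixed stream
print(len(sequence), sequence[0])
```

After projection the image vectors and text vectors have the same width, so they can sit in one sequence.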

That’s the whole trick: turn the image into tokens that live in the text token space, and the language model handles the rest. Audio works the same way (encode short audio frames into tokens); video is images-over-time.
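For audio, the "short frames" step might look like the sketch below, which slices a waveform into overlapping windows. The 16 kHz sample rate, 25 ms window, and 10 ms hop are common speech-processing conventions assumed here for illustration; real audio encoders typically compress these frames further before they reach the LLM.

```python
def frame_audio(samples, frame_len, hop):
    # Slice a 1-D sample list into overlapping fixed-length frames (drops the ragged tail).
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]

SAMPLE_RATE = 16_000                        # assumed sample rate in Hz
frame_len = SAMPLE_RATE * 25 // 1000        # 25 ms window = 400 samples
hop = SAMPLE_RATE * 10 // 1000              # 10 ms hop = 160 samples

one_second = [0.0] * SAMPLE_RATE            # silent placeholder audio
frames = frame_audio(one_second, frame_len, hop)
print(len(frames), len(frames[0]))
```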

Why images are expensive

Because an image becomes tokens, resolution drives cost. More pixels → more patches → more tokens → more money and latency. High-resolution inputs are often tiled into multiple crops, each adding its own block of tokens. Watch the token bill climb as resolution rises:

# How image token cost scales with resolution (a representative tiled encoder).
BASE, TILE, TOK_PER_TILE = 85, 512, 256

def image_tokens(res):
    tiles = (-(-res // TILE)) ** 2          # ceil(res/512) ** 2 tiles
    return BASE + tiles * TOK_PER_TILE

print(f"{'resolution':>11} {'tiles':>6} {'tokens':>8}")
for res in [512, 1024, 1536, 2048]:
    tiles = (-(-res // TILE)) ** 2
    print(f"{res:10d}px {tiles:6d} {image_tokens(res):8d}")

 resolution  tiles   tokens
       512px      1      341
      1024px      4     1109
      1536px      9     2389
      2048px     16     4181

Doubling the resolution roughly quadruples the tiles, and the token count with them: under this scheme a 2048px image costs 4,181 tokens, as much as several pages of text. About 30 such images would fill a 128k-token context window on their own.
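The same formula can be turned around to plan a budget. This sketch reuses the representative constants from the snippet above; the 128,000-token context size and the list of candidate resolutions are assumptions for illustration.

```python
BASE, TILE, TOK_PER_TILE = 85, 512, 256     # same representative constants as above

def image_tokens(res):
    tiles = (-(-res // TILE)) ** 2          # ceil(res / TILE) squared
    return BASE + tiles * TOK_PER_TILE

def images_that_fit(context_tokens, res):
    # How many images of this resolution fit if the context holds nothing else.
    return context_tokens // image_tokens(res)

def largest_resolution_within(budget, candidates=(512, 1024, 1536, 2048)):
    # Highest candidate resolution whose token cost stays within the budget.
    fitting = [r for r in candidates if image_tokens(r) <= budget]
    return max(fitting) if fitting else None

print(images_that_fit(128_000, 2048))       # assumed 128k-token context
print(images_that_fit(128_000, 512))
print(largest_resolution_within(1500))
```

Dropping from 2048px to 512px lets more than ten times as many images fit in the same context under these constants.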

Production patterns

Documents — for PDFs and scans, VLMs now often beat traditional OCR because they read layout, tables, and handwriting in context. (The agentic-document parsing world, e.g. LlamaParse, is built on this.)
Video — sample frames (you rarely feed every frame) and feed them as a sequence of images; pair with the transcript for audio content.
Audio — either a speech-to-text front-end feeding a text LLM, or a natively multimodal model that ingests audio tokens directly for tone and non-speech cues.
Grounding outputs — many VLMs can return bounding boxes or point to regions, not just describe — useful for UI automation and document extraction.
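For the video pattern, one simple way to sample frames is to take evenly spaced indices, one from the middle of each equal segment. This is a sketch of that single strategy; the function name and the centering choice are illustrative, and scene-change detection or keyframe extraction are equally reasonable alternatives.

```python
def sample_frame_indices(total_frames, num_samples):
    # Pick evenly spaced frame indices, one from the middle of each equal segment.
    if num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))    # short clip: keep every frame
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

# A 10-second clip at 30 fps, reduced to 4 frames.
indices = sample_frame_indices(300, 4)
print(indices)
```

Each sampled frame is then billed like any other image, so the number of samples multiplies the per-image token cost directly.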

In one breath

A vision-language model bolts a vision encoder + a projection layer onto an ordinary LLM.
The encoder splits an image into patches and turns each into a vector; the projection maps them into the text token space.
To the LLM, an image is “just more tokens” — read side by side with words; audio and video work the same way.
Resolution drives cost — more pixels → more patches/tiles → more tokens → more money and latency.
The biggest lever: downscale before you send, reserving high-detail mode for genuine fine print.

Quick check

Q1. How does a vision-language model let an LLM 'read' an image?

Q2. Why does sending a higher-resolution image cost more?

Q3. What's a good default habit for multimodal cost control?

Images are tokens, so they obey the same economics as everything else — cost & latency engineering and model routing apply directly. For reliable structured output from any model, see constrained decoding.
