Phase 13 — The Hitchhiker's Guide to Multimodal Models


Don't Panic
Step 1: How a decoder-only LLM "sees" — the splice
Step 2: Placeholders — the contract between processor and model
Step 3: The cost — one image is a paragraph… or a chapter
Step 4: The encoder cache — don't encode the same image twice
Step 5: Encoder meets chunked prefill — the scheduling problem
Step 6: Prefix caching with pixels — hashing the image itself
The invariants to memorize
What you'll do

Don't Panic

A vision-language model is not a new kind of engine. It's the same LLM you've been serving for twelve phases, plus a vision encoder bolted on the front. The image is encoded into a sequence of embedding vectors, and those vectors are spliced into the text embedding stream at reserved placeholder positions. From the first transformer layer onward, the model cannot tell which positions were words and which were pixels — KV cache, paged attention, continuous batching, all of it works unchanged. The new engineering is all around the splice: expanding placeholders, scheduling the encoder, and caching its output.

 "What is in <image> ?"                       ┌──────────────┐
        │ tokenize + expand          pixels ──► vision encoder│──► [E1 E2 … E576]
        ▼                                     └──────────────┘        │ projector
 [What][is][in][IMG][IMG]…[IMG][?]                                    │ (to LLM dim)
        │ embed text                                                  ▼
 [w1][w2][w3][▢][▢]…[▢][w4]  ──── overwrite ▢ positions ────► [w1 w2 w3 E1 … E576 w4]
                                                                      │
                                                                      ▼
                                                          ordinary LLM forward (Phases 0–11)

Step 1: How a decoder-only LLM "sees" — the splice

Three parts (LLaVA is the canonical layout, llava.py):

Vision encoder (a ViT): image → grid of patches (e.g. 24×24 = 576) → one embedding per patch.
Projector (an MLP): maps encoder embeddings into the LLM's hidden dimension. Two matrices are all it takes to make pixels speak the language model's language.
The LLM: receives inputs_embeds where placeholder positions have been overwritten with projected image embeddings. In vLLM the overwrite is literally one indexed assignment: inputs_embeds[is_multimodal] = mm_embeds (models/utils.py:456, _merge_multimodal_embeddings).

That's the whole trick. Cross-attention encoder-decoder models (Whisper-style) are the exception, not the rule, in today's VLM zoo — the spliced decoder-only design won.
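The overwrite can be sketched in a few lines of plain Python. This is a toy stand-in, not vLLM's code: lists of floats replace tensors, the 2-dimensional embeddings and the `IMG` token id are made up for illustration, and the count check only mirrors the error message that upstream is described as raising.

```python
def merge_multimodal_embeddings(inputs_embeds, is_multimodal, mm_embeds):
    """Overwrite placeholder rows of inputs_embeds with image embeddings, in order."""
    num_slots = sum(is_multimodal)
    if num_slots != len(mm_embeds):
        raise ValueError(
            f"Attempted to assign {len(mm_embeds)} multimodal tokens "
            f"to {num_slots} placeholders"
        )
    image_rows = iter(mm_embeds)
    # Text rows pass through untouched; placeholder rows take the next image row.
    return [
        next(image_rows) if is_mm else row
        for row, is_mm in zip(inputs_embeds, is_multimodal)
    ]


IMG = -1  # hypothetical placeholder token id
token_ids = [10, 11, 12, IMG, IMG, IMG, 13]

# Toy "text embedding": each token id becomes a 2-dim vector.
inputs_embeds = [[float(t), 0.0] for t in token_ids]
is_multimodal = [t == IMG for t in token_ids]

# Stand-in for three projected patch embeddings.
mm_embeds = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]

merged = merge_multimodal_embeddings(inputs_embeds, is_multimodal, mm_embeds)
```

After the call, `merged` has the same length as the prompt, and nothing in it records which rows came from the image. That is the property the rest of the engine relies on.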

Step 2: Placeholders — the contract between processor and model

Before the model runs, the multimodal processor rewrites the prompt: the single <image> marker becomes N repeated image tokens, and a PlaceholderRange(offset, length) (multimodal/inputs.py:119) records exactly where. This bookkeeping is the contract:

The processor side promises: positions [offset, offset+length) are dummies awaiting embeddings. Some models interleave real structure (row separators) inside the range, so is_embed can mask which positions are actually image slots.
The model side promises: the encoder will produce exactly length (or is_embed.sum()) embeddings. Get the count wrong and you get the classic VLM crash: upstream raises "Attempted to assign X multimodal tokens to Y placeholders" (utils.py:484). Lab-01 makes you maintain this invariant by hand.
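A minimal sketch of the processor side of that contract follows. The `PlaceholderRange` here is a toy dataclass with the same two fields the text names, not the upstream class; `IMAGE_TOKEN_ID` and the text token ids are hypothetical, and the sketch ignores `is_embed` masking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceholderRange:
    offset: int  # index of the first image token in the expanded prompt
    length: int  # number of image tokens, i.e. embeddings the encoder owes


IMAGE_TOKEN_ID = 32000  # hypothetical id, used for the marker and its expansion


def expand_placeholders(prompt_ids, tokens_per_image):
    """Replace each image marker with N image tokens and record where they went."""
    expanded, ranges = [], []
    for tok in prompt_ids:
        if tok == IMAGE_TOKEN_ID:
            ranges.append(PlaceholderRange(len(expanded), tokens_per_image))
            expanded.extend([IMAGE_TOKEN_ID] * tokens_per_image)
        else:
            expanded.append(tok)
    return expanded, ranges


# "What is in <image> ?" with made-up text token ids.
prompt_ids = [101, 102, 103, IMAGE_TOKEN_ID, 104]
expanded, ranges = expand_placeholders(prompt_ids, tokens_per_image=576)

print(len(expanded))  # 4 text tokens + 576 image tokens
print(ranges)
```

The model side then has to hand back exactly `ranges[0].length` embeddings for this image, or the merge step fails.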

Step 3: The cost — one image is a paragraph… or a chapter

Image tokens are real tokens downstream: they occupy KV-cache blocks (Phase 2), consume scheduler token budget (Phase 3), and lengthen every later attention read. Typical scales:

| Model | One image becomes |
| --- | --- |
| LLaVA-1.5 (fixed 336²) | 576 tokens, always |
| Qwen2-VL (dynamic resolution) | ~4 → ~16k tokens, ∝ pixel count |

Dynamic resolution is the dangerous one: token count grows quadratically with image side length (lab-02 measures the law on real Qwen2-VL). A 4-image request can dwarf its own text. This is why MM models need their own memory profiling (compute_mm_encoder_budget, encoder_cache_manager.py:269) — the worst-case image inflates both KV and the encoder cache, and the engine must reserve for it at startup.
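The quadratic law is easy to check with back-of-envelope arithmetic. The sketch below assumes each image token covers a 28×28-pixel tile (a common description of Qwen2-VL's defaults, not something stated above) and a KV block size of 16 tokens (also an assumption). A real processor additionally rounds and clamps the image size, which this ignores.

```python
import math


def image_tokens(height_px, width_px, pixels_per_token_side=28):
    """Approximate token count: one token per (side x side) pixel tile."""
    rows = math.ceil(height_px / pixels_per_token_side)
    cols = math.ceil(width_px / pixels_per_token_side)
    return rows * cols


def kv_blocks(num_tokens, block_size=16):
    """KV-cache blocks needed to hold num_tokens (block_size is an assumption)."""
    return math.ceil(num_tokens / block_size)


for side in (224, 448, 896, 1792):
    n = image_tokens(side, side)
    print(f"{side}x{side} px -> {n} tokens -> {kv_blocks(n)} KV blocks")
```

Under these assumptions, each doubling of the side length multiplies the token count and the block count by four: 64, 256, 1024, then 4096 tokens.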

Step 4: The encoder cache — don't encode the same image twice

Encoder output is expensive (a full ViT forward) and reusable — the same image appears across chunked-prefill steps of one request, across retries, across users pasting the same screenshot. vLLM keeps finished encoder outputs in an EncoderCacheManager (v1/core/encoder_cache_manager.py:17), a second cache next to the KV cache with its own currency: it's measured in encoder embeddings, not blocks.

The design rhymes with Phase 2's block pool. Learn the mapping:

| BlockPool (Phase 2) | EncoderCacheManager (here) |
| --- | --- |
| block hash | `mm_hash` (content hash of the image) |
| `ref_cnt` | `cached[mm_hash]` = set of referencing request IDs |
| free queue (LRU eviction) | `freeable` OrderedDict (evict oldest unreferenced) |
| allocate / free | `allocate` / `free_encoder_input`, reclaim at allocation time |

Cross-request sharing falls out of content hashing: two requests with the same image hit the same mm_hash (check_and_update_cache, :91).
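The mapping above can be made concrete with a toy cache. This is a sketch under simplifying assumptions, not the upstream class: the method names and signatures are hypothetical, sizes are plain integers counting embeddings, and the embeddings themselves are not stored.

```python
from collections import OrderedDict


class ToyEncoderCache:
    def __init__(self, capacity):
        self.capacity = capacity       # budget, measured in encoder embeddings
        self.used = 0
        self.size = {}                 # mm_hash -> number of embeddings held
        self.cached = {}               # mm_hash -> set of referencing request ids
        self.freeable = OrderedDict()  # unreferenced entries, oldest first

    def check_and_update(self, req_id, mm_hash):
        """On a hit, register req_id as a reference and return True."""
        if mm_hash not in self.cached:
            return False
        self.freeable.pop(mm_hash, None)  # referenced again, so not evictable
        self.cached[mm_hash].add(req_id)
        return True

    def allocate(self, req_id, mm_hash, num_embeds):
        """Reserve space, reclaiming oldest unreferenced entries only if needed."""
        reclaimable = sum(self.size[h] for h in self.freeable)
        if self.used - reclaimable + num_embeds > self.capacity:
            return False  # would not fit even after evicting everything freeable
        while self.used + num_embeds > self.capacity:
            victim, _ = self.freeable.popitem(last=False)
            self.used -= self.size.pop(victim)
            del self.cached[victim]
        self.cached[mm_hash] = {req_id}
        self.size[mm_hash] = num_embeds
        self.used += num_embeds
        return True

    def free_request(self, req_id, mm_hash):
        """Drop one reference; the entry stays cached until space is needed."""
        refs = self.cached.get(mm_hash)
        if refs is None:
            return
        refs.discard(req_id)
        if not refs:
            self.freeable[mm_hash] = None


cache = ToyEncoderCache(capacity=1000)
cache.allocate("req-A", "hash-cat", 576)
shared = cache.check_and_update("req-B", "hash-cat")  # same image, second request
cache.free_request("req-A", "hash-cat")
cache.free_request("req-B", "hash-cat")
evicting = cache.allocate("req-C", "hash-dog", 576)    # reclaims hash-cat
```

Note that freeing does not delete anything: an unreferenced entry lingers in `freeable` so a later request with the same image can still hit it, and it is evicted only when an allocation needs the room.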

Step 5: Encoder meets chunked prefill — the scheduling problem

Chunked prefill (Phase 3) slices a long prompt into budget-sized pieces. But an image embedding is produced by one indivisible encoder forward — you can't compute the first half of a ViT's patches this step and the rest next step. So the scheduler must reconcile two granularities, and _try_schedule_encoder_inputs (scheduler.py:1096) is the reconciliation. An encoder input is scheduled this step iff:

its placeholder range overlaps the token window being computed, [num_computed_tokens, num_computed_tokens + num_new_tokens);
it isn't already in the encoder cache;
the per-step encoder compute budget has room (encoders are compute-heavy; unbounded encoder work would blow up step time exactly like unbounded prefill would);
the encoder cache has space to hold the output.

If any check fails, the scheduler shrinks num_new_tokens to stop just before the unschedulable image: prefill the text up to the doorstep, then wait for the next step. Once an image is encoded and cached, a chunk boundary can land mid-placeholder freely, because later chunks read the cached embeddings. Lab-03 builds this exact logic, all-or-nothing encodes and all.
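The four checks and the truncate-at-the-doorstep rule can be sketched as one small function. This is a distilled toy, not the upstream scheduler: the `EncoderInput` type, the argument names, and the flat integer budgets are all hypothetical, and inputs are assumed to be sorted by offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderInput:
    mm_hash: str
    offset: int  # start of the placeholder range in the prompt
    length: int  # number of image tokens (encoder embeddings)


def try_schedule_encoder_inputs(inputs, num_computed_tokens, num_new_tokens,
                                cached_hashes, compute_budget, cache_space):
    """Return (encoder inputs to run this step, possibly reduced num_new_tokens)."""
    to_run = []
    window_end = num_computed_tokens + num_new_tokens
    for inp in inputs:
        start, end = inp.offset, inp.offset + inp.length
        if start >= window_end:
            break     # placeholder lies beyond this step's token window
        if end <= num_computed_tokens:
            continue  # placeholder already fully prefilled
        if inp.mm_hash in cached_hashes:
            continue  # already encoded, so the chunk may cut through it
        if inp.length > compute_budget or inp.length > cache_space:
            # All-or-nothing: stop the chunk just before this image.
            num_new_tokens = max(start - num_computed_tokens, 0)
            break
        compute_budget -= inp.length
        cache_space -= inp.length
        to_run.append(inp)
    return to_run, num_new_tokens


image = EncoderInput(mm_hash="hash-cat", offset=3, length=576)

# Budget allows the encode: the 256-token chunk may end mid-placeholder.
affordable = try_schedule_encoder_inputs([image], 0, 256, set(), 1024, 1024)

# Budget too small: the chunk shrinks to the 3 text tokens before the image.
too_costly = try_schedule_encoder_inputs([image], 0, 256, set(), 512, 1024)

# Next chunk, image already cached: nothing to encode, no truncation.
resumed = try_schedule_encoder_inputs([image], 256, 256, {"hash-cat"}, 0, 0)
```

The third call shows why the cache restores mid-placeholder freedom: with zero budget left, the chunk still proceeds because no new encoder work is required.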

Step 6: Prefix caching with pixels — hashing the image itself

Phase 3's prefix cache keys blocks by token IDs — but two different images expand to the same dummy token IDs! Sharing on token IDs alone would serve user B answers about user A's photo. Fix: MultiModalHasher (multimodal/hasher.py:50) content-hashes the actual image bytes, and that mm_hash is folded into the block hashes covering the placeholder range. Same prompt + same pixels → full prefix-cache hit; same prompt + different pixels → miss exactly at the image. (The same hash doubles as the encoder-cache key — one identity for both caches.)
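A minimal sketch of the idea, using SHA-256 from the standard library. The block size of 4, the `IMG` token id, the tuple layout for placeholders, and the chained-hash payload are illustrative assumptions, not the upstream hashing scheme.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size, for illustration only
IMG = 32000     # hypothetical image placeholder token id


def mm_hash(image_bytes):
    """Content hash of the raw image bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def block_hashes(token_ids, placeholders):
    """Chained hash per full block; placeholders is a list of (offset, length, mm_hash)."""
    hashes, parent = [], ""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        end = start + BLOCK_SIZE
        # Fold in the hash of every image whose placeholder range overlaps this block.
        extra = [h for off, length, h in placeholders
                 if off < end and off + length > start]
        payload = repr((parent, token_ids[start:end], extra)).encode()
        parent = hashlib.sha256(payload).hexdigest()
        hashes.append(parent)
    return hashes


# 5 text tokens, a 7-token image placeholder at offset 5, then 4 text tokens.
tokens = [1, 2, 3, 4, 5] + [IMG] * 7 + [6, 7, 8, 9]

cat = block_hashes(tokens, [(5, 7, mm_hash(b"cat pixels"))])
cat_again = block_hashes(tokens, [(5, 7, mm_hash(b"cat pixels"))])
dog = block_hashes(tokens, [(5, 7, mm_hash(b"dog pixels"))])

print(cat == cat_again)  # same prompt, same pixels
print(cat[0] == dog[0])  # text-only block before the image
print(cat[1] == dog[1])  # first block that overlaps the image
```

Because each block hash chains in its parent, every block after the first image block also differs between the two requests, even though its own tokens are plain text.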

The invariants to memorize

A VLM = encoder + projector + unchanged LLM; image embeddings overwrite placeholder positions in inputs_embeds. After the splice, the engine can't tell pixels from words.
PlaceholderRange is a contract: processor-side expansion count must equal encoder-side embedding count, exactly.
Image tokens are real tokens: they cost KV blocks, token budget, and attention time — dynamic-resolution models scale ∝ pixels (quadratic in side length).
The encoder cache is a second cache with its own budget, keyed by content hash, ref-counted per request, LRU-evicted when unreferenced.
Encoder runs are all-or-nothing; chunked prefill stops at the doorstep of an image it can't afford this step.
Prefix caching must mix the image hash into block hashes — token IDs alone are ambiguous for placeholder spans.

What you'll do

Read: 01-deep-dive.md — processor, placeholder machinery, encoder cache, scheduler hook, and LLaVA/Qwen2-VL as case studies, line-anchored.
Build: 02-mini-build.md — a fake-image pipeline for mini_vllm: placeholder expansion + toy encoder + content-hash cache.
Labs (see labs/README.md; recommended order 01 → 03 → 02):
- lab-01-image-token-expansion [CPU-OK] — pixels → patches → tokens → blocks: placeholder expansion, PlaceholderRange bookkeeping, and the capacity punchline (one image = 38 KV blocks).
- lab-03-encoder-scheduling [CPU-OK] — chunked prefill meets the vision tower: per-step encoder budget, all-or-nothing encodes, truncate-at-the-doorstep, and the cache that restores mid-placeholder freedom (V1's _try_schedule_encoder_inputs, distilled).
- lab-02-run-a-vlm [GPU-OPT] — Qwen2-VL on a real photo: the 1,421-token "one-line" prompt, the quadratic resize law, the encoder's TTFT spike. Captured output included.
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.