Phase 13 — The Hitchhiker's Guide to Multimodal Models
← Phase 12 · Course home · Phase 14 →
Contents
- Don't Panic
- Step 1: How a decoder-only LLM "sees" — the splice
- Step 2: Placeholders — the contract between processor and model
- Step 3: The cost — one image is a paragraph… or a chapter
- Step 4: The encoder cache — don't encode the same image twice
- Step 5: Encoder meets chunked prefill — the scheduling problem
- Step 6: Prefix caching with pixels — hashing the image itself
- The invariants to memorize
- What you'll do
Don't Panic
A vision-language model is not a new kind of engine. It's the same LLM you've been serving for twelve phases, plus a vision encoder bolted on the front. The image is encoded into a sequence of embedding vectors, and those vectors are spliced into the text embedding stream at reserved placeholder positions. From the first transformer layer onward, the model cannot tell which positions were words and which were pixels: KV cache, paged attention, continuous batching, all of it works unchanged. The new engineering is all around the splice: expanding placeholders, scheduling the encoder, and caching its output.
    "What is in <image> ?"                pixels ──► vision encoder ──► [E1 E2 … E576]
        │ tokenize + expand                                                   │ projector
        ▼                                                                     │ (to LLM dim)
    [What][is][in][IMG][IMG]…[IMG][?]                                         │
        │ embed text                                                          ▼
    [w1][w2][w3][▢][▢]…[▢][w4] ──── overwrite ▢ positions ────► [w1 w2 w3 E1 … E576 w4]
                                                                              │
                                                                              ▼
                                                           ordinary LLM forward (Phases 0–11)
Step 1: How a decoder-only LLM "sees" — the splice
Three parts (LLaVA is the canonical layout, llava.py):
- Vision encoder (a ViT): image → grid of patches (e.g. 24×24 = 576) → one embedding per patch.
- Projector (an MLP): maps encoder embeddings into the LLM's hidden dimension — two matrices is all it takes to make pixels speak the language model's language.
- The LLM: receives inputs_embeds where placeholder positions have been overwritten with projected image embeddings. In vLLM the overwrite is literally one indexed assignment: inputs_embeds[is_multimodal] = mm_embeds (models/utils.py:456, _merge_multimodal_embeddings).
That's the whole trick. Cross-attention encoder-decoder models (Whisper-style) are the exception, not the rule, in today's VLM zoo — the spliced decoder-only design won.
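The splice can be sketched without any tensor library. The following is a minimal pure-Python illustration, assuming embeddings are plain lists of floats; the function borrows the name of vLLM's helper for readability but is a toy stand-in, not the upstream code, and the example values are made up.

```python
def merge_multimodal_embeddings(inputs_embeds, is_multimodal, mm_embeds):
    """Return a copy of inputs_embeds with placeholder rows overwritten, in order.

    Toy stand-in for the indexed assignment
    inputs_embeds[is_multimodal] = mm_embeds.
    """
    n_slots = sum(is_multimodal)
    if n_slots != len(mm_embeds):
        raise ValueError(
            f"Attempted to assign {len(mm_embeds)} multimodal tokens "
            f"to {n_slots} placeholders"
        )
    merged = list(inputs_embeds)  # shallow copy; the input list is left intact
    mm_iter = iter(mm_embeds)
    for pos, is_mm in enumerate(is_multimodal):
        if is_mm:
            merged[pos] = next(mm_iter)
    return merged


# Hypothetical 6-position prompt with hidden size 2 and a 3-embedding "image".
text_embeds = [[0.1, 0.1], [0.2, 0.2], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.3, 0.3]]
mask = [False, False, True, True, True, False]
image_embeds = [[9.0, 9.1], [9.2, 9.3], [9.4, 9.5]]

merged = merge_multimodal_embeddings(text_embeds, mask, image_embeds)
```

After the call, positions 2 to 4 hold the image embeddings and every other position is untouched, which is all the later transformer layers ever see.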
Step 2: Placeholders — the contract between processor and model
Before the model runs, the multimodal processor rewrites the prompt: the single
<image> marker becomes N repeated image tokens, and a PlaceholderRange(offset, length)
(multimodal/inputs.py:119) records exactly where. This bookkeeping is the contract:
- The tokenizer side promises: positions [offset, offset+length) are dummies awaiting embeddings. Some models interleave real structure such as row separators, so is_embed can mask which positions inside the range are actually image slots.
- The model side promises: the encoder will produce exactly length (or is_embed.sum()) embeddings.
Get the count wrong and you get the classic VLM crash: upstream raises "Attempted to assign X multimodal tokens to Y placeholders" (utils.py:484). Lab-01 makes you maintain this invariant by hand.
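A rough sketch of the bookkeeping, under stated assumptions: the PlaceholderRange below is a two-field imitation of the upstream class (no is_embed mask), IMAGE_TOKEN_ID is a made-up dummy ID, and expand_placeholders and check_contract are hypothetical helper names used only for illustration.

```python
from dataclasses import dataclass

IMAGE_MARKER = "<image>"   # marker as it appears in the prompt
IMAGE_TOKEN_ID = 32000     # hypothetical dummy token ID, not a real vocabulary entry


@dataclass(frozen=True)
class PlaceholderRange:
    """Simplified imitation: where one image's dummy tokens live in the prompt."""
    offset: int
    length: int


def expand_placeholders(tokens, tokens_per_image):
    """Replace each marker with N dummy IDs and record the range it occupies."""
    expanded, ranges = [], []
    sizes = iter(tokens_per_image)
    for tok in tokens:
        if tok == IMAGE_MARKER:
            n = next(sizes)
            ranges.append(PlaceholderRange(offset=len(expanded), length=n))
            expanded.extend([IMAGE_TOKEN_ID] * n)
        else:
            expanded.append(tok)
    return expanded, ranges


def check_contract(ranges, num_embeds_per_image):
    """Model-side check: each image must yield exactly `length` embeddings."""
    for r, n in zip(ranges, num_embeds_per_image):
        if n != r.length:
            raise ValueError(
                f"Attempted to assign {n} multimodal tokens to {r.length} placeholders"
            )


prompt = [101, 102, 103, IMAGE_MARKER, 104]   # made-up text token IDs
expanded, ranges = expand_placeholders(prompt, tokens_per_image=[4])
check_contract(ranges, [4])                    # passes: 4 embeddings for 4 slots
```

Here the single marker at index 3 becomes four dummy IDs, and the recorded range (offset 3, length 4) is what both sides of the contract later agree on.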
Step 3: The cost — one image is a paragraph… or a chapter
Image tokens are real tokens downstream: they occupy KV-cache blocks (Phase 2), consume scheduler token budget (Phase 3), and lengthen every later attention read. Typical scales:
| Model | One image becomes |
|---|---|
| LLaVA-1.5 (fixed 336²) | 576 tokens — always |
| Qwen2-VL (dynamic resolution) | ~4 → ~16k tokens, ∝ pixel count |
Dynamic resolution is the dangerous one: token count grows quadratically with image
side length (lab-02 measures the law on real Qwen2-VL). A 4-image request can dwarf its own
text. This is why MM models need their own memory profiling (compute_mm_encoder_budget,
encoder_cache_manager.py:269) — the worst-case image inflates both KV and the encoder
cache, and the engine must reserve for it at startup.
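The arithmetic behind the table can be sketched as follows. The LLaVA-1.5 figure follows from the 24×24 patch grid above, assuming a 14-pixel patch (336 / 14 = 24). For the dynamic-resolution case, the sketch assumes each token covers a 28×28-pixel area and ignores the real processor's resizing and min/max pixel clamps, so treat its numbers as an illustration of the scaling law rather than exact Qwen2-VL counts. Both function names are hypothetical.

```python
import math


def fixed_res_image_tokens(image_side=336, patch=14):
    """Fixed-resolution ViT: one token per patch on a square grid."""
    per_side = image_side // patch
    return per_side * per_side


def dynamic_res_image_tokens(height, width, token_side=28):
    """Dynamic resolution (assumed): one token per token_side x token_side area."""
    return math.ceil(height / token_side) * math.ceil(width / token_side)


llava_tokens = fixed_res_image_tokens()            # 24 * 24 patches
small = dynamic_res_image_tokens(448, 448)         # 16 * 16 tokens
large = dynamic_res_image_tokens(896, 896)         # 32 * 32 tokens
growth = large / small                             # side length doubled
```

Doubling the side length from 448 to 896 takes the count from 256 to 1024 tokens, a factor of 4, which is the quadratic growth the paragraph warns about.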
Step 4: The encoder cache — don't encode the same image twice
Encoder output is expensive (a full ViT forward) and reusable — the same image appears
across chunked-prefill steps of one request, across retries, across users pasting the same
screenshot. vLLM keeps finished encoder outputs in an EncoderCacheManager
(v1/core/encoder_cache_manager.py:17), a second cache next to the KV cache with its own
currency: it's measured in encoder embeddings, not blocks.
Design rhymes with Phase 2's block pool — learn the mapping:
| BlockPool (Phase 2) | EncoderCacheManager (here) |
|---|---|
| block hash | mm_hash (content hash of the image) |
| ref_cnt | cached[mm_hash] = set of referencing request IDs |
| free queue (LRU eviction) | freeable OrderedDict (evict oldest unreferenced) |
| allocate / free | allocate / free_encoder_input, reclaim at allocation time |
Cross-request sharing falls out of content hashing: two requests with the same image hit
the same mm_hash (check_and_update_cache, :91).
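The mapping in the table can be made concrete with a toy cache. This is a simplified sketch, not the upstream EncoderCacheManager: the class name, method names, and the `sizes` dict are assumptions for illustration, and capacity is counted in encoder embeddings as described above.

```python
from collections import OrderedDict


class ToyEncoderCache:
    """Content-hash keyed cache with per-request references and lazy eviction."""

    def __init__(self, capacity):
        self.num_free = capacity        # budget, measured in encoder embeddings
        self.cached = {}                # mm_hash -> set of referencing request IDs
        self.sizes = {}                 # mm_hash -> number of embeddings held
        self.freeable = OrderedDict()   # unreferenced entries, oldest first

    def check_and_update(self, req_id, mm_hash):
        """Return True on a hit and register req_id as a new reference."""
        if mm_hash not in self.cached:
            return False
        self.freeable.pop(mm_hash, None)   # referenced again, so not evictable
        self.cached[mm_hash].add(req_id)
        return True

    def allocate(self, req_id, mm_hash, num_embeds):
        """Reserve space, reclaiming the oldest unreferenced entries if needed."""
        while self.num_free < num_embeds and self.freeable:
            victim, size = self.freeable.popitem(last=False)
            del self.cached[victim]
            del self.sizes[victim]
            self.num_free += size
        if self.num_free < num_embeds:
            raise RuntimeError("encoder cache full")
        self.cached[mm_hash] = {req_id}
        self.sizes[mm_hash] = num_embeds
        self.num_free -= num_embeds

    def free(self, req_id, mm_hash):
        """Drop one reference; the entry stays cached until space is needed."""
        refs = self.cached.get(mm_hash)
        if refs is None:
            return
        refs.discard(req_id)
        if not refs:
            self.freeable[mm_hash] = self.sizes[mm_hash]


cache = ToyEncoderCache(capacity=10)
cache.allocate("req-A", "hash-1", 6)
hit_for_b = cache.check_and_update("req-B", "hash-1")   # same image, second user
cache.free("req-A", "hash-1")
cache.free("req-B", "hash-1")                           # now unreferenced
cache.allocate("req-C", "hash-2", 6)                    # forces eviction of hash-1
```

Freeing only marks an entry as evictable. The actual reclaim happens inside allocate, so an unreferenced image can still serve a later hit if nothing has needed its space yet.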
Step 5: Encoder meets chunked prefill — the scheduling problem
Chunked prefill (Phase 3) slices a long prompt into budget-sized pieces. But an image
embedding is produced by one indivisible encoder forward — you can't compute the first
half of a ViT's patches this step and the rest next step. So the scheduler must reconcile
two granularities, and _try_schedule_encoder_inputs (scheduler.py:1096) is the
reconciliation. An encoder input is scheduled this step iff:
- its placeholder range overlaps the token window being computed, [num_computed_tokens, num_computed_tokens + num_new_tokens);
- it isn't already in the encoder cache;
- the per-step encoder compute budget has room (encoders are compute-heavy; unbounded encoder work would blow up step time exactly like unbounded prefill would);
- the encoder cache has space to hold the output.
If any check fails, the scheduler shrinks num_new_tokens to stop just before the
unschedulable image: prefill the text up to the doorstep, then wait for the next step. Once
the image is encoded and cached, a chunk boundary can land mid-placeholder freely, because later chunks read
the cached embeddings. Lab-03 builds this exact logic, all-or-nothing encodes and all.
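The four checks and the truncate-at-the-doorstep rule can be distilled into a short function. This is a sketch under assumptions, not the upstream scheduler: the EncoderInput dataclass and the flat argument list are invented for illustration, inputs are assumed sorted by offset, and budgets are counted in embeddings.

```python
from dataclasses import dataclass


@dataclass
class EncoderInput:
    mm_hash: str
    offset: int   # first placeholder position in the prompt
    length: int   # number of placeholder positions (= embeddings produced)


def try_schedule_encoder_inputs(mm_inputs, num_computed_tokens, num_new_tokens,
                                encoder_budget, cached_hashes, cache_free):
    """Return (inputs to encode this step, possibly shrunk num_new_tokens)."""
    to_encode, scheduled = [], set()
    for inp in mm_inputs:
        start, end = inp.offset, inp.offset + inp.length
        window_end = num_computed_tokens + num_new_tokens
        if start >= window_end:
            break                      # image begins after this chunk
        if end <= num_computed_tokens:
            continue                   # image already fully consumed
        if inp.mm_hash in cached_hashes or inp.mm_hash in scheduled:
            continue                   # embeddings available, nothing to encode
        if inp.length > encoder_budget or inp.length > cache_free:
            # All-or-nothing: stop the chunk just before the image.
            num_new_tokens = max(start - num_computed_tokens, 0)
            break
        encoder_budget -= inp.length
        cache_free -= inp.length
        scheduled.add(inp.mm_hash)
        to_encode.append(inp)
    return to_encode, num_new_tokens


img = EncoderInput("img", offset=10, length=576)

# Encoder budget too small: the chunk shrinks to the 10 text tokens before the image.
blocked = try_schedule_encoder_inputs([img], 0, 256, 512, set(), 1000)

# Enough budget: encode the whole image now, keep the full 256-token chunk.
allowed = try_schedule_encoder_inputs([img], 0, 256, 1024, set(), 1000)

# Next step: the chunk starts mid-placeholder, but the image is cached.
resumed = try_schedule_encoder_inputs([img], 256, 256, 0, {"img"}, 0)
```

The third call shows why caching restores mid-placeholder chunk boundaries: with zero encoder budget and zero free cache space the chunk still proceeds, because nothing needs encoding.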
Step 6: Prefix caching with pixels — hashing the image itself
Phase 3's prefix cache keys blocks by token IDs — but two different images expand to the
same dummy token IDs! Sharing on token IDs alone would serve user B answers about user
A's photo. Fix: MultiModalHasher (multimodal/hasher.py:50) content-hashes the actual
image bytes, and that mm_hash is folded into the block hashes covering the placeholder
range. Same prompt + same pixels → full prefix-cache hit; same prompt + different pixels →
miss exactly at the image. (The same hash doubles as the encoder-cache key — one identity
for both caches.)
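To see the effect of folding the image hash into block hashes, here is a minimal sketch using hashlib. The chained hashing scheme, the function names, and the tuple layout for placeholders are assumptions for illustration; the real MultiModalHasher and block-hash format differ in detail.

```python
import hashlib

IMG = 32000  # hypothetical dummy image token ID


def content_hash(image_bytes):
    """Stand-in for an mm_hash: a digest of the raw image bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def block_hashes(token_ids, block_size, placeholders):
    """Chain-hash full blocks; mix in the hash of any image a block overlaps.

    placeholders: list of (offset, length, mm_hash) tuples.
    """
    hashes, parent = [], ""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        end = start + block_size
        extra = tuple(h for (off, length, h) in placeholders
                      if off < end and off + length > start)
        payload = repr((parent, tuple(token_ids[start:end]), extra)).encode()
        parent = hashlib.sha256(payload).hexdigest()
        hashes.append(parent)
    return hashes


# Same prompt shape for everyone: 4 text tokens, 8 image slots, 4 text tokens.
tokens = [1, 2, 3, 4] + [IMG] * 8 + [5, 6, 7, 8]

photo_a = content_hash(b"pixels of photo A")
photo_b = content_hash(b"pixels of photo B")

hashes_a = block_hashes(tokens, 4, [(4, 8, photo_a)])
hashes_a_again = block_hashes(tokens, 4, [(4, 8, photo_a)])
hashes_b = block_hashes(tokens, 4, [(4, 8, photo_b)])

# Index of the first block whose hash differs between the two photos.
first_miss = next(i for i, (x, y) in enumerate(zip(hashes_a, hashes_b)) if x != y)
```

With identical token IDs, the text-only first block still matches across the two photos, and the hashes diverge from the first block that overlaps the placeholder range. Because each block hash chains on its parent, every later block misses too.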
The invariants to memorize
- A VLM = encoder + projector + unchanged LLM; image embeddings overwrite placeholder positions in inputs_embeds. After the splice, the engine can't tell pixels from words.
- PlaceholderRange is a contract: processor-side expansion count must equal encoder-side embedding count, exactly.
- Image tokens are real tokens: they cost KV blocks, token budget, and attention time. Dynamic-resolution models scale ∝ pixels (quadratic in side length).
- The encoder cache is a second cache with its own budget, keyed by content hash, ref-counted per request, LRU-evicted when unreferenced.
- Encoder runs are all-or-nothing; chunked prefill stops at the doorstep of an image it can't afford this step.
- Prefix caching must mix the image hash into block hashes — token IDs alone are ambiguous for placeholder spans.
What you'll do
- Read: 01-deep-dive.md — processor, placeholder machinery, encoder cache, scheduler hook, and LLaVA/Qwen2-VL as case studies, line-anchored.
- Build: 02-mini-build.md, a fake-image pipeline for mini_vllm: placeholder expansion + toy encoder + content-hash cache.
- Labs (see labs/README.md; recommended order 01 → 03 → 02):
  - lab-01-image-token-expansion [CPU-OK]: pixels → patches → tokens → blocks. Placeholder expansion, PlaceholderRange bookkeeping, and the capacity punchline (one image = 38 KV blocks).
  - lab-03-encoder-scheduling [CPU-OK]: chunked prefill meets the vision tower. Per-step encoder budget, all-or-nothing encodes, truncate-at-the-doorstep, and the cache that restores mid-placeholder freedom (V1's _try_schedule_encoder_inputs, distilled).
  - lab-02-run-a-vlm [GPU-OPT]: Qwen2-VL on a real photo. The 1,421-token "one-line" prompt, the quadratic resize law, the encoder's TTFT spike. Captured output included.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.