Phase 13 — Exercises: Multimodal Models

Warm-up (explain)
Core (trace the code)
Build (your lab)
Design (staff-level)
Self-grading

Warm-up (explain)

1. In one breath: how does a decoder-only LLM "see" an image, and which engine components (Phases 0–11) need zero changes for it?
2. Why does vLLM keep an encoder cache separate from the KV cache? Name two ways their currencies and lifetimes differ.
3. Why can't the prefix cache key the blocks that cover image placeholders by token IDs alone?

Solution sketches

1. Vision encoder → projector → embeddings that overwrite the placeholder positions in inputs_embeds; from layer 1 on, the image is indistinguishable from text. Unchanged: paged KV, attention backends, sampler, batching, i.e. everything past the embedding layer.
2. KV cache: per-(request-prefix) layered K/V in fixed blocks; grows every step; freed at request end. Encoder cache: per-content (mm_hash) embeddings, measured in embedding slots rather than blocks, written once per image, shared across requests, LRU-evicted when unreferenced. Different key (position-prefix vs content), different unit, different lifecycle.
3. Every image expands to the same repeated dummy token ID, so token-ID hashing would alias different pictures and serve one user's image context to another. The image's content hash (MultiModalHasher) must be folded into those block hashes.
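
The aliasing problem in the last sketch can be made concrete with a toy chained block hash. This is a minimal sketch under stated assumptions: `block_hashes`, `PLACEHOLDER_ID`, the tiny block size, and the literal hash strings are all hypothetical and are not vLLM's hashing code. In the real engine the content hash would come from MultiModalHasher; here it is just a string.

```python
import hashlib

PLACEHOLDER_ID = 32000  # hypothetical dummy token ID used for every image
BLOCK_SIZE = 4          # tiny block size, chosen only for readability


def block_hashes(token_ids, mm_ranges=(), block_size=BLOCK_SIZE):
    """Return one chained hash per full block.

    mm_ranges is an iterable of (start, length, mm_hash) placeholder spans.
    A block that overlaps a span mixes that span's content hash into its key.
    """
    hashes = []
    parent = b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        end = start + block_size
        h = hashlib.sha256()
        h.update(parent)  # chain on the previous block, like a prefix hash
        h.update(repr(token_ids[start:end]).encode())
        for mm_start, mm_len, mm_hash in mm_ranges:
            if mm_start < end and start < mm_start + mm_len:
                h.update(mm_hash.encode())  # fold in the image content hash
        parent = h.digest()
        hashes.append(parent.hex())
    return hashes


# Same prompt shape for both requests: 4 text tokens, then 8 placeholders.
prompt = [0, 1, 2, 3] + [PLACEHOLDER_ID] * 8
cat = [(4, 8, "hash-of-cat.jpg")]
dog = [(4, 8, "hash-of-dog.jpg")]

token_only_cat = block_hashes(prompt)   # ignores the image entirely
token_only_dog = block_hashes(prompt)
with_cat = block_hashes(prompt, cat)
with_dog = block_hashes(prompt, dog)
```

With token IDs alone the two requests produce identical keys for every block, which is exactly the cross-user leak. With the content hash folded in, the shared text block still matches while the placeholder blocks diverge.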

Core (trace the code)

4. _get_prompt_updates (llava.py:264): where does the expansion count come from, and why does Pixtral (:390) need PromptUpdateDetails.select_token_id / is_embed?
5. Walk EncoderCacheManager.check_and_update_cache (:91) for a request whose image is cached but currently unreferenced. Which structures change, and what is the Phase-2 analogue of this transition?
6. _try_schedule_encoder_inputs (scheduler.py:1096): an image's placeholder starts at token 5000, the request has computed 2000 tokens, and this step's chunk is 2048. What happens to the image, and to num_new_tokens?
7. The scheduler's manager tracks hashes but the runner holds tensors. Trace how an eviction decided by the scheduler reaches the worker (get_freed_mm_hashes → scheduler.py:901 → runner).

Solution sketches

4. From ProcessingInfo.get_num_image_tokens → the vision encoder's patch math (model config, image size). Pixtral interleaves [image_break_id] after each patch row, so not every position in the range receives an embedding; is_embed marks which do.
5. The hash is popped from freeable (it was an eviction candidate), its embed count is subtracted from num_freeable_slots, and the request ID joins cached[mm_hash]. Phase-2 analogue: BlockPool.touch, which resurrects a cached block from the free queue by bumping ref_cnt 0→1.
6. The chunk window [2000, 4048) doesn't reach offset 5000, so the image is not scheduled and num_new_tokens is untouched (truncation only happens when the window overlaps an image that fails budget/cache checks). Later steps advance the window; the first step whose window overlaps 5000 must schedule the image or truncate at it.
7. The manager appends evicted hashes to freed; each step, get_freed_mm_hashes() drains the list into SchedulerOutput.free_encoder_mm_hashes; workers delete those keys from their encoder_cache dict. The scheduler owns accounting, workers own memory: the same split as KV blocks.
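
The accounting transitions in these sketches can be modelled in a few lines. The class below is a simplified stand-in, not the upstream EncoderCacheManager: it reuses the field names mentioned above (cached, freeable, num_freeable_slots, freed, get_freed_mm_hashes), while `MiniEncoderCache`, `allocate`, `free`, and their signatures are assumptions made for illustration.

```python
from collections import OrderedDict


class MiniEncoderCache:
    """Toy scheduler-side accounting: tracks hashes and slots, never tensors."""

    def __init__(self, cache_size):
        self.num_free_slots = cache_size      # slots holding nothing at all
        self.num_freeable_slots = cache_size  # free slots + reclaimable slots
        self.cached = {}               # mm_hash -> set of request IDs using it
        self.freeable = OrderedDict()  # unreferenced mm_hash -> embed count
        self.freed = []                # evicted hashes not yet sent to workers

    def check_and_update_cache(self, request_id, mm_hash):
        """Return True on a hit and register request_id as a user."""
        if mm_hash not in self.cached:
            return False
        if not self.cached[mm_hash]:
            # Cached but unreferenced: pull it back out of the eviction queue.
            num_embeds = self.freeable.pop(mm_hash)
            self.num_freeable_slots -= num_embeds
        self.cached[mm_hash].add(request_id)
        return True

    def allocate(self, request_id, mm_hash, num_embeds):
        """Reserve slots for a new entry, evicting oldest unreferenced ones."""
        if num_embeds > self.num_freeable_slots:
            raise MemoryError("not enough reclaimable encoder cache space")
        while num_embeds > self.num_free_slots:
            victim, size = self.freeable.popitem(last=False)
            del self.cached[victim]
            self.freed.append(victim)  # workers must drop this tensor later
            self.num_free_slots += size
        self.num_free_slots -= num_embeds
        self.num_freeable_slots -= num_embeds
        self.cached[mm_hash] = {request_id}

    def free(self, request_id, mm_hash, num_embeds):
        """Drop one reference; the entry stays cached until it is evicted."""
        users = self.cached[mm_hash]
        users.discard(request_id)
        if not users:
            self.freeable[mm_hash] = num_embeds
            self.num_freeable_slots += num_embeds

    def get_freed_mm_hashes(self):
        """Drain the eviction list, as done once per scheduling step."""
        freed, self.freed = self.freed, []
        return freed
```

A typical walk: allocate an image for one request, free it, then have a second request hit it through check_and_update_cache. The hit removes the hash from freeable, so a later allocation can no longer evict it until the second request finishes too.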

Build (your lab)

8. In lab-01, compute: at block_size 16, how many KV blocks does one LLaVA image (576 tokens) cost, and what fraction of a 7B model's typical 8 GiB KV budget is 50 cached image-bearing prompts of 1000 tokens each?
9. Extend your mini-build's cache with a stats() method (hits, misses, evictions, occupancy) and write a test that drives hit-rate from 0% to >80% with a Zipfian image distribution. Why is Zipfian the realistic assumption?
10. In lab-03, construct a request where the encoder budget forces the image to wait one step but the cache-space check would have passed. Verify text progress continues. Then flip it: cache full, budget free. What's the user-visible difference?

Solution sketches

8. 576/16 = 36 blocks for the image alone (38 with prompt rounding in the lab's setup). Each 1000-token prompt rounds up to 63 blocks, so 50 prompts cost 50 × 63 = 3150 blocks, ≈ 25% of an 8 GiB budget at ~16 KiB/block-token-layer scale. Images eat KV budgets fast; exact numbers depend on the model, the point is the order of magnitude.
9. Real traffic repeats content (logos, screenshots, retried requests, multi-turn with the same image) with a long tail of singletons, and a Zipf distribution models that. Hits come from the head; the tail drives eviction churn.
10. Both delay the image, not the text (truncate-at-doorstep). Budget-limited: resolves next step deterministically. Cache-limited: resolves only when another request frees embeddings, a potentially unbounded wait, which is why worst-case sizing at startup (compute_mm_encoder_budget) must guarantee a single max image always fits.
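
The truncate-at-doorstep behaviour can be sketched as a small pure function. This is a simplified model of the decision described above, not the upstream _try_schedule_encoder_inputs: the `Image` dataclass, the function signature, and the single `cache_free_slots` counter are assumptions, and images that are already cached or already scheduled are ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class Image:
    offset: int  # first placeholder position in the prompt
    length: int  # number of placeholder tokens (one embedding each)


def try_schedule_encoder_inputs(images, num_computed, num_new_tokens,
                                encoder_budget, cache_free_slots):
    """Return (indices of images to encode this step, adjusted num_new_tokens).

    images must be sorted by offset. The token window for this step is
    [num_computed, num_computed + num_new_tokens).
    """
    scheduled = []
    for i, img in enumerate(images):
        window_end = num_computed + num_new_tokens
        if img.offset >= window_end:
            break  # window does not reach this image yet
        if img.offset + img.length <= num_computed:
            continue  # placeholder range already fully computed
        if img.length > encoder_budget or img.length > cache_free_slots:
            # Cannot encode now: stop the text chunk right before the image.
            num_new_tokens = max(img.offset - num_computed, 0)
            break
        scheduled.append(i)
        encoder_budget -= img.length
        cache_free_slots -= img.length
    return scheduled, num_new_tokens


images = [Image(offset=5000, length=576)]

# Window [2000, 4048) stops short of 5000: nothing to do, chunk untouched.
early = try_schedule_encoder_inputs(images, 2000, 2048, 8192, 8192)

# Window [4096, 6144) overlaps the image but the budget is exhausted:
# text advances only up to the image's first placeholder.
budget_limited = try_schedule_encoder_inputs(images, 4096, 2048, 0, 8192)

# Same window, budget free but cache full: identical truncation this step.
cache_limited = try_schedule_encoder_inputs(images, 4096, 2048, 8192, 0)

# Both checks pass: the image is encoded and the full chunk runs.
ok = try_schedule_encoder_inputs(images, 4096, 2048, 8192, 8192)
```

Within one step the budget-limited and cache-limited cases look the same. The difference shows up over time: the budget resets every step, whereas cache space returns only when some other request releases its embeddings.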

Design (staff-level)

11. Your fleet serves Qwen2-VL and users upload phone photos (12 MP). TTFT p99 is 4× worse than the text-only fleet. Walk the path pixels take and name the three biggest contributors and a mitigation for each.
12. Design multi-tenant fairness for the encoder cache: tenant A uploads thousands of unique images (0% reuse), tenant B reuses a product catalog (90% reuse). What goes wrong with global LRU and what do you change?
13. Should encoder outputs be prefix-cacheable across engine restarts (disk/remote)? Cost out the trade: embedding sizes vs re-encode time, and the consistency hazard the cache key must absorb.
14. Video: 1 fps × 60 s × ~hundreds of tokens/frame. Which Phase-13 mechanisms break first, and what does that tell you about why encode-disaggregation (Phase 15) exists?

Solution sketches

11. (a) Preprocessing/resize on CPU in the API process: move to async/parallel workers and downscale at the edge (Qwen2-VL token count ∝ pixels; cap max_pixels). (b) The ViT forward itself rides the first overlapping step: tune the encoder budget, batch encoder work, or disaggregate encode (Phase 15). (c) Token inflation: 12 MP → tens of thousands of LLM tokens of prefill; enforce resolution limits server-side. Chunked prefill spreads the work, but TTFT still pays for it.
12. Global LRU lets A's unique-image churn evict B's hot catalog (cache pollution by zero-reuse traffic). Fixes: per-tenant quotas/partitions, an admission filter (only cache on second sight, tracked with a tiny bloom/ghost list), or weighted eviction favoring entries with reuse history.
13. An embedding tensor for a 576-token image at d=4096 fp16 is 576 × 4096 × 2 bytes ≈ 4.7 MB. That is often larger than the JPEG itself, and under high load fetching it can take about as long as re-encoding, so a remote fetch can lose to recompute. Worth it only for very hot content. The key must absorb model identity + weights version + preprocessor config (resize policy!); upstream's reset() on weight updates is the single-process version of that hazard.
14. Encoder cache capacity (a minute of video ≈ tens of thousands of embeddings) and the per-step encoder budget (one step can't afford a frame burst) break first; KV inflation follows. When encode work rivals decode work, sharing one GPU starves both. That is precisely the case for a separate encode fleet with its own scaling (Phase 15's encode disaggregation, EPD).
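
The "only cache on second sight" admission filter from the fairness sketch can be illustrated with an LRU plus a payload-free ghost list. Everything here is hypothetical (the class `SecondSightLRU`, its parameters, the toy traffic in `catalog_hits`); it shows the idea of admission control and is not a proposal for how vLLM's encoder cache is or should be structured.

```python
from collections import OrderedDict


class SecondSightLRU:
    """LRU cache that can refuse to store a key until its second appearance."""

    def __init__(self, capacity, ghost_capacity, require_second_sight=True):
        self.capacity = capacity
        self.ghost_capacity = ghost_capacity
        self.require_second_sight = require_second_sight
        self.entries = OrderedDict()  # key -> value, least recently used first
        self.ghosts = OrderedDict()   # keys seen once; no payload is kept

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)
            return self.entries[key]
        return None

    def put(self, key, value):
        """Return True if the value was stored, False if only remembered."""
        if key in self.entries:
            self.entries[key] = value
            self.entries.move_to_end(key)
            return True
        if self.require_second_sight and key not in self.ghosts:
            self.ghosts[key] = None  # first sight: remember the key only
            if len(self.ghosts) > self.ghost_capacity:
                self.ghosts.popitem(last=False)
            return False
        self.ghosts.pop(key, None)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return True


def catalog_hits(require_second_sight, rounds=4):
    """Count tenant B's hits while tenant A streams unique images."""
    cache = SecondSightLRU(capacity=3, ghost_capacity=8,
                           require_second_sight=require_second_sight)
    hits = 0
    for r in range(rounds):
        for key in ("b0", "b1", "b2"):  # tenant B: small hot catalog
            if cache.get(key) is not None:
                hits += 1
            else:
                cache.put(key, "emb:" + key)
        for j in range(5):  # tenant A: never repeats an image
            key = f"a-{r}-{j}"
            if cache.get(key) is None:
                cache.put(key, "emb:" + key)
    return hits


plain_lru_hits = catalog_hits(require_second_sight=False)
filtered_hits = catalog_hits(require_second_sight=True)
```

In this toy run plain LRU gives tenant B no hits at all, because five unique uploads flush a three-entry cache between every catalog pass. With the filter, B pays one extra miss per image and then hits on every later pass. The ghost list has to be large enough to remember B's keys across A's churn, so its size becomes a tuning knob of its own.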

Self-grading

4–7 and 11–14 are interview-grade. Could you whiteboard the splice (processor → expand → encode → overwrite) and both caches' keys from memory? If not, re-read 01-deep-dive.md §3–§5.
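
As a memory aid for the splice, here is a minimal sketch of the expand → encode → overwrite steps using plain Python lists in place of tensors. All names (`IMAGE_TOKEN_ID`, `expand_placeholders`, `embed_text`, `encode_image`, `splice`) and the fake embedding values are assumptions for illustration; real models do this on tensors inside the model's embedding merge step.

```python
IMAGE_TOKEN_ID = 32000  # hypothetical placeholder token ID
HIDDEN = 4              # toy hidden size


def expand_placeholders(token_ids, num_image_tokens):
    """Processor step: one placeholder becomes num_image_tokens placeholders."""
    out = []
    for t in token_ids:
        if t == IMAGE_TOKEN_ID:
            out.extend([IMAGE_TOKEN_ID] * num_image_tokens)
        else:
            out.append(t)
    return out


def embed_text(token_ids):
    """Stand-in embedding table: token ID t maps to a vector filled with t."""
    return [[float(t)] * HIDDEN for t in token_ids]


def encode_image(num_image_tokens):
    """Stand-in for vision encoder + projector: one vector per image token."""
    return [[-1.0 - i] * HIDDEN for i in range(num_image_tokens)]


def splice(token_ids, image_embeds):
    """Overwrite the placeholder rows of inputs_embeds with image embeddings."""
    inputs_embeds = embed_text(token_ids)
    positions = [i for i, t in enumerate(token_ids) if t == IMAGE_TOKEN_ID]
    if len(positions) != len(image_embeds):
        raise ValueError("placeholder count must match embedding count")
    for pos, emb in zip(positions, image_embeds):
        inputs_embeds[pos] = emb
    return inputs_embeds


prompt = [1, 15, IMAGE_TOKEN_ID, 27]
expanded = expand_placeholders(prompt, num_image_tokens=3)
inputs_embeds = splice(expanded, encode_image(3))
```

After the splice, nothing downstream needs to know which rows came from pixels: the sequence has one vector per position, and every later layer treats it like any other prompt.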

vLLM Mastery — From Zero to Maintainer
