Phase 13 — Exercises: Multimodal Models
Warm-up (explain)
- In one breath: how does a decoder-only LLM "see" an image, and which engine components (Phases 0–11) need zero changes for it?
- Why does vLLM keep an encoder cache separate from the KV cache? Name two ways their currencies and lifetimes differ.
- Why can't the prefix cache key placeholder-covering blocks by token IDs alone?
Solution sketches
- Vision encoder → projector → embeddings overwrite placeholder positions in inputs_embeds; from layer 1 on it's text-indistinguishable. Unchanged: paged KV, attention backends, sampler, batching; everything past the embedding layer.
- KV cache: per-(request-prefix) layered K/V in fixed blocks, grows every step, freed at request end. Encoder cache: per-content (mm_hash) embeddings, measured in embedding slots not blocks, written once per image, shared across requests, LRU-evicted when unreferenced. Different key (position-prefix vs content), different unit, different lifecycle.
- Every image expands to the same repeated dummy token ID, so token-ID hashing would alias different pictures and serve one user's image context to another. The image's content hash (MultiModalHasher) must be folded into those block hashes.
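The aliasing argument in the last sketch can be made concrete with a toy chained block hash. This is a minimal sketch, not vLLM's hashing code: BLOCK_SIZE, PLACEHOLDER_ID, block_hash, and hash_prompt are hypothetical names, and the image hash is passed in as a plain string rather than computed by MultiModalHasher.

```python
import hashlib
from typing import List, Optional, Sequence

BLOCK_SIZE = 4          # tiny block size so the example stays readable (assumption)
PLACEHOLDER_ID = 32000  # hypothetical dummy token ID that every image expands to


def block_hash(parent: Optional[str], token_ids: Sequence[int],
               extra_keys: Sequence[str] = ()) -> str:
    """Hash one block from its parent hash, its token IDs, and any extra keys."""
    payload = f"{parent}|{list(token_ids)}|{list(extra_keys)}"
    return hashlib.sha256(payload.encode()).hexdigest()


def hash_prompt(token_ids: Sequence[int],
                image_hash: Optional[str] = None) -> List[str]:
    """Chain hashes over full blocks; fold image_hash into every block
    that contains placeholder tokens."""
    hashes: List[str] = []
    parent = None
    num_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, num_full, BLOCK_SIZE):
        block = list(token_ids[start:start + BLOCK_SIZE])
        covers_image = image_hash is not None and PLACEHOLDER_ID in block
        extra = (image_hash,) if covers_image else ()
        parent = block_hash(parent, block, extra)
        hashes.append(parent)
    return hashes


# Two requests with identical token IDs but different pictures.
prompt = [1, 2, 3, 4] + [PLACEHOLDER_ID] * 4 + [5, 6, 7, 8]
ids_only_a = hash_prompt(prompt)         # token IDs alone: request A
ids_only_b = hash_prompt(prompt)         # token IDs alone: request B, same hashes
with_cat = hash_prompt(prompt, "hash-of-cat-image")
with_dog = hash_prompt(prompt, "hash-of-dog-image")
```

With token IDs alone the two requests are indistinguishable. Once the content hash is folded in, the leading text block still matches (and can be shared), while the placeholder block and every block chained after it diverge.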
Core (trace the code)
- _get_prompt_updates (llava.py:264): where does the expansion count come from, and why does Pixtral (:390) need PromptUpdateDetails.select_token_id / is_embed?
- Walk EncoderCacheManager.check_and_update_cache (:91) for a request whose image is cached but currently unreferenced. Which structures change, and what is the Phase-2 analogue of this transition?
- _try_schedule_encoder_inputs (scheduler.py:1096): an image's placeholder starts at token 5000, the request has computed 2000 tokens, and this step's chunk is 2048. What happens to the image, and to num_new_tokens?
- The scheduler's manager tracks hashes but the runner holds tensors. Trace how an eviction decided by the scheduler reaches the worker (get_freed_mm_hashes → scheduler.py:901 → runner).
Solution sketches
- From ProcessingInfo.get_num_image_tokens → the vision encoder's patch math (model config, image size). Pixtral interleaves [image_break_id] after each patch row, so not every position in the range receives an embedding; is_embed marks which do.
- The hash is popped from freeable (it was an eviction candidate), its embed count is subtracted from num_freeable_slots, and the request ID joins cached[mm_hash]. Phase-2 analogue: BlockPool.touch, which resurrects a cached block from the free queue by bumping ref_cnt 0→1.
- The chunk window [2000, 4048) doesn't reach offset 5000, so the image is not scheduled and num_new_tokens is untouched (truncation only happens when the window overlaps an image that fails budget/cache checks). Later steps advance the window; the step whose window first overlaps 5000 must schedule the image or truncate at it.
- The manager appends evicted hashes to freed; each step, get_freed_mm_hashes() drains the list into SchedulerOutput.free_encoder_mm_hashes, and workers delete those keys from their encoder_cache dict. Scheduler owns accounting, workers own memory: the same split as KV blocks.
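The resurrection and eviction-drain transitions described above can be traced in a toy model. This is a sketch under stated assumptions, not upstream's EncoderCacheManager: ToyEncoderCache, its allocate and free methods, the num_free_slots and sizes fields, and the slot numbers are hypothetical; only the field names cached, freeable, freed, and num_freeable_slots follow the sketches above.

```python
from collections import OrderedDict


class ToyEncoderCache:
    """Scheduler-side accounting only: hashes and slot counts, no tensors."""

    def __init__(self, capacity: int):
        self.num_free_slots = capacity      # slots holding nothing
        self.num_freeable_slots = capacity  # free slots + slots reclaimable by eviction
        self.cached = {}                    # mm_hash -> set of referencing request IDs
        self.sizes = {}                     # mm_hash -> number of embeddings
        self.freeable = OrderedDict()       # unreferenced entries, oldest first
        self.freed = []                     # evicted hashes not yet sent to workers

    def check_and_update_cache(self, req_id: str, mm_hash: str) -> bool:
        if mm_hash not in self.cached:
            return False
        if not self.cached[mm_hash]:
            # Cached but unreferenced: pull it back out of the eviction candidates.
            self.num_freeable_slots -= self.freeable.pop(mm_hash)
        self.cached[mm_hash].add(req_id)
        return True

    def allocate(self, req_id: str, mm_hash: str, num_embeds: int) -> None:
        if num_embeds > self.num_freeable_slots:
            raise ValueError("does not fit even after evicting every candidate")
        while num_embeds > self.num_free_slots:
            victim, size = self.freeable.popitem(last=False)  # oldest first
            del self.cached[victim], self.sizes[victim]
            self.num_free_slots += size
            self.freed.append(victim)
        self.num_free_slots -= num_embeds
        self.num_freeable_slots -= num_embeds
        self.cached[mm_hash] = {req_id}
        self.sizes[mm_hash] = num_embeds

    def free(self, req_id: str, mm_hash: str) -> None:
        refs = self.cached[mm_hash]
        refs.discard(req_id)
        if not refs:
            # Last reference gone: keep the entry, but make it evictable.
            self.freeable[mm_hash] = self.sizes[mm_hash]
            self.num_freeable_slots += self.sizes[mm_hash]

    def get_freed_mm_hashes(self) -> list:
        freed, self.freed = self.freed, []
        return freed


cache = ToyEncoderCache(capacity=10)
worker_encoder_cache = {"imgA": "tensor-A"}   # worker-side dict holding the tensors

cache.allocate("r1", "imgA", 6)
cache.free("r1", "imgA")                      # imgA is now cached but unreferenced
hit = cache.check_and_update_cache("r2", "imgA")   # resurrection
slots_after_hit = cache.num_freeable_slots

cache.free("r2", "imgA")
cache.allocate("r3", "imgB", 8)               # needs 8 slots, so imgA is evicted
to_free = cache.get_freed_mm_hashes()
for mm_hash in to_free:                       # the worker acts on the scheduler's list
    worker_encoder_cache.pop(mm_hash, None)
```

The scheduler-side object never touches a tensor; the worker only learns about the eviction through the drained list, which is the accounting/memory split the last sketch describes.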
Build (your lab)
- In lab-01, compute: at block_size 16, how many KV blocks does one LLaVA image (576 tokens) cost, and what fraction of a 7B model's typical 8 GiB KV budget is 50 cached image-bearing prompts of 1000 tokens each?
- Extend your mini-build's cache with a stats() method (hits, misses, evictions, occupancy) and write a test that drives hit-rate from 0% to >80% with a zipfian image distribution. Why is zipfian the realistic assumption?
- In lab-03, construct a request where the encoder budget forces the image to wait one step but the cache-space check would have passed. Verify text progress continues. Then flip it: cache full, budget free. What's the user-visible difference?
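For the hit-rate test, one possible starting point is a seeded zipfian sampler over image IDs. This is only a sketch: zipf_sampler, the exponent s=1.1, and the population of 100 images are assumptions chosen for illustration, and wiring the sampler into your cache and its stats() method is left to the exercise.

```python
import random
from collections import Counter
from itertools import accumulate


def zipf_sampler(num_images: int, s: float = 1.1, seed: int = 0):
    """Return a function that draws image IDs 0..num_images-1 with
    probability proportional to 1 / rank**s (rank = ID + 1)."""
    weights = [1.0 / (rank ** s) for rank in range(1, num_images + 1)]
    cum_weights = list(accumulate(weights))
    rng = random.Random(seed)          # seeded so the test is reproducible
    population = range(num_images)

    def sample() -> int:
        return rng.choices(population, cum_weights=cum_weights, k=1)[0]

    return sample


sample = zipf_sampler(num_images=100)
draws = [sample() for _ in range(10_000)]
counts = Counter(draws)
```

A few head IDs dominate the draws while most IDs appear rarely, which is the shape the hit-rate test needs: the head produces hits once it is resident, and the tail produces misses and evictions.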
Solution sketches
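- 576/16 = 36 blocks for the image alone (38 with prompt rounding in the lab's setup). 50 × 1000 tokens ≈ 50 × 63 blocks = 3150 blocks. What fraction of 8 GiB that is depends on KV bytes per token: at ~16 KiB per token per layer (K+V, d=4096, fp16) × 32 layers = 512 KiB per token, it is 3150 × 16 × 512 KiB ≈ 24.6 GiB, about 3× the budget; with 8-KV-head GQA (128 KiB per token) it is ≈ 6.2 GiB, about 77%. Images eat KV budgets fast; exact numbers depend on the model, the point is the order of magnitude.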
- Real traffic repeats content (logos, screenshots, retried requests, multi-turn with the same image) with a long tail of singletons — zipf models that. Hits come from the head; the tail drives eviction churn.
- Both delay the image, not the text (truncate-at-doorstep). Budget-limited: resolves next step deterministically. Cache-limited: resolves only when another request frees embeddings, a potentially unbounded wait, which is why worst-case sizing at startup (compute_mm_encoder_budget) must guarantee a single max image always fits.
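The truncate-at-doorstep behavior can be sketched as a single decision function. This is a simplified model, not the logic of _try_schedule_encoder_inputs: Image, plan_step, and the reason strings are hypothetical, it handles one image per request, and it ignores clamping the chunk to the prompt length.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Image:
    start: int    # offset of the first placeholder token in the prompt
    length: int   # number of placeholder tokens (= embeddings to produce)


def plan_step(num_computed: int, chunk: int, image: Image,
              encoder_budget: int, cache_free_slots: int) -> Tuple[int, bool, str]:
    """Return (num_new_tokens, schedule_encoder, reason) for one step."""
    window_end = num_computed + chunk
    image_end = image.start + image.length
    # The window [num_computed, window_end) does not touch the image.
    if image.start >= window_end or image_end <= num_computed:
        return chunk, False, "no overlap"
    if image.length > encoder_budget:
        reason = "budget"
    elif image.length > cache_free_slots:
        reason = "cache"
    else:
        return chunk, True, "scheduled"
    # The image cannot run this step: stop the text right before it.
    return max(image.start - num_computed, 0), False, reason


img = Image(start=5000, length=576)
far_away = plan_step(2000, 2048, img, encoder_budget=576, cache_free_slots=576)
budget_limited = plan_step(4000, 2048, img, encoder_budget=0, cache_free_slots=576)
cache_limited = plan_step(4000, 2048, img, encoder_budget=576, cache_free_slots=0)
fits = plan_step(4000, 2048, img, encoder_budget=576, cache_free_slots=576)
```

In both limited cases the text still advances 1000 tokens, up to the image's first placeholder. The difference lies in what has to change before the next attempt succeeds: the per-step budget resets on its own, while cache space returns only when some other request releases its embeddings.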
Design (staff-level)
- Your fleet serves Qwen2-VL and users upload phone photos (12 MP). TTFT p99 is 4× worse than the text-only fleet. Walk the path pixels take and name the three biggest contributors + a mitigation for each.
- Design multi-tenant fairness for the encoder cache: tenant A uploads thousands of unique images (0% reuse), tenant B reuses a product catalog (90% reuse). What goes wrong with global LRU and what do you change?
- Should encoder outputs be prefix-cacheable across engine restarts (disk/remote)? Cost out the trade: embedding sizes vs re-encode time, and the consistency hazard the cache key must absorb.
- Video: 1 fps × 60 s × ~hundreds of tokens/frame. Which Phase-13 mechanisms break first, and what does that tell you about why encode-disaggregation (Phase 15) exists?
Solution sketches
- (a) Preprocessing/resize on CPU in the API process: move to async/parallel workers and downscale at the edge (Qwen2-VL token count ∝ pixels; cap max_pixels). (b) The ViT forward itself rides the first overlapping step: tune the encoder budget, batch encoder work, or disaggregate encode (Phase 15). (c) Token inflation: 12 MP → tens of thousands of LLM tokens of prefill; enforce resolution limits server-side. Chunked prefill spreads the work, but TTFT still pays for it.
- Global LRU lets A's unique-image churn evict B's hot catalog (cache pollution by zero-reuse traffic). Fixes: per-tenant quotas/partitions, an admission filter (only cache on second sight, tracked with a tiny bloom/ghost list), or weighted eviction favoring entries with reuse history.
- An embedding tensor for a 576-token image at d=4096 fp16 ≈ 4.7 MB, often larger than the JPEG, so at high load a remote fetch can take as long as re-encoding and lose to recompute. Worth it only for very hot content. The key must absorb model identity + weights version + preprocessor config (resize policy!); upstream's reset() on weight updates is the single-process version of that hazard.
- Encoder cache capacity (a minute of video ≈ tens of thousands of embeddings) and the per-step encoder budget (one step can't afford a frame burst) break first; KV inflation follows. When encode work rivals decode work, sharing one GPU starves both. That is precisely the case for a separate encode fleet with its own scaling (Phase 15's encode disaggregation, EPD).
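The size estimates in the last two sketches reduce to a few lines of arithmetic. This is a sketch with one loudly labeled assumption: TOKENS_PER_FRAME = 300 stands in for "hundreds of tokens/frame" and is not a measured value for any model; d=4096 and fp16 come from the sketch above.

```python
def embedding_bytes(num_tokens: int, hidden_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of an encoder output: one hidden_dim vector per token (fp16 by default)."""
    return num_tokens * hidden_dim * bytes_per_elem


# One LLaVA-style image: 576 tokens at d=4096 in fp16.
image_bytes = embedding_bytes(576, 4096)
image_mb = image_bytes / 1e6                 # about 4.7 MB

# One minute of video at 1 fps.
TOKENS_PER_FRAME = 300                       # ASSUMPTION: stands in for "hundreds"
num_frames = 1 * 60                          # 1 fps x 60 s
video_embeddings = num_frames * TOKENS_PER_FRAME
video_mb = embedding_bytes(video_embeddings, 4096) / 1e6
```

Under these assumptions a single minute of video occupies about as many encoder-cache slots as 31 still images, which is why cache capacity and the per-step encoder budget are the first mechanisms to strain.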
Self-grading
Exercises 4–7 (Core) and 11–14 (Design), counting from the first Warm-up question, are interview-grade. Could you whiteboard the splice (processor → expand → encode → overwrite) and both caches' keys from memory? If not, re-read 01-deep-dive.md §3–§5.
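As a memory aid for the splice, the four stages can be reduced to plain Python lists. This is a whiteboard-level sketch, not any model's code: IMAGE_TOKEN_ID, expand, fake_encode, and splice are hypothetical names, tuples stand in for embedding vectors, and the processor stage is assumed to have already produced the token IDs.

```python
from typing import Callable, List, Sequence, Tuple

IMAGE_TOKEN_ID = 32000   # hypothetical ID of the image placeholder token

Vector = Tuple[str, int]  # stand-in for an embedding vector


def expand(token_ids: Sequence[int], num_image_tokens: int) -> List[int]:
    """Expand: replace each image token with num_image_tokens copies of itself."""
    out: List[int] = []
    for tok in token_ids:
        out.extend([tok] * num_image_tokens if tok == IMAGE_TOKEN_ID else [tok])
    return out


def fake_encode(num_image_tokens: int) -> List[Vector]:
    """Encode: stand-in for vision encoder + projector, one vector per slot."""
    return [("img", i) for i in range(num_image_tokens)]


def splice(token_ids: Sequence[int], text_embed: Callable[[int], Vector],
           image_embeds: Sequence[Vector]) -> List[Vector]:
    """Overwrite: embed text tokens normally, fill placeholders in order."""
    image_iter = iter(image_embeds)
    return [next(image_iter) if tok == IMAGE_TOKEN_ID else text_embed(tok)
            for tok in token_ids]


prompt = [1, IMAGE_TOKEN_ID, 2]             # processor output: text, image, text
expanded = expand(prompt, num_image_tokens=3)
inputs_embeds = splice(expanded, lambda tok: ("text", tok), fake_encode(3))
```

Everything after inputs_embeds sees only a sequence of vectors, which is why the components past the embedding layer need no changes.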