Phase 13 Labs — Multimodal Models
Three labs on the trick that lets a text engine see: translate pictures into the core's one currency — tokens — at the boundary, and keep Phases 1–3 untouched. The arc: build the expansion that turns pixels into sequence length (lab-01), referee the collision between chunked prefill and the can't-encode-half-a-picture vision tower (lab-03), then run a real VLM and reconcile every number — the 1,421-token "one-line" prompt, the quadratic resize law, the encoder's TTFT spike (lab-02).
Recommended order: 01 → 03 → 02. CPU labs follow the standard contract —
starter.py (your work), solution.py (reference), test_lab.py (the spec); default
runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-13-multimodal-models/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q
Contents
- lab-01-image-token-expansion [CPU-OK]
- lab-02-run-a-vlm [GPU-OPT]
- lab-03-encoder-scheduling [CPU-OK]
- What you can do after this phase
Labs
lab-01-image-token-expansion [CPU-OK]
Pixels → patches → tokens → blocks: the ViT patch arithmetic (with its double-ceiling
traps), the placeholder splice that rewrites the prompt, and the PlaceholderRange
bookkeeping everything downstream navigates by. The punchline test: a "20-token
prompt" with one image is a 595-token request needing 38 KV blocks. Skills: the
quadratic resolution law; containment-by-translation as architecture; multi-image
offset shifting; validating counts at the boundary.
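The arithmetic behind that punchline can be sketched in a few lines. This is an illustration, not the lab's solution.py: the 336×336 image, 14-pixel patch, 16-token KV block, and the convention that the image marker is one of the 20 prompt tokens are all assumptions, chosen because they reproduce the 595-token, 38-block figures. The names image_tokens, kv_blocks, splice, and IMG are hypothetical; PlaceholderRange here is a stand-in dataclass.

```python
import math
from dataclasses import dataclass

PATCH = 14       # assumed ViT patch size in pixels
BLOCK_SIZE = 16  # assumed KV-cache block size in tokens
IMG = -1         # hypothetical token id marking "an image goes here"


@dataclass(frozen=True)
class PlaceholderRange:
    offset: int  # index of the first placeholder token in the expanded prompt
    length: int  # number of placeholder tokens for this image


def image_tokens(height: int, width: int, patch: int = PATCH) -> int:
    # First ceiling: a partial patch along an edge still costs a whole patch.
    return math.ceil(height / patch) * math.ceil(width / patch)


def kv_blocks(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    # Second ceiling: a partially filled block still occupies a whole block.
    return math.ceil(num_tokens / block_size)


def splice(prompt_ids: list[int], counts: list[int]):
    """Replace each single IMG marker with counts[i] placeholder tokens."""
    # Validate at the boundary: one token count per marker, no more, no fewer.
    if prompt_ids.count(IMG) != len(counts):
        raise ValueError("number of image markers != number of images")
    out: list[int] = []
    ranges: list[PlaceholderRange] = []
    image_index = 0
    for tok in prompt_ids:
        if tok == IMG:
            n = counts[image_index]
            image_index += 1
            # len(out) already includes every earlier expansion, so later
            # images get their offsets shifted automatically.
            ranges.append(PlaceholderRange(offset=len(out), length=n))
            out.extend([IMG] * n)
        else:
            out.append(tok)
    return out, ranges


# A "20-token prompt": 5 text tokens, one image marker, 14 more text tokens.
prompt = [1] * 5 + [IMG] + [1] * 14
expanded, ranges = splice(prompt, [image_tokens(336, 336)])
print(len(expanded), kv_blocks(len(expanded)), ranges)
```

Under these assumptions the image costs 24 × 24 = 576 tokens, the request is 19 + 576 = 595 tokens, and ceil(595 / 16) = 38 blocks. One extra pixel of height (337×336) adds a whole row of patches, which is the kind of trap the double ceiling hides.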
lab-02-run-a-vlm [GPU-OPT]
Qwen2-VL-2B on a real photo: the ~30-token prompt arriving as 1,421 tokens, the ~4×
drop on halving resolution (predicted first, measured second), and the TTFT gap from
41 ms to 118 ms, about 77 ms that is the vision encoder on a clock. Plus the operational surfaces:
resize policy as the cheapest capacity lever, limit_mm_per_prompt, and the
three-cache stack (processor / encoder / prefix). Annotated capture included.
Skills: auditing a processor's decisions; segmenting TTFT by has-image; the
quality cliff in resize tuning.
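The "predict first" step can be sketched before touching a GPU. The model below is a simplification, not the Qwen2-VL processor: it assumes one token per 28×28-pixel cell with each side rounded to the nearest cell, and it ignores pixel-count clamps and special tokens. The function names and the image sizes are hypothetical; the two TTFT values are the ones quoted above.

```python
from statistics import mean

CELL = 28  # assumed pixels per token along each side


def grid_tokens(height: int, width: int, cell: int = CELL) -> int:
    # Each side snaps to a whole number of cells; the token count is the area.
    rows = max(1, round(height / cell))
    cols = max(1, round(width / cell))
    return rows * cols


def halving_ratio(height: int, width: int) -> float:
    # Tokens scale with area, so halving both sides should give roughly 4x fewer.
    return grid_tokens(height, width) / grid_tokens(height // 2, width // 2)


def mean_ttft_by_has_image(samples: list[tuple[bool, float]]) -> dict[bool, float]:
    # samples: (has_image, ttft_ms) pairs; returns the mean TTFT per segment.
    groups: dict[bool, list[float]] = {}
    for has_image, ttft_ms in samples:
        groups.setdefault(has_image, []).append(ttft_ms)
    return {key: mean(values) for key, values in groups.items()}


print(halving_ratio(1120, 1120))  # cell-aligned sides: exactly 4.0
print(halving_ratio(1000, 750))   # unaligned sides: close to 4, not exact

segments = mean_ttft_by_has_image([(False, 41.0), (True, 118.0)])
print(segments[True] - segments[False])  # encoder share of TTFT, in ms
```

Rounding to whole cells is why the measured drop is "~4×" rather than exactly 4×. Segmenting by has-image keeps the encoder's cost from hiding inside a blended average.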
lab-03-encoder-scheduling [CPU-OK]
The collision: chunked prefill slices anywhere, but you can't encode half a picture. Implement V1's answer — per-step encoder budget, all-or-nothing encodes scheduled when a chunk enters a placeholder, truncate-at-the-doorstep when unaffordable, and the encoder cache that restores mid-placeholder freedom. Seven scenarios from pure-text to the zero-budget starvation edge. Skills: a third resource ledger; the cache-at-the-granularity-boundary pattern; why VLM prefills stall one token before their image.
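A minimal sketch of that scheduling rule for a single request follows. It is not vLLM's scheduler or the lab's reference: Placeholder, Request, schedule_step, and commit are hypothetical names, placeholders are assumed sorted by offset, and an encode is assumed to cost one unit of encoder budget per placeholder token.

```python
from dataclasses import dataclass, field


@dataclass
class Placeholder:
    offset: int  # first placeholder token in the expanded prompt
    length: int  # placeholder tokens; also the assumed encoder cost


@dataclass
class Request:
    prompt_len: int
    placeholders: list[Placeholder]  # assumed sorted by offset
    computed: int = 0                # prompt tokens already prefilled
    encoded: set[int] = field(default_factory=set)  # cached encoder outputs


def schedule_step(req: Request, token_budget: int, encoder_budget: int):
    """Return (tokens to prefill this step, placeholder indices to encode)."""
    start = req.computed
    end = min(req.prompt_len, start + token_budget)
    to_encode: list[int] = []
    for i, ph in enumerate(req.placeholders):
        if ph.offset + ph.length <= start or ph.offset >= end:
            continue  # this chunk does not touch the placeholder
        if i in req.encoded:
            continue  # encoder cache hit: the chunk may slice it anywhere
        if ph.length > encoder_budget:
            # All-or-nothing: cannot afford the encode, so stop at the doorstep.
            end = max(start, ph.offset)
            break
        encoder_budget -= ph.length
        to_encode.append(i)
    return end - start, to_encode


def commit(req: Request, num_tokens: int, encoded_now: list[int]) -> None:
    # Record the step's work so the next step sees the cache and the progress.
    req.computed += num_tokens
    req.encoded.update(encoded_now)


req = Request(prompt_len=595, placeholders=[Placeholder(offset=5, length=576)])
step = schedule_step(req, token_budget=256, encoder_budget=576)
print(step)  # whole image encoded, chunk ends mid-placeholder
commit(req, *step)
print(schedule_step(req, token_budget=256, encoder_budget=0))  # cache hit
```

With an encoder budget of zero, a fresh copy of the same request prefills 5 tokens and then schedules 0 tokens on every later step. That is the starvation edge, and it shows the prefill stalling with its last computed token one position before the image.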
What you can do after this phase
Price an image (or a video) in tokens, blocks, and TTFT before deploying it; predict
and explain VLM capacity from the traffic's image-size distribution; tune the resize
policy, encoder budget, and per-prompt limits with eyes open; and read
vllm/multimodal/ plus the V1 encoder-scheduling path as machinery you've already
built small. Phase 14 goes inside the models themselves — including how a vision
tower bolts onto a language model in the first place.