Phase 13 Labs — Multimodal Models

Three labs on the trick that lets a text engine see: translate pictures into the core's one currency — tokens — at the boundary, and keep Phases 1–3 untouched. The arc: build the expansion that turns pixels into sequence length (lab-01), referee the collision between chunked prefill and the can't-encode-half-a-picture vision tower (lab-03), then run a real VLM and reconcile every number — the 1,421-token "one-line" prompt, the quadratic resize law, the encoder's TTFT spike (lab-02).

Recommended order: 01 → 03 → 02. CPU labs follow the standard contract: starter.py (your work), solution.py (reference), and test_lab.py (the spec). By default the tests run against the solution; set LAB_IMPL=starter to grade your own implementation.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-13-multimodal-models/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q
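
The LAB_IMPL switch can be pictured as a small selector that the test file consults before importing anything. The sketch below is an assumption about how such a switch might look, not the repository's actual loader; `pick_impl` is a hypothetical helper name.

```python
import os


def pick_impl(environ=os.environ):
    """Return the module name the tests should import.

    Hypothetical helper: defaults to the reference solution and only
    accepts the two names used by the lab contract.
    """
    impl = environ.get("LAB_IMPL", "solution")
    if impl not in {"solution", "starter"}:
        raise ValueError(f"LAB_IMPL must be 'solution' or 'starter', got {impl!r}")
    return impl


print(pick_impl({}))                       # solution
print(pick_impl({"LAB_IMPL": "starter"}))  # starter
```

Rejecting unknown values up front means a typo such as LAB_IMPL=stater fails loudly rather than silently grading the reference.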

Labs

lab-01-image-token-expansion [CPU-OK]

Pixels → patches → tokens → blocks: the ViT patch arithmetic (with its double-ceiling traps), the placeholder splice that rewrites the prompt, and the PlaceholderRange bookkeeping everything downstream navigates by. The punchline test: a "20-token prompt" with one image is a 595-token request needing 38 KV blocks. Skills: the quadratic resolution law; containment-by-translation as architecture; multi-image offset shifting; validating counts at the boundary.
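
The arithmetic behind the punchline can be sketched in a few lines. The 336×336 image, 14-pixel patch, and 16-token KV block used here are assumptions chosen because they reproduce the 595-token / 38-block figures; the lab's real parameters may differ. `PlaceholderRange` below is a simplified stand-in with just an offset and a length, and `splice` is a hypothetical helper name.

```python
from dataclasses import dataclass


def ceil_div(a, b):
    """Integer ceiling division (avoids float rounding)."""
    return -(-a // b)


@dataclass(frozen=True)
class PlaceholderRange:
    offset: int  # index of the first image token in the expanded prompt
    length: int  # number of image tokens


def image_tokens(height, width, patch=14):
    # First ceiling: a partial patch along either axis still costs a full patch.
    return ceil_div(height, patch) * ceil_div(width, patch)


def splice(prompt_ids, placeholder_id, counts):
    """Expand the i-th placeholder into counts[i] tokens and record where it landed."""
    if prompt_ids.count(placeholder_id) != len(counts):
        raise ValueError("placeholder count does not match number of images")
    out, ranges = [], []
    remaining = iter(counts)
    for tok in prompt_ids:
        if tok == placeholder_id:
            n = next(remaining)
            ranges.append(PlaceholderRange(len(out), n))
            out.extend([placeholder_id] * n)
        else:
            out.append(tok)
    return out, ranges


IMG = -1  # hypothetical placeholder token id
prompt = list(range(1, 8)) + [IMG] + list(range(8, 20))  # 20 tokens, one image
expanded, ranges = splice(prompt, IMG, [image_tokens(336, 336)])
blocks = ceil_div(len(expanded), 16)  # second ceiling: partial block costs a full block
print(len(prompt), len(expanded), blocks, ranges)

# Multi-image offset shifting: the second range moves right by the first expansion.
_, two = splice([1, IMG, 2, IMG, 3], IMG, [4, 9])
print(two)
```

Under these assumptions the 20-token prompt expands to 19 + 576 = 595 tokens, which rounds up to 38 blocks, and the second image in the two-image prompt starts at offset 6 rather than 3.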

lab-02-run-a-vlm [GPU-OPT]

Qwen2-VL-2B on a real photo: the ~30-token prompt arriving as 1,421 tokens, the ~4× drop on halving resolution (predicted first, measured second), and the 41 → 118 ms TTFT gap that is the vision encoder on a clock. Plus the operational surfaces: resize policy as the cheapest capacity lever, limit_mm_per_prompt, and the three-cache stack (processor / encoder / prefix). Annotated capture included. Skills: auditing a processor's decisions; segmenting TTFT by has-image; the quality cliff in resize tuning.
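
The "predicted first" half of the resize experiment is a one-function calculation. This sketch assumes each vision token covers a 28×28-pixel cell (14-pixel patches merged 2×2, which is my understanding of Qwen2-VL and should be treated as an assumption). It ignores the processor's dimension rounding, pixel-count clamps, and special tokens, so it will not reproduce 1,421 exactly.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def vision_tokens(height, width, cell=28):
    """Approximate image token count: one token per (cell x cell) pixel tile."""
    return ceil_div(height, cell) * ceil_div(width, cell)


full = vision_tokens(1000, 750)
half = vision_tokens(500, 375)
print(full, half, round(full / half, 2))
```

With a 1000×750 input the estimate is 972 tokens at full size and 252 at half size, a ratio of about 3.86. The ratio is close to 4 rather than exactly 4 because the per-axis ceilings round up more, proportionally, on the smaller image.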

lab-03-encoder-scheduling [CPU-OK]

The collision: chunked prefill slices anywhere, but you can't encode half a picture. Implement V1's answer — per-step encoder budget, all-or-nothing encodes scheduled when a chunk enters a placeholder, truncate-at-the-doorstep when unaffordable, and the encoder cache that restores mid-placeholder freedom. Seven scenarios from pure-text to the zero-budget starvation edge. Skills: a third resource ledger; the cache-at-the-granularity-boundary pattern; why VLM prefills stall one token before their image.
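
The four rules above can be condensed into a single per-request decision. What follows is a minimal sketch under stated assumptions, not the reference solution: placeholders are sorted by offset, the encoder budget is counted in image tokens, the cache is keyed by image index, and `schedule_step` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceholderRange:
    offset: int
    length: int


def schedule_step(num_computed, num_new_tokens, placeholders, encoder_budget, encoder_cache):
    """Return (tokens_to_prefill, image_indices_to_encode, remaining_encoder_budget)."""
    to_encode = []
    for i, ph in enumerate(placeholders):
        start, end = ph.offset, ph.offset + ph.length
        if start >= num_computed + num_new_tokens:
            break  # the chunk stops before this image, and before every later one
        if end <= num_computed:
            continue  # prefill is already past this image
        if i in encoder_cache:
            continue  # encoder output is cached, so any slice point is fine
        if ph.length > encoder_budget:
            # All-or-nothing: cannot afford the encode, so stop at the doorstep.
            num_new_tokens = max(start - num_computed, 0)
            break
        encoder_budget -= ph.length
        to_encode.append(i)
    return num_new_tokens, to_encode, encoder_budget


image = [PlaceholderRange(offset=10, length=50)]
print(schedule_step(0, 8, image, 0, set()))    # pure-text chunk, image untouched
print(schedule_step(0, 32, image, 64, set()))  # affordable: encode, keep full chunk
print(schedule_step(0, 32, image, 40, set()))  # unaffordable: truncate to 10 tokens
print(schedule_step(32, 32, image, 0, {0}))    # cached: resume mid-placeholder
print(schedule_step(10, 32, image, 0, set()))  # zero budget at the doorstep: 0 tokens
```

The third call shows the stall described above: the chunk is cut to the 10 text tokens before the placeholder, so the prefill ends one token short of the image. The last call is the starvation edge, where the request makes no progress until some encoder budget is available.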

What you can do after this phase

Price an image (or a video) in tokens, blocks, and TTFT before deploying it; predict and explain VLM capacity from the traffic's image-size distribution; tune the resize policy, encoder budget, and per-prompt limits with eyes open; and read vllm/multimodal/ plus the V1 encoder-scheduling path as machinery you've already built small. Phase 14 goes inside the models themselves — including how a vision tower bolts onto a language model in the first place.
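
As an illustration of predicting capacity from an image-size distribution, the sketch below converts a traffic mix into mean KV blocks per request and then into a rough concurrency ceiling. Every number in it is made up for the example: the 14-pixel patch, 16-token block, 19 text tokens per request, the two image sizes, and the 10,000-block pool are all assumptions. It also ignores decode growth and prefix sharing.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def blocks_per_request(text_tokens, image_hw, patch=14, block_size=16):
    """KV blocks needed to hold one request's prompt (text plus one image)."""
    height, width = image_hw
    image_tokens = ceil_div(height, patch) * ceil_div(width, patch)
    return ceil_div(text_tokens + image_tokens, block_size)


def max_concurrent(total_blocks, traffic, text_tokens=19):
    """traffic is a list of ((height, width), fraction_of_requests) pairs."""
    mean_blocks = sum(
        fraction * blocks_per_request(text_tokens, hw) for hw, fraction in traffic
    )
    return int(total_blocks // mean_blocks), mean_blocks


traffic = [((336, 336), 0.5), ((672, 672), 0.5)]  # illustrative mix
capacity, mean_blocks = max_concurrent(10_000, traffic)
print(capacity, mean_blocks)
```

In this mix the larger images need 146 blocks against 38 for the smaller ones, so they dominate the mean of 92 blocks per request and cap the pool at roughly 108 concurrent prompts. Shifting the resize policy toward the smaller size moves that ceiling far more than trimming text would.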