Lab 13-01 — Image-Token Expansion: Where Pictures Become Sequence Length `[CPU-OK]`

Here is the entire secret of multimodal serving, and it fits in one sentence: to the engine, an image is just tokens. The user's prompt says <image>; the processor replaces that single placeholder with N placeholder positions (144 for a 336×336 LLaVA-style image, 576+ for high-res); the vision encoder's embeddings will occupy those positions; and from that moment every subsystem you've built in this course — the scheduler's token budget (Phase 3), KV blocks (Phase 2), TTFT arithmetic (Phase 1) — treats them like any other tokens. This lab implements the expansion: the patch arithmetic that converts pixels to a token count, the splice that rewrites the prompt, and the PlaceholderRange bookkeeping that remembers where each image lives — the exact data structure upstream uses.

Why this lab exists
Background: pixels → patches → tokens → blocks
Files
Run
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

Multimodal capacity surprises kill deployments. A chat service adds image support; the prompt text barely grows, yet TTFT triples and concurrency halves, because every image silently adds hundreds of tokens that nobody counted. The arithmetic in this lab is the inoculation: image_token_count tells you what a resolution costs, test_resolution_is_quadratic_cost makes the scaling law visceral (double both sides, 4× the bill), and test_the_scheduler_sees_only_the_expanded_length delivers the capacity-planning punchline: a "20-token prompt" with one image is a 595-token request needing 38 KV blocks instead of 2 (at 16 tokens per block). Run your traffic's image-size distribution through the lab's three functions (image_token_count, expand_prompt, kv_blocks_needed) before you ship a VLM, and Phase 0 lab-02's concurrency math stays honest.
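
The block arithmetic behind that punchline can be sketched in a few lines. This is a minimal illustration, not the reference solution: the name kv_blocks_needed comes from starter.py, but the signature and the default block size of 16 are assumptions (16 is the only block size for which 20 tokens need 2 blocks and 595 tokens need 38).

```python
def kv_blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Ceil-divide a token count into fixed-size KV blocks.

    block_size=16 is an assumption chosen to match the figures in the text.
    """
    # Negating twice turns Python's floor division into a ceiling division.
    return -(-num_tokens // block_size)


text_only_blocks = kv_blocks_needed(20)    # what the "20-token prompt" appears to cost
expanded_blocks = kv_blocks_needed(595)    # what the scheduler actually has to place

print(text_only_blocks, expanded_blocks)   # 2 38
```

The same function serves text and image tokens alike, which is the point: once the prompt is expanded, the block count depends only on the sequence length.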

The deeper design point: expansion is how multimodality gets contained. The engine's core (scheduler, KV manager, attention) never learns what an image is — it sees a longer token sequence plus an opaque side-channel (the embeddings, delivered by lab-03's encoder scheduling). That containment is why vLLM could add vision, audio, and video without rewriting Phases 1–3, and it's the architectural pattern to copy: translate the exotic thing into the core's existing currency at the boundary.

Background: pixels → patches → tokens → blocks

The pipeline, stage by stage:

Pixels → patches: ViT encoders slice the image into patch × patch pixel tiles (14 px is the common size) — ceil(side / patch) per dimension. A 336×336 image: 24×24 = 576 patches.
Patches → tokens: many modern VLMs (the Qwen-VL family and others) then merge merge × merge neighborhoods (pixel-unshuffle / spatial merge) to shrink the sequence: 24×24 → 12×12 = 144 tokens. Both divisions round up (ceil), so odd sizes round up at each stage: with 14-px patches and a 2×2 merge, a 337-px side gives 25 patches and then 13 merged positions. test_patch_arithmetic's 337-pixel case pins this double ceiling (a classic off-by-one source when re-implementing processors).
Tokens → the sequence: each <image> occurrence in the tokenized prompt is replaced by its image's count of sentinel ids, and a PlaceholderRange(offset, length) records the span — the coordinates lab-03's encoder scheduling and the model runner's embedding-scatter both navigate by. Multi-image prompts produce ordered, disjoint ranges whose offsets shift by earlier expansions (test_multi_image_ranges_are_ordered_and_disjoint pins the shift).
Sequence → blocks: Phase 2's ceil-div, unchanged — image KV is KV.
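
The first three stages can be sketched as follows. Treat this as an illustration under stated assumptions rather than the contents of solution.py: the function names image_token_count and expand_prompt and the PlaceholderRange(offset, length) fields come from this lab, while the signatures, the defaults (patch=14, merge=2), the helper ceil_div, and the id IMAGE_TOKEN_ID for the tokenized <image> marker are hypothetical choices made for the example.

```python
from dataclasses import dataclass

IMAGE_TOKEN_ID = 32000  # hypothetical id of the tokenized <image> marker
SENTINEL_ID = -100      # id written into every expanded placeholder position


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (hypothetical helper)."""
    return -(-a // b)


@dataclass(frozen=True)
class PlaceholderRange:
    offset: int  # index of the first sentinel in the expanded sequence
    length: int  # number of sentinel positions belonging to this image


def image_token_count(width: int, height: int, patch: int = 14, merge: int = 2) -> int:
    # Stage 1: pixels -> patches, rounding up per dimension.
    patches_w, patches_h = ceil_div(width, patch), ceil_div(height, patch)
    # Stage 2: patches -> tokens, rounding up again after the spatial merge.
    return ceil_div(patches_w, merge) * ceil_div(patches_h, merge)


def expand_prompt(token_ids, image_sizes):
    """Replace each image marker with its image's run of sentinel ids."""
    assert token_ids.count(IMAGE_TOKEN_ID) == len(image_sizes), \
        "number of <image> placeholders must equal number of images"
    expanded, ranges = [], []
    sizes = iter(image_sizes)
    for tok in token_ids:
        if tok == IMAGE_TOKEN_ID:
            n = image_token_count(*next(sizes))
            # The offset is measured in the expanded sequence, so earlier
            # expansions shift later ranges automatically.
            ranges.append(PlaceholderRange(offset=len(expanded), length=n))
            expanded.extend([SENTINEL_ID] * n)
        else:
            expanded.append(tok)
    return expanded, ranges


print(image_token_count(336, 336))  # 144
print(image_token_count(337, 337))  # 169: 25 patches, then 13 merged, per side

prompt = [1, 2, IMAGE_TOKEN_ID, 3, IMAGE_TOKEN_ID, 4]
expanded, ranges = expand_prompt(prompt, [(336, 336), (28, 28)])
print(len(expanded), ranges)
```

In the two-image example the second range starts at offset 147, not at the marker's original index 4, because the first image's 144 sentinels come before it. That shift is the behavior the multi-image test pins.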

Files

starter.py — image_token_count, expand_prompt, kv_blocks_needed. Your work.
solution.py — reference.
test_lab.py — the patch arithmetic (with the ceiling traps), quadratic scaling, the splice, multi-image ranges, the count-mismatch assert, and the capacity punchline.

Run

LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q
pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q   # reference

What the tests prove

Test	What it pins
`test_patch_arithmetic`	336×336 → 144 (the LLaVA-ish number you'll see in lab-02's capture), the 1-token thumbnail floor, and ceiling-at-both-stages
`test_resolution_is_quadratic_cost`	2× resolution = 4× tokens — why high-res modes are a capacity feature, not just a quality feature, and why production VLM configs cap image size
`test_expansion_splices_in_place`	The rewrite, exactly: text before, N sentinels, text after
`test_multi_image_ranges_are_ordered_and_disjoint`	Range offsets account for earlier expansions; every placeholder position holds the sentinel. The bookkeeping the embedding scatter trusts
`test_mismatched_counts_assert`	N placeholders demand N images — the validation that turns a garbled request into a clean 400 error instead of a runtime tensor-shape crash three layers deep
`test_the_scheduler_sees_only_the_expanded_length`	20 "text" tokens + 1 image = 595 tokens, 38 blocks. The line item your capacity model was missing

Hitchhiker's notes

Where this lives upstream: the per-model processor (upstream/vllm/model_executor/models/<model>.py plus upstream/vllm/multimodal/processing.py) performs exactly this expansion at request-arrival time, emitting PlaceholderRanges (upstream/vllm/multimodal/inputs.py, same fields as yours). The count formula is model-specific (this is most of what differs between the LLaVA, Qwen-VL, and Pixtral processors); the splice machinery is shared.
The sentinel id never reaches the embedding table. At runtime the model runner computes text embeddings for real ids and scatters the encoder's output over the placeholder positions (get_input_embeddings with the ranges as the map). Your -100 plays the role of upstream's reserved placeholder id and is chosen, like all sentinels, to be un-confusable with a real token (Phase 2's null block, Phase 9's -1 EOS: the course's sentinel family grows).
Prefix caching works for images — with one amendment you can now predict (Phase 2 lab-05's "anything that changes what KV means"): the block hash must include the image content hash, or two prompts with identical text but different pictures would share KV. vLLM hashes the multimodal items into the chain; same-image-same-text re-requests (retries, multi-turn over one photo) hit cache like any system prompt.
Variable-resolution schemes (dynamic tiling à la InternVL/GPT-4V's "high-res crops") are this lab's formula applied per tile plus a global thumbnail — the token count becomes data-dependent, which is exactly why upstream processors compute counts from actual image dimensions instead of constants, and why your capacity model must use the traffic's real size distribution.
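
The prefix-caching amendment from the notes above can be made concrete with a small sketch. It is not vLLM's hashing code: block_hashes, its arguments, the use of SHA-256, the block size of 16, and the choice to hash only full blocks are all assumptions made for illustration. It shows only the idea that a block overlapping an image's placeholder range must mix that image's content hash into the chain.

```python
import hashlib


def block_hashes(token_ids, ranges, image_hashes, block_size=16):
    """Chained hashes for each full block of a (hypothetical) expanded prompt.

    ranges:       list of (offset, length) placeholder spans, one per image
    image_hashes: list of bytes, one content hash per image, in the same order
    """
    hashes = []
    parent = b""  # the chain: each block hash covers everything before it
    for b in range(len(token_ids) // block_size):
        start, end = b * block_size, (b + 1) * block_size
        h = hashlib.sha256()
        h.update(parent)
        h.update(repr(token_ids[start:end]).encode())
        for (offset, length), img_hash in zip(ranges, image_hashes):
            # Sentinel ids are identical for every image, so any block that
            # overlaps a placeholder span also hashes the image content.
            if offset < end and start < offset + length:
                h.update(img_hash)
        parent = h.digest()
        hashes.append(parent)
    return hashes


# 16 text tokens, one 16-token image, 16 more text tokens: three full blocks.
tokens = list(range(16)) + [-100] * 16 + list(range(16, 32))
span = [(16, 16)]
cat = block_hashes(tokens, span, [hashlib.sha256(b"cat pixels").digest()])
dog = block_hashes(tokens, span, [hashlib.sha256(b"dog pixels").digest()])

print(cat[0] == dog[0])  # True: the shared text prefix still hits the cache
print(cat[1] == dog[1])  # False: same sentinels, different picture
```

Because the hashes are chained, the text block after the image also differs between the two prompts, which is what you want: its KV was computed while attending to a different picture.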

Going further

Add aspect_preserving_resize(w, h, max_side) → new dims, then recompute the token bill — reproducing the resize-then-patch pipeline real processors run, and the knob (max_side) that trades quality for capacity.
Implement the embedding scatter: given text_emb (seq, d), image_emb (n, d), and a PlaceholderRange, produce the merged input — ~3 lines with numpy slicing, and you've written the runtime half of this lab's compile-time work.
Compute the KV-bytes per image (144 tokens × Phase 0 lab-02's per-token bytes) for a 7B model, then for video at 1 fps × 60 s. The result explains why video models lean so hard on token merging and why "just feed it the video" is a memory proposal, not a feature request.

References

upstream/vllm/multimodal/inputs.py — PlaceholderRange, the real one.
upstream/vllm/multimodal/processing.py — the expansion machinery (PromptReplacement and friends).
Liu et al., Visual Instruction Tuning (LLaVA, 2023) — the projector-into-the-token-stream design this lab models: https://arxiv.org/abs/2304.08485
Qwen team, Qwen2-VL (2024) — the 2×2 spatial merge and dynamic resolution: https://arxiv.org/abs/2409.12191
Lab-03 — who fills the placeholders, and what it costs the scheduler.

vLLM Mastery — From Zero to Maintainer