Lab 13-01 — Image-Token Expansion: Where Pictures Become Sequence Length [CPU-OK]
Here is the entire secret of multimodal serving, and it fits in one sentence: to the
engine, an image is just tokens. The user's prompt says <image>; the processor
replaces that single placeholder with N placeholder positions (144 for a 336×336
LLaVA-style image, 576+ for high-res); the vision encoder's embeddings will occupy
those positions; and from that moment every subsystem you've built in this course —
the scheduler's token budget (Phase 3), KV blocks (Phase 2), TTFT arithmetic
(Phase 1) — treats them like any other tokens. This lab implements the expansion: the
patch arithmetic that converts pixels to a token count, the splice that rewrites the
prompt, and the PlaceholderRange bookkeeping that remembers where each image lives —
the exact data structure upstream uses.
Contents
- Why this lab exists
- Background: pixels → patches → tokens → blocks
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Multimodal capacity surprises kill deployments. A chat service adds image support; the
prompt text barely grew, yet TTFT triples and concurrency halves — because every
image silently added hundreds of tokens that nobody counted. The arithmetic in this
lab is the inoculation: image_token_count tells you what a resolution costs,
test_resolution_is_quadratic_cost makes the scaling law visceral (double the sides,
4× the bill), and test_the_scheduler_sees_only_the_expanded_length delivers the
capacity-planning punchline: a "20-token prompt" with one image is a 595-token
request needing 38 KV blocks instead of 2 (at 16 tokens per block). Run your traffic's image-size distribution
through these three functions before you ship a VLM, and Phase 0 lab-02's concurrency
math stays honest.
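A minimal sketch of that capacity arithmetic, under two assumptions not stated above: the 20-token prompt includes the single `<image>` placeholder and the image expands to 576 tokens (so 20 - 1 + 576 = 595), and the KV block size is 16 tokens (the only block size for which 595 tokens need 38 blocks and 20 tokens need 2). `kv_blocks_needed` is one of this lab's function names, but the signature here is a guess; `expanded_length` is a hypothetical helper.

```python
BLOCK_SIZE = 16  # assumed: the block size implied by 595 tokens -> 38 blocks


def kv_blocks_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Ceil-divide a sequence length into fixed-size KV blocks."""
    return -(-num_tokens // block_size)


def expanded_length(prompt_tokens: int, image_token_counts: list[int]) -> int:
    """Hypothetical helper: each image swaps one placeholder token for its full count."""
    return prompt_tokens - len(image_token_counts) + sum(image_token_counts)


before = kv_blocks_needed(20)                  # what the text-only view predicts
after_len = expanded_length(20, [576])         # what the scheduler actually sees
after = kv_blocks_needed(after_len)
print(before, after_len, after)  # 2 595 38
```

Feeding a list of per-image counts drawn from real traffic into `expanded_length` is the same calculation at fleet scale.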
The deeper design point: expansion is how multimodality gets contained. The engine's core (scheduler, KV manager, attention) never learns what an image is — it sees a longer token sequence plus an opaque side-channel (the embeddings, delivered by lab-03's encoder scheduling). That containment is why vLLM could add vision, audio, and video without rewriting Phases 1–3, and it's the architectural pattern to copy: translate the exotic thing into the core's existing currency at the boundary.
Background: pixels → patches → tokens → blocks
The pipeline, stage by stage:
- Pixels → patches: ViT encoders slice the image into `patch × patch` pixel tiles (14 px is the common size), `ceil(side / patch)` per dimension. A 336×336 image: 24×24 = 576 patches.
- Patches → tokens: many modern VLMs (Qwen-VL family and others) then merge `merge × merge` neighborhoods (pixel-unshuffle / spatial merge) to shrink the sequence: 24×24 → 12×12 = 144 tokens. Both divisions ceil: odd sizes round up at each stage, and `test_patch_arithmetic`'s 337-pixel case pins the double-ceiling (a classic off-by-one source when re-implementing processors).
- Tokens → the sequence: each `<image>` occurrence in the tokenized prompt is replaced by its image's count of sentinel ids, and a `PlaceholderRange(offset, length)` records the span, the coordinates that lab-03's encoder scheduling and the model runner's embedding-scatter both navigate by. Multi-image prompts produce ordered, disjoint ranges whose offsets shift by earlier expansions (`test_multi_image_ranges_are_ordered_and_disjoint` pins the shift).
- Sequence → blocks: Phase 2's ceil-div, unchanged. Image KV is KV.
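The pixels → patches → tokens arithmetic above can be sketched as follows. This is an illustration, not the reference `solution.py`: the signature of `image_token_count` and the defaults `patch=14`, `merge=2` are assumptions chosen to reproduce the 336×336 → 144 figure, and `ceil_div` is a hypothetical helper.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def image_token_count(width: int, height: int, patch: int = 14, merge: int = 2) -> int:
    """Pixels -> patches -> merged tokens, rounding up at both stages."""
    patches_w = ceil_div(width, patch)    # stage 1: patch grid
    patches_h = ceil_div(height, patch)
    tokens_w = ceil_div(patches_w, merge)  # stage 2: spatial merge
    tokens_h = ceil_div(patches_h, merge)
    return tokens_w * tokens_h


print(image_token_count(336, 336))  # 144  (24x24 patches -> 12x12 tokens)
print(image_token_count(337, 337))  # 169  (25x25 patches -> 13x13 tokens)
print(image_token_count(672, 672))  # 576  (double the sides, 4x the tokens)
print(image_token_count(1, 1))      # 1    (the thumbnail floor)
```

The 337-pixel case is where a floor at either stage goes wrong: `337 // 14` gives 24 patches per side and therefore 144 tokens, 25 short of the count the ceiling version produces.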
Files
- `starter.py`: `image_token_count`, `expand_prompt`, `kv_blocks_needed`. Your work.
- `solution.py`: reference.
- `test_lab.py`: the patch arithmetic (with the ceiling traps), quadratic scaling, the splice, multi-image ranges, the count-mismatch assert, and the capacity punchline.
Run
LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q
pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| `test_patch_arithmetic` | 336×336 → 144 (the LLaVA-ish number you'll see in lab-02's capture), the 1-token thumbnail floor, and ceiling-at-both-stages |
| `test_resolution_is_quadratic_cost` | 2× resolution = 4× tokens: why high-res modes are a capacity feature, not just a quality feature, and why production VLM configs cap image size |
| `test_expansion_splices_in_place` | The rewrite, exactly: text before, N sentinels, text after |
| `test_multi_image_ranges_are_ordered_and_disjoint` | Range offsets account for earlier expansions; every placeholder position holds the sentinel. The bookkeeping the embedding scatter trusts |
| `test_mismatched_counts_assert` | N placeholders demand N images: the validation that turns a garbled request into a clean 400 error instead of a runtime tensor-shape crash three layers deep |
| `test_the_scheduler_sees_only_the_expanded_length` | 20 "text" tokens + 1 image = 595 tokens, 38 blocks. The line item your capacity model was missing |
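The splice, the range bookkeeping, and the count-mismatch assert that these tests pin could look roughly like the sketch below. It is not the reference solution: the `expand_prompt` signature is assumed, `IMAGE_PLACEHOLDER_ID = 32000` is a made-up id standing in for whatever the tokenizer assigns to `<image>`, and `PlaceholderRange` is a local dataclass with the `offset` and `length` fields named in this lab. The `-100` sentinel is the value the lab uses.

```python
from dataclasses import dataclass

IMAGE_PLACEHOLDER_ID = 32000  # hypothetical token id for "<image>"
SENTINEL_ID = -100            # the lab's sentinel, written at every expanded position


@dataclass(frozen=True)
class PlaceholderRange:
    offset: int  # index of the first sentinel in the expanded sequence
    length: int  # number of sentinel positions for this image


def expand_prompt(token_ids, image_token_counts):
    """Replace each placeholder with N sentinels; record where each image lands."""
    n_placeholders = sum(1 for t in token_ids if t == IMAGE_PLACEHOLDER_ID)
    assert n_placeholders == len(image_token_counts), (
        f"{n_placeholders} placeholders but {len(image_token_counts)} images"
    )
    expanded, ranges = [], []
    counts = iter(image_token_counts)
    for t in token_ids:
        if t == IMAGE_PLACEHOLDER_ID:
            n = next(counts)
            # len(expanded) already includes earlier expansions, so offsets shift.
            ranges.append(PlaceholderRange(offset=len(expanded), length=n))
            expanded.extend([SENTINEL_ID] * n)
        else:
            expanded.append(t)
    return expanded, ranges


prompt = [1, 2, IMAGE_PLACEHOLDER_ID, 3, IMAGE_PLACEHOLDER_ID, 4]
expanded, ranges = expand_prompt(prompt, [3, 2])
print(expanded)  # [1, 2, -100, -100, -100, 3, -100, -100, 4]
print(ranges)    # offsets 2 and 6: the second shifted by the first expansion
```

Recording the offset before extending the output is what keeps multi-image ranges ordered and disjoint without a second pass.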
Hitchhiker's notes
- Where this lives upstream: the per-model processor (`upstream/vllm/model_executor/models/<model>.py` + `vllm/multimodal/processing.py`) performs exactly this expansion at request-arrival time, emitting `PlaceholderRange`s (`vllm/multimodal/inputs.py`, same fields as yours). The count formula is model-specific (this is most of what differs between LLaVA, Qwen-VL, Pixtral processors); the splice machinery is shared.
- The sentinel id never reaches the embedding table. At runtime the model runner computes text embeddings for real ids and scatters the encoder's output over the placeholder positions (`get_input_embeddings` with the ranges as the map). Your `-100` is upstream's reserved placeholder id, chosen, like all sentinels, to be un-confusable with a real token (Phase 2's null block, Phase 9's `-1` EOS: the course's sentinel family grows).
- Prefix caching works for images, with one amendment you can now predict (Phase 2 lab-05's "anything that changes what KV means"): the block hash must include the image content hash, or two prompts with identical text but different pictures would share KV. vLLM hashes the multimodal items into the chain; same-image-same-text re-requests (retries, multi-turn over one photo) hit cache like any system prompt.
- Variable-resolution schemes (dynamic tiling à la InternVL/GPT-4V's "high-res crops") are this lab's formula applied per tile plus a global thumbnail — the token count becomes data-dependent, which is exactly why upstream processors compute counts from actual image dimensions instead of constants, and why your capacity model must use the traffic's real size distribution.
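The prefix-caching amendment in the notes above can be illustrated with a toy hash chain. This is a sketch of the idea only, not vLLM's hashing code: `block_hashes`, the tuple layout of `images`, the 4-token block size, and the use of SHA-256 over a `repr` are all assumptions made for the demo.

```python
import hashlib

BLOCK_SIZE = 4     # tiny, only to keep the demo readable
SENTINEL_ID = -100


def block_hashes(token_ids, images, block_size=BLOCK_SIZE):
    """Chain-hash full blocks, mixing in the content hash of any overlapping image.

    images: list of (offset, length, content_hash), one per placeholder range.
    """
    hashes, parent = [], b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        end = start + block_size
        # Content hashes of images whose placeholder range touches this block.
        extra = [h for (offset, length, h) in images
                 if offset < end and offset + length > start]
        payload = repr((token_ids[start:end], extra)).encode("utf-8")
        parent = hashlib.sha256(parent + payload).digest()
        hashes.append(parent)
    return hashes


# Identical token ids (the sentinels look the same), different pictures.
tokens = [1, 2, 3, 4] + [SENTINEL_ID] * 4 + [5, 6, 7, 8]
cat = block_hashes(tokens, [(4, 4, "hash-of-cat-pixels")])
dog = block_hashes(tokens, [(4, 4, "hash-of-dog-pixels")])
print(cat[0] == dog[0], cat[1] == dog[1], cat[2] == dog[2])  # True False False
```

The shared text prefix still hits cache, the image block diverges, and everything after it diverges too because each hash chains on its parent. Passing an empty `images` list for both requests would make all three hashes collide, which is the KV-sharing bug the amendment prevents.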
Going further
- Add `aspect_preserving_resize(w, h, max_side)` → new dims, then recompute the token bill, reproducing the resize-then-patch pipeline real processors run, and the knob (`max_side`) that trades quality for capacity.
- Implement the embedding scatter: given `text_emb (seq, d)`, `image_emb (n, d)`, and a `PlaceholderRange`, produce the merged input. It takes ~3 lines with numpy slicing, and you've written the runtime half of this lab's compile-time work.
- Compute the KV-bytes per image (144 tokens × Phase 0 lab-02's per-token bytes) for a 7B model, then for video at 1 fps × 60 s. The result explains why video models lean so hard on token merging and why "just feed it the video" is a memory proposal, not a feature request.
References
- `upstream/vllm/multimodal/inputs.py`: `PlaceholderRange`, the real one.
- `upstream/vllm/multimodal/processing.py`: the expansion machinery (`PromptReplacement` and friends).
- Liu et al., Visual Instruction Tuning (LLaVA, 2023), the projector-into-the-token-stream design this lab models: https://arxiv.org/abs/2304.08485
- Qwen team, Qwen2-VL (2024), the 2×2 spatial merge and dynamic resolution: https://arxiv.org/abs/2409.12191
- Lab-03: who fills the placeholders, and what it costs the scheduler.