Lab 13-01 — Image-Token Expansion: Where Pictures Become Sequence Length [CPU-OK]

Here is the entire secret of multimodal serving, and it fits in one sentence: to the engine, an image is just tokens. The user's prompt says <image>; the processor replaces that single placeholder with N placeholder positions (144 for a 336×336 LLaVA-style image, 576+ for high-res); the vision encoder's embeddings will occupy those positions; and from that moment every subsystem you've built in this course — the scheduler's token budget (Phase 3), KV blocks (Phase 2), TTFT arithmetic (Phase 1) — treats them like any other tokens. This lab implements the expansion: the patch arithmetic that converts pixels to a token count, the splice that rewrites the prompt, and the PlaceholderRange bookkeeping that remembers where each image lives — the exact data structure upstream uses.

Why this lab exists

Multimodal capacity surprises kill deployments. A chat service adds image support; the prompt text barely grows, yet TTFT triples and concurrency halves, because every image silently adds hundreds of tokens that nobody counted. The arithmetic in this lab is the inoculation: image_token_count tells you what a resolution costs, test_resolution_is_quadratic_cost makes the scaling law visceral (double the sides, 4× the bill), and test_the_scheduler_sees_only_the_expanded_length delivers the capacity-planning punchline: a "20-token prompt" with one image is a 595-token request needing 38 KV blocks instead of 2. Run your traffic's image-size distribution through the lab's three functions (image_token_count, expand_prompt, kv_blocks_needed) before you ship a VLM, and Phase 0 lab-02's concurrency math stays honest.
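The punchline numbers can be checked by hand. The following is a minimal sketch, not the lab's reference code: it assumes kv_blocks_needed is a plain ceiling division, that the block size is 16 tokens (the only size consistent with 38 blocks versus 2), and that the 20-token prompt includes the single <image> placeholder, which makes the image cost 576 tokens.

```python
def kv_blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Ceiling division: a partially filled block still occupies a whole block."""
    return -(-num_tokens // block_size)


prompt_tokens = 20   # what the user "sees", including one <image> placeholder (assumed)
image_tokens = 576   # assumed image cost, implied by 595 = 19 + 576

# The placeholder is replaced by the image's tokens, not kept alongside them.
expanded = prompt_tokens - 1 + image_tokens

print(expanded)                         # 595
print(kv_blocks_needed(prompt_tokens))  # 2
print(kv_blocks_needed(expanded))       # 38
```

The gap between 2 and 38 blocks is the line item that a text-only capacity model leaves out.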

The deeper design point: expansion is how multimodality gets contained. The engine's core (scheduler, KV manager, attention) never learns what an image is — it sees a longer token sequence plus an opaque side-channel (the embeddings, delivered by lab-03's encoder scheduling). That containment is why vLLM could add vision, audio, and video without rewriting Phases 1–3, and it's the architectural pattern to copy: translate the exotic thing into the core's existing currency at the boundary.

Background: pixels → patches → tokens → blocks

The pipeline, stage by stage:

  1. Pixels → patches: ViT encoders slice the image into patch × patch pixel tiles (14 px is the common size) — ceil(side / patch) per dimension. A 336×336 image: 24×24 = 576 patches.
  2. Patches → tokens: many modern VLMs (Qwen-VL family and others) then merge merge × merge neighborhoods (pixel-unshuffle / spatial merge) to shrink the sequence: 24×24 → 12×12 = 144 tokens. Both divisions use ceil, so odd sizes round up at each stage; test_patch_arithmetic's 337-pixel case pins the double ceiling (a classic off-by-one source when re-implementing processors).
  3. Tokens → the sequence: each <image> occurrence in the tokenized prompt is replaced by its image's count of sentinel ids, and a PlaceholderRange(offset, length) records the span — the coordinates lab-03's encoder scheduling and the model runner's embedding-scatter both navigate by. Multi-image prompts produce ordered, disjoint ranges whose offsets shift by earlier expansions (test_multi_image_ranges_are_ordered_and_disjoint pins the shift).
  4. Sequence → blocks: Phase 2's ceil-div, unchanged — image KV is KV.
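Stages 1 and 2 above can be sketched as a single function. This is an illustration under stated assumptions rather than the lab's solution: the signature, the default patch size of 14, and the default merge factor of 2 are chosen here to match the numbers in the text, and the starter file may define them differently.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def image_token_count(width: int, height: int, patch: int = 14, merge: int = 2) -> int:
    """Pixels -> patches -> merged tokens, rounding up at both stages."""
    patches_w = ceil_div(width, patch)    # stage 1: patches per row
    patches_h = ceil_div(height, patch)   # stage 1: patches per column
    tokens_w = ceil_div(patches_w, merge)  # stage 2: spatial merge, width
    tokens_h = ceil_div(patches_h, merge)  # stage 2: spatial merge, height
    return tokens_w * tokens_h


print(image_token_count(336, 336))  # 24x24 patches -> 12x12 = 144
print(image_token_count(337, 337))  # 25x25 patches -> 13x13 = 169 (double ceiling)
print(image_token_count(672, 672))  # 48x48 patches -> 24x24 = 576 (4x the 336 bill)
print(image_token_count(1, 1))      # 1 (the thumbnail floor)
```

One extra pixel per side moves the count from 144 to 169, which is why flooring at either stage produces a sequence that is too short for the encoder's output.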

Files

  • starter.py: image_token_count, expand_prompt, kv_blocks_needed. Your work.
  • solution.py: the reference implementation.
  • test_lab.py: the patch arithmetic (with the ceiling traps), quadratic scaling, the splice, multi-image ranges, the count-mismatch assert, and the capacity punchline.

Run

LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q
pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q   # reference

What the tests prove

| Test | What it pins |
|---|---|
| test_patch_arithmetic | 336×336 → 144 (the LLaVA-ish number you'll see in lab-02's capture), the 1-token thumbnail floor, and ceiling-at-both-stages |
| test_resolution_is_quadratic_cost | 2× resolution = 4× tokens: why high-res modes are a capacity feature, not just a quality feature, and why production VLM configs cap image size |
| test_expansion_splices_in_place | The rewrite, exactly: text before, N sentinels, text after |
| test_multi_image_ranges_are_ordered_and_disjoint | Range offsets account for earlier expansions; every placeholder position holds the sentinel. The bookkeeping the embedding scatter trusts |
| test_mismatched_counts_assert | N placeholders demand N images: the validation that turns a garbled request into a clean 400 error instead of a runtime tensor-shape crash three layers deep |
| test_the_scheduler_sees_only_the_expanded_length | 20 "text" tokens + 1 image = 595 tokens, 38 blocks. The line item your capacity model was missing |
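The splice and range bookkeeping that the middle three tests pin can be sketched as follows. Treat this as one possible shape, not the reference: the expand_prompt signature, the placeholder id 999, and the exact assertion message are hypothetical, while the -100 sentinel and the (offset, length) fields come from the text.

```python
from dataclasses import dataclass

SENTINEL = -100  # placeholder value written into every image position


@dataclass(frozen=True)
class PlaceholderRange:
    offset: int  # index of the first sentinel in the expanded sequence
    length: int  # number of positions this image occupies


def expand_prompt(token_ids, image_placeholder_id, image_token_counts):
    """Replace each placeholder with N sentinels and record where they landed."""
    n_placeholders = sum(1 for t in token_ids if t == image_placeholder_id)
    assert n_placeholders == len(image_token_counts), (
        f"{n_placeholders} placeholders but {len(image_token_counts)} images"
    )
    expanded, ranges = [], []
    counts = iter(image_token_counts)
    for t in token_ids:
        if t == image_placeholder_id:
            n = next(counts)
            # The offset is measured in the expanded sequence, so it already
            # includes the growth caused by earlier images.
            ranges.append(PlaceholderRange(offset=len(expanded), length=n))
            expanded.extend([SENTINEL] * n)
        else:
            expanded.append(t)
    return expanded, ranges


IMG = 999  # hypothetical id for the tokenized <image> placeholder
ids, ranges = expand_prompt([1, 2, IMG, 3, IMG, 4], IMG, [3, 2])
print(ids)     # [1, 2, -100, -100, -100, 3, -100, -100, 4]
print(ranges)  # offsets 2 and 6: the second image shifted by the first expansion
```

Recording offsets while building the output, rather than computing them from the original positions afterwards, is what keeps the ranges ordered and disjoint without a separate correction pass.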

Hitchhiker's notes

  • Where this lives upstream: the per-model processor (upstream/vllm/model_executor/models/<model>.py + vllm/multimodal/processing.py) performs exactly this expansion at request-arrival time, emitting PlaceholderRanges (vllm/multimodal/inputs.py — same fields as yours). The count formula is model-specific (this is most of what differs between LLaVA, Qwen-VL, Pixtral processors); the splice machinery is shared.
  • The sentinel id never reaches the embedding table. At runtime the model runner computes text embeddings for the real ids and scatters the encoder's output over the placeholder positions (get_input_embeddings, with the ranges as the map). Your -100 plays the role of upstream's reserved placeholder id: like every sentinel, it is chosen so that it cannot be confused with a real token (compare Phase 2's null block and Phase 9's -1 EOS; the course's sentinel family grows).
  • Prefix caching works for images — with one amendment you can now predict (Phase 2 lab-05's "anything that changes what KV means"): the block hash must include the image content hash, or two prompts with identical text but different pictures would share KV. vLLM hashes the multimodal items into the chain; same-image-same-text re-requests (retries, multi-turn over one photo) hit cache like any system prompt.
  • Variable-resolution schemes (dynamic tiling à la InternVL/GPT-4V's "high-res crops") are this lab's formula applied per tile plus a global thumbnail — the token count becomes data-dependent, which is exactly why upstream processors compute counts from actual image dimensions instead of constants, and why your capacity model must use the traffic's real size distribution.
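The prefix-caching amendment from the notes above can be illustrated with a toy hash chain. This is not vLLM's hashing code: the function names, the use of SHA-256, and the way the image hash is mixed in are all assumptions made for the sketch. It only demonstrates the property the note describes, namely that blocks of identical sentinel ids must hash differently when the pictures differ.

```python
import hashlib


def content_hash(image_bytes: bytes) -> bytes:
    """Hypothetical content hash of one image."""
    return hashlib.sha256(image_bytes).digest()


def block_hash(parent: bytes, token_ids, mm_hashes=()) -> bytes:
    """Chain hash of one KV block: parent hash + token ids + any image hashes."""
    h = hashlib.sha256()
    h.update(parent)
    h.update(repr(tuple(token_ids)).encode("utf-8"))
    for mm in mm_hashes:
        h.update(mm)
    return h.digest()


SENTINEL = -100
root = b"\x00" * 32
block = [SENTINEL] * 16  # a block made entirely of image placeholder positions

cat = content_hash(b"cat pixels")
dog = content_hash(b"dog pixels")

# Without the image hash, two different pictures collide on the same block hash.
naive_cat = block_hash(root, block)
naive_dog = block_hash(root, block)
print(naive_cat == naive_dog)  # True: the bug the amendment prevents

# With it, different pictures diverge and the same picture still hits the cache.
with_cat = block_hash(root, block, [cat])
with_dog = block_hash(root, block, [dog])
print(with_cat == with_dog)                        # False
print(with_cat == block_hash(root, block, [cat]))  # True
```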

Going further

  • Add aspect_preserving_resize(w, h, max_side) → new dims, then recompute the token bill — reproducing the resize-then-patch pipeline real processors run, and the knob (max_side) that trades quality for capacity.
  • Implement the embedding scatter: given text_emb (seq, d), image_emb (n, d), and a PlaceholderRange, produce the merged input — ~3 lines with numpy slicing, and you've written the runtime half of this lab's compile-time work.
  • Compute the KV-bytes per image (144 tokens × Phase 0 lab-02's per-token bytes) for a 7B model, then for video at 1 fps × 60 s. The result explains why video models lean so hard on token merging and why "just feed it the video" is a memory proposal, not a feature request.
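As a starting point for the last exercise, here is the arithmetic under loudly stated assumptions. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) is a Llama-2-7B-like configuration chosen for illustration; Phase 0 lab-02 may use different per-token bytes. The video case also assumes every frame costs the full 144 tokens with no temporal merging.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """One K and one V vector per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


MIB = 1024 * 1024

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 32, 128, 2)

tokens_per_image = 144                      # the 336x336 figure from this lab
per_image = tokens_per_image * per_token

frames = 1 * 60                             # 1 fps for 60 s, one image per frame (assumed)
per_video = frames * per_image

print(per_token)          # 524288 bytes, i.e. 0.5 MiB per token
print(per_image // MIB)   # 72 MiB per image
print(per_video // MIB)   # 4320 MiB, roughly 4.2 GiB for one minute of video
```

Under these assumptions a single one-minute clip consumes more KV memory than thousands of text tokens would, which is the pressure that pushes video models toward aggressive token merging.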

References

  • upstream/vllm/multimodal/inputs.py: PlaceholderRange, the real one.
  • upstream/vllm/multimodal/processing.py: the expansion machinery (PromptReplacement and friends).
  • Liu et al., Visual Instruction Tuning (LLaVA, 2023) — the projector-into-the-token-stream design this lab models: https://arxiv.org/abs/2304.08485
  • Qwen team, Qwen2-VL (2024) — the 2×2 spatial merge and dynamic resolution: https://arxiv.org/abs/2409.12191
  • Lab-03 — who fills the placeholders, and what it costs the scheduler.