Lab 13-03 — Encoder Scheduling: Chunked Prefill Meets the Vision Tower [CPU-OK]

Phase 3's chunked prefill rests on a freedom you probably never noticed it claiming: a prompt can be sliced anywhere. Multimodal revokes it. The positions inside a placeholder range (lab-01) get their embeddings from the vision encoder, and you cannot encode half a picture — the ViT runs on the whole image or not at all. So when a prefill chunk first reaches into an image's range, the engine faces a real scheduling decision: run the encoder this step (it costs real compute, governed by a per-step encoder budget), or truncate the chunk at the image's doorstep and try again next step. You'll implement that decision — vLLM V1's _try_schedule_encoder_inputs, distilled — including the piece that restores chunked prefill's freedom: the encoder cache, which lets later chunks continue mid-placeholder for free.

Why this lab exists

This is the lab where two phases collide and you get to be the referee. Phase 3 taught that chunk boundaries are arbitrary (the clamp doesn't care what token it stops at); lab-01 taught that some positions are image positions. The collision produces every behavior in this lab's test suite, and each one is a production symptom with a name: a VLM request whose prefill mysteriously stalls one token before its image (encoder budget exhausted — test_unaffordable_image_truncates_the_chunk), a step that schedules a 100-token encode for a chunk consuming only 40 image positions (test_entering_an_image_schedules_its_encoder — the encoder is all-or-nothing even when the decoder is incremental), a multi-image prompt that prefills image A this step and stops dead before image B (test_budget_splits_across_two_images).

The design lesson is the one the course keeps circling: vLLM did not forbid chunk boundaries inside images (which would couple the text scheduler to image geometry). It added a cache between the two engines — encode once, whole; consume incrementally, cached — so each side keeps its natural granularity. When two subsystems disagree about granularity, a cache at the boundary is usually the answer; this lab is the cleanest instance you'll ever implement.

Background: three resources now, not two

Phase 3's scheduler balanced the token budget and KV memory. Multimodal adds a third ledger, with its own units and its own cache:

  • Encoder budget (per step, in encoder tokens): the vision tower is real compute outside the LM's token budget — a step that encodes a 576-token image while also prefilling text is doing two models' work. Capping encoder work per step protects ITL exactly the way the token budget does (Phase 3 lab-05's argument, new actor).
  • Encoder cache (in encoder tokens of storage): outputs wait here between the encode and the chunks that consume them — and entries are freed once fully consumed. It's a third memory pool alongside KV blocks and LoRA slots (Phase 11 lab-04), with the same admission-pressure character.

The rule your plan_chunk implements, per placeholder the chunk would enter: cached → free; affordable → schedule the whole encode now (even for partial consumption); unaffordable → truncate the chunk to the placeholder's offset. And the invariant the truncation preserves is Phase 3's invariant, extended: a position is computed only when everything it needs exists — text positions need prior KV; image positions need their encoding. Same race of counters, one more prerequisite.
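
The three cases can be sketched as below. This is a minimal illustration, not the lab's reference solution: the `Placeholder` dataclass, the argument names, and the `(tokens, encodes)` return shape are assumptions made for this example, and the signature that the tests expect is the one in starter.py's docstring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placeholder:
    """Hypothetical stand-in for lab-01's PlaceholderRange."""
    offset: int  # first prompt position covered by the image
    length: int  # image positions, taken here to equal encoder tokens


def plan_chunk(num_computed, chunk_len, placeholders, cached, encoder_budget):
    """Return (tokens_to_compute, indices_of_encodes_to_schedule).

    num_computed:   prompt positions already computed in earlier steps
    chunk_len:      tokens the token budget would allow this step
    placeholders:   list of Placeholder, assumed sorted by offset
    cached:         set of placeholder indices already in the encoder cache
    encoder_budget: encoder tokens available this step
    """
    start, end = num_computed, num_computed + chunk_len
    to_encode = []
    for i, ph in enumerate(placeholders):
        if ph.offset + ph.length <= start:
            continue  # image fully consumed by earlier chunks
        if ph.offset >= end:
            break  # the chunk never reaches this image
        if i in cached:
            continue  # cached: free, even mid-placeholder
        if ph.length <= encoder_budget:
            # Affordable: schedule the whole encode, even if this chunk
            # consumes only part of the placeholder.
            encoder_budget -= ph.length
            to_encode.append(i)
            continue
        # Unaffordable: stop the chunk at the image's doorstep.
        end = max(start, ph.offset)
        break
    return end - start, to_encode
```

With two 100-token images at offsets 10 and 150, `plan_chunk(0, 50, images, set(), 100)` computes 50 tokens and schedules the full encode of image 0 while consuming only 40 of its positions. The same call with a budget of 99 computes 10 tokens and schedules nothing.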

Files

  • starter.py: plan_chunk, with the full rules in the docstring. Your work.
  • solution.py: reference (~25 lines; the thinking is in the cases).
  • test_lab.py: seven scenarios over a text/image/text/image/text prompt, from pure-text freedom to the zero-budget starvation edge.

Run

LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-03-encoder-scheduling -q
pytest phase-13-multimodal-models/labs/lab-03-encoder-scheduling -q   # reference

What the tests prove

| Test | What it pins |
| --- | --- |
| test_pure_text_chunk_is_unconstrained | Phase 3 behavior survives where no image is touched |
| test_entering_an_image_schedules_its_encoder | All-or-nothing encoding: touching 40 of 100 image positions schedules the full 100-token encode. The granularity mismatch, faced |
| test_unaffordable_image_truncates_the_chunk | The doorstep rule: chunk ends at offset, encoder runs empty. The mysterious stall, explained |
| test_cached_image_costs_nothing / test_continuation_mid_placeholder_needs_no_new_encode | The cache restores chunk freedom: mid-placeholder continuation with zero encoder budget. The design's whole payoff |
| test_budget_splits_across_two_images | Budget as a per-step ledger across images: A scheduled, B deferred, chunk truncated between them; enough budget → one-step prefill, two encodes |
| test_progress_is_always_possible | The honest edge: image-at-position-0 with zero budget yields a 0-token chunk, and progress waits for a step with budget. (Per-step budgets reset, so this is a delay, not a deadlock, but a scheduler that forgot to give encoder budget would starve VLM requests forever; the test documents the dependency) |

Hitchhiker's notes

  • Map to upstream: Scheduler._try_schedule_encoder_inputs in upstream/vllm/v1/core/sched/scheduler.py — your function with the encoder-cache space check added (the cache has finite storage; an encode can also be deferred because its output wouldn't fit), and encoder_budget flowing from scheduler_config. The encoder cache itself: vllm/v1/core/encoder_cache_manager.py — allocation, reference, and free-on-consumption; recognizably a tiny sibling of Phase 2's machinery.
  • Why encode-whole-but-consume-partial is safe: the encoder is not autoregressive — its output for an image is a pure function of pixels, independent of the text around it. That's what makes caching trivially correct (no chained hashes needed — contrast Phase 2 lab-05's ancestry chains) and what makes the all-or-nothing constraint tolerable: you never re-encode, ever, within a request.
  • Where the embeddings actually flow: encoder output → encoder cache → the model runner gathers the scheduled slice of cached embeddings each step and scatters them over lab-01's placeholder positions (get_input_embeddings). The PlaceholderRange is the shared coordinate system of all three labs — compile-time (lab-01), schedule-time (this lab), runtime (the scatter).
  • Capacity interaction worth knowing: encoder budget and token budget compete for the same wall-clock step. A VLM fleet tuned with Phase 3 lab-05's threshold analysis but ignoring encoder spikes still gets ITL spikes — from the vision tower. vLLM's disable_chunked_mm_input and encoder-budget knobs exist for exactly this tuning; you now know what they gate.
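
As a rough illustration of the gather-and-scatter step described in the notes above, the sketch below overwrites the image positions of one chunk with the matching rows of a cached encoder output. It is not vLLM's model-runner code: `merge_chunk_embeddings` and its arguments are hypothetical, and plain Python lists stand in for tensors.

```python
def merge_chunk_embeddings(text_embeds, chunk_start, ph_offset, image_embeds):
    """Return the chunk's embeddings with image positions filled from the cache.

    text_embeds:  one embedding per position in [chunk_start, chunk_start + n)
    chunk_start:  prompt position of the chunk's first token
    ph_offset:    prompt position where the placeholder begins
    image_embeds: the whole cached encoder output, one row per image position
    """
    chunk_end = chunk_start + len(text_embeds)
    # Overlap between the chunk and the placeholder, in prompt coordinates.
    lo = max(chunk_start, ph_offset)
    hi = min(chunk_end, ph_offset + len(image_embeds))
    merged = list(text_embeds)
    if lo < hi:
        # Gather the slice of cached rows this chunk consumes, then
        # scatter it over the matching positions of the chunk.
        rows = image_embeds[lo - ph_offset:hi - ph_offset]
        merged[lo - chunk_start:hi - chunk_start] = rows
    return merged
```

For a 5-row image at offset 10, a chunk covering positions 8 to 13 receives rows 0 to 3, and the next chunk, starting at position 14, picks up row 4 from the same cached output with no second encode.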

Going further

  • Add the encoder-cache space dimension: plan_chunk also receives cache_free_tokens, and an encode needs both budget and space; consumed entries free space for later steps. You've now matched upstream's full predicate — and created the three-pool admission dance (KV + encoder cache + budget) that real VLM scheduling is.
  • Simulate a step sequence: one request, the lab's two-image prompt, budget 150/step — emit the chunk plan per step until prefill completes. The trace (where chunks stall, when encodes fire, when the cache carries) is Phase 1 lab-04's probe, multimodal edition.
  • Model the ITL spike from an encode (Phase 3 lab-05's method): give encoder tokens a cost weight and plot a decode stream's step costs when a VLM prefill with a 576-token image lands beside it, with and without an encoder budget. The conclusion writes the config recommendation.
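
One possible starting point for the cache-space exercise is a toy pool measured in encoder tokens, plus a predicate that requires both budget and space. Everything here is hypothetical scaffolding for the exercise (`ToyEncoderCache`, `can_encode`, and their parameters are made-up names); it does not reproduce the upstream encoder cache manager's interface.

```python
class ToyEncoderCache:
    """Hypothetical pool of encoder outputs, sized in encoder tokens."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # key -> encoder tokens held by that output

    @property
    def free_tokens(self):
        return self.capacity - sum(self.entries.values())

    def store(self, key, num_tokens):
        # Refuse an output that would overflow the pool.
        if num_tokens > self.free_tokens:
            raise ValueError("encoder output does not fit in the cache")
        self.entries[key] = num_tokens

    def release(self, key):
        # Called once every position of the placeholder has been consumed.
        self.entries.pop(key, None)


def can_encode(num_tokens, budget_left, cache):
    """An encode is schedulable only with enough budget AND enough space."""
    return num_tokens <= budget_left and num_tokens <= cache.free_tokens
```

With a 150-token pool, a second 100-token image is deferred while the first is still stored, regardless of how much per-step budget remains, and becomes schedulable once the first entry is released.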

References

  • upstream/vllm/v1/core/sched/scheduler.py, _try_schedule_encoder_inputs: this lab, in production (with the cache-space check).
  • upstream/vllm/v1/core/encoder_cache_manager.py — the third memory pool.
  • vLLM blog, vLLM V1 — the encoder-cache design rationale: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
  • Phase 3 labs 01/02/05 — the chunking machinery this lab constrains; lab-01 — the ranges it navigates by.