Lab 13-03 — Encoder Scheduling: Chunked Prefill Meets the Vision Tower [CPU-OK]
Phase 3's chunked prefill rests on a freedom you probably never noticed it claiming:
a prompt can be sliced anywhere. Multimodal revokes it. The positions inside a
placeholder range (lab-01) get their embeddings from the vision encoder, and you
cannot encode half a picture — the ViT runs on the whole image or not at all. So when
a prefill chunk first reaches into an image's range, the engine faces a real
scheduling decision: run the encoder this step (it costs real compute, governed by a
per-step encoder budget), or truncate the chunk at the image's doorstep and
try again next step. You'll implement that decision — vLLM V1's
_try_schedule_encoder_inputs, distilled — including the piece that restores chunked
prefill's freedom: the encoder cache, which lets later chunks continue
mid-placeholder for free.
Contents
- Why this lab exists
- Background: three resources now, not two
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This is the lab where two phases collide and you get to be the referee. Phase 3
taught that chunk boundaries are arbitrary (the clamp doesn't care what token it
stops at); lab-01 taught that some positions are image positions. The collision
produces every behavior in this lab's test suite, and each one is a production
symptom with a name: a VLM request whose prefill mysteriously stalls one token before
its image (encoder budget exhausted — test_unaffordable_image_truncates_the_chunk),
a step that schedules a 100-token encode for a chunk consuming only 40 image
positions (test_entering_an_image_schedules_its_encoder — the encoder is
all-or-nothing even when the decoder is incremental), a multi-image prompt that
prefills image A this step and stops dead before image B
(test_budget_splits_across_two_images).
The design lesson is the one the course keeps circling: vLLM did not forbid chunk boundaries inside images (which would couple the text scheduler to image geometry). It added a cache between the two engines — encode once, whole; consume incrementally, cached — so each side keeps its natural granularity. When two subsystems disagree about granularity, a cache at the boundary is usually the answer; this lab is the cleanest instance you'll ever implement.
Background: three resources now, not two
Phase 3's scheduler balanced the token budget and KV memory. Multimodal adds a third ledger, with its own units and its own cache:
- Encoder budget (per step, in encoder tokens): the vision tower is real compute outside the LM's token budget; a step that encodes a 576-token image while also prefilling text is doing two models' work. Capping encoder work per step protects inter-token latency (ITL) exactly the way the token budget does (Phase 3 lab-05's argument, new actor).
- Encoder cache (in encoder tokens of storage): outputs wait here between the encode and the chunks that consume them — and entries are freed once fully consumed. It's a third memory pool alongside KV blocks and LoRA slots (Phase 11 lab-04), with the same admission-pressure character.
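The cache ledger described above can be pictured with a toy model. This is a minimal sketch, not upstream's encoder cache manager: the class name `EncoderCacheSketch`, its methods, and the string keys are all hypothetical, chosen only to show storage measured in encoder tokens and entries freed once fully consumed.

```python
class EncoderCacheSketch:
    """Toy encoder cache: capacity and entry sizes are in encoder tokens.

    Hypothetical illustration only; names and methods are assumptions.
    """

    def __init__(self, capacity_tokens):
        self.capacity = capacity_tokens
        self.entries = {}  # key -> [size_in_tokens, tokens_consumed_so_far]

    @property
    def free_tokens(self):
        # Space not currently held by a stored encoder output.
        return self.capacity - sum(size for size, _ in self.entries.values())

    def can_allocate(self, size):
        return size <= self.free_tokens

    def allocate(self, key, size):
        # Store a whole encoder output; the encode itself is all-or-nothing.
        if not self.can_allocate(size):
            raise ValueError("encoder output would not fit in the cache")
        self.entries[key] = [size, 0]

    def consume(self, key, num_tokens):
        # A prefill chunk reads part of the entry; free it once fully read.
        entry = self.entries[key]
        entry[1] += num_tokens
        if entry[1] >= entry[0]:
            del self.entries[key]


cache = EncoderCacheSketch(capacity_tokens=150)
cache.allocate("image-A", 100)   # encode once, whole
cache.consume("image-A", 40)     # first chunk reads 40 positions
cache.consume("image-A", 60)     # later chunk reads the rest; entry is freed
```

While image A's entry is still partly unread, a second 100-token image does not fit in the 150-token pool; that is the admission-pressure character the list mentions.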
The rule your plan_chunk implements, per placeholder the chunk would enter:
cached → free; affordable → schedule the whole encode now (even for partial
consumption); unaffordable → truncate the chunk to the placeholder's offset. And
the invariant the truncation preserves is Phase 3's invariant, extended: a position
is computed only when everything it needs exists — text positions need prior KV;
image positions need their encoding. Same race of counters, one more prerequisite.
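The three cases above can be sketched in a few lines. This is an illustration under stated assumptions, not the reference in solution.py: the function name `plan_chunk_sketch`, its parameters, the `Placeholder` tuple, and the return shape are hypothetical, and the docstring in starter.py remains the authority on the real signature. It also assumes placeholders are sorted by offset and that an encode costs as many encoder tokens as the placeholder is long.

```python
from typing import NamedTuple


class Placeholder(NamedTuple):
    offset: int  # first prompt position covered by the image
    length: int  # number of image positions (assumed equal to encode cost)


def plan_chunk_sketch(num_computed, chunk_len, placeholders, cached, encoder_budget):
    """Return (tokens_to_compute, indices_of_encodes_to_schedule).

    Hypothetical signature; placeholders must be sorted by offset.
    """
    end = num_computed + chunk_len
    to_encode = []
    for i, ph in enumerate(placeholders):
        ph_end = ph.offset + ph.length
        if ph_end <= num_computed or ph.offset >= end:
            continue  # the chunk does not touch this image
        if i in cached:
            continue  # cached: free, even mid-placeholder
        if ph.length <= encoder_budget:
            to_encode.append(i)          # affordable: encode the whole image
            encoder_budget -= ph.length
        else:
            end = ph.offset              # unaffordable: stop at the doorstep
            break
    return max(end - num_computed, 0), to_encode


# A text/image/text/image/text prompt: two 100-position images, 230 positions.
phs = [Placeholder(10, 100), Placeholder(120, 100)]
print(plan_chunk_sketch(0, 50, phs, set(), 100))   # (50, [0]): 40 positions touched, full encode
print(plan_chunk_sketch(0, 50, phs, set(), 50))    # (10, []): truncated before the image
print(plan_chunk_sketch(50, 50, phs, {0}, 0))      # (50, []): cached, zero budget needed
print(plan_chunk_sketch(0, 230, phs, set(), 150))  # (120, [0]): A scheduled, B deferred
```

The budget is decremented inside the loop, which is what makes it a per-step ledger shared across images rather than a per-image limit.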
Files
- starter.py: plan_chunk with the full rules in the docstring. Your work.
- solution.py: reference (~25 lines; the thinking is in the cases).
- test_lab.py: seven scenarios over a text/image/text/image/text prompt, from pure-text freedom to the zero-budget starvation edge.
Run
LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-03-encoder-scheduling -q
pytest phase-13-multimodal-models/labs/lab-03-encoder-scheduling -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_pure_text_chunk_is_unconstrained | Phase 3 behavior survives where no image is touched |
| test_entering_an_image_schedules_its_encoder | All-or-nothing encoding: touching 40 of 100 image positions schedules the full 100-token encode (the granularity mismatch, faced) |
| test_unaffordable_image_truncates_the_chunk | The doorstep rule: the chunk ends at the placeholder's offset and no encode is scheduled. The mysterious stall, explained |
| test_cached_image_costs_nothing / test_continuation_mid_placeholder_needs_no_new_encode | The cache restores chunk freedom: mid-placeholder continuation with zero encoder budget, the design's whole payoff |
| test_budget_splits_across_two_images | Budget as a per-step ledger across images: A scheduled, B deferred, chunk truncated between them; enough budget → one-step prefill, two encodes |
| test_progress_is_always_possible | The honest edge: image-at-position-0 with zero budget yields a 0-token chunk, so progress waits for a step with budget. (Per-step budgets reset, so this is a delay, not a deadlock; but a scheduler that never granted encoder budget would starve VLM requests forever. The test documents the dependency.) |
Hitchhiker's notes
- Map to upstream: Scheduler._try_schedule_encoder_inputs in upstream/vllm/v1/core/sched/scheduler.py is your function with the encoder-cache space check added (the cache has finite storage; an encode can also be deferred because its output wouldn't fit), and with encoder_budget flowing from scheduler_config. The encoder cache itself lives in vllm/v1/core/encoder_cache_manager.py: allocation, reference, and free-on-consumption; recognizably a tiny sibling of Phase 2's machinery.
- Why encode-whole-but-consume-partial is safe: the encoder is not autoregressive; its output for an image is a pure function of pixels, independent of the text around it. That's what makes caching trivially correct (no chained hashes needed; contrast Phase 2 lab-05's ancestry chains) and what makes the all-or-nothing constraint tolerable: you never re-encode, ever, within a request.
- Where the embeddings actually flow: encoder output → encoder cache → the model runner gathers the scheduled slice of cached embeddings each step and scatters them over lab-01's placeholder positions (get_input_embeddings). The PlaceholderRange is the shared coordinate system of all three labs: compile-time (lab-01), schedule-time (this lab), runtime (the scatter).
- Capacity interaction worth knowing: encoder budget and token budget compete for the same wall-clock step. A VLM fleet tuned with Phase 3 lab-05's threshold analysis but ignoring encoder spikes still gets ITL spikes, this time from the vision tower. vLLM's disable_chunked_mm_input and encoder-budget knobs exist for exactly this tuning; you now know what they gate.
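The gather-and-scatter step in the notes above comes down to index arithmetic between two coordinate systems: prompt positions and positions inside one image's encoder output. The following is a minimal sketch with plain Python lists standing in for tensors; `encoder_slice` and `scatter_chunk` are hypothetical helper names, not upstream functions, and a single placeholder is assumed for brevity.

```python
def encoder_slice(chunk_start, chunk_end, ph_offset, ph_length):
    """Slice [lo, hi) of an image's encoder output consumed by a chunk.

    Returns None when the chunk [chunk_start, chunk_end) misses the image.
    """
    lo = max(chunk_start, ph_offset)
    hi = min(chunk_end, ph_offset + ph_length)
    if lo >= hi:
        return None
    # Convert prompt positions to positions inside the encoder output.
    return lo - ph_offset, hi - ph_offset


def scatter_chunk(text_embeds, image_embeds, chunk_start, chunk_end, ph_offset, ph_length):
    """Input embeddings for one chunk, with cached image rows scattered in."""
    out = list(text_embeds[chunk_start:chunk_end])
    sl = encoder_slice(chunk_start, chunk_end, ph_offset, ph_length)
    if sl is not None:
        lo, hi = sl
        dst = ph_offset + lo - chunk_start  # where the slice lands in the chunk
        out[dst:dst + (hi - lo)] = image_embeds[lo:hi]
    return out


# Eight prompt positions; positions 2..5 are one image's placeholder range.
text = [f"t{i}" for i in range(8)]
image = ["i0", "i1", "i2", "i3"]  # stand-in for the cached encoder output
print(scatter_chunk(text, image, 0, 4, 2, 4))  # ['t0', 't1', 'i0', 'i1']
print(scatter_chunk(text, image, 4, 8, 2, 4))  # ['i2', 'i3', 't6', 't7']
```

The second call is the mid-placeholder continuation: it reads rows 2 and 3 of an output that was produced one step earlier, which is why no new encode is needed.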
Going further
- Add the encoder-cache space dimension: plan_chunk also receives cache_free_tokens, and an encode needs both budget and space; consumed entries free space for later steps. You've now matched upstream's full predicate, and created the three-pool admission dance (KV + encoder cache + budget) that real VLM scheduling is.
- Simulate a step sequence: one request, the lab's two-image prompt, budget 150/step; emit the chunk plan per step until prefill completes. The trace (where chunks stall, when encodes fire, when the cache carries) is Phase 1 lab-04's probe, multimodal edition.
- Model the ITL spike from an encode (Phase 3 lab-05's method): give encoder tokens a cost weight and plot a decode stream's step costs when a VLM prefill with a 576-token image lands beside it, with and without an encoder budget. The conclusion writes the config recommendation.
References
- upstream/vllm/v1/core/sched/scheduler.py: _try_schedule_encoder_inputs is this lab, in production (with the cache-space check).
- upstream/vllm/v1/core/encoder_cache_manager.py: the third memory pool.
- vLLM blog, vLLM V1 (the encoder-cache design rationale): https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
- Phase 3 labs 01/02/05 — the chunking machinery this lab constrains; lab-01 — the ranges it navigates by.