Lab 13-03 — Encoder Scheduling: Chunked Prefill Meets the Vision Tower `[CPU-OK]`

Phase 3's chunked prefill rests on a freedom you probably never noticed it claiming: a prompt can be sliced anywhere. Multimodal revokes it. The positions inside a placeholder range (lab-01) get their embeddings from the vision encoder, and you cannot encode half a picture — the ViT runs on the whole image or not at all. So when a prefill chunk first reaches into an image's range, the engine faces a real scheduling decision: run the encoder this step (it costs real compute, governed by a per-step encoder budget), or truncate the chunk at the image's doorstep and try again next step. You'll implement that decision — vLLM V1's _try_schedule_encoder_inputs, distilled — including the piece that restores chunked prefill's freedom: the encoder cache, which lets later chunks continue mid-placeholder for free.

Why this lab exists
Background: three resources now, not two
Files
Run
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

This is the lab where two phases collide and you get to be the referee. Phase 3 taught that chunk boundaries are arbitrary (the clamp doesn't care what token it stops at); lab-01 taught that some positions are image positions. The collision produces every behavior in this lab's test suite, and each one is a production symptom with a name: a VLM request whose prefill mysteriously stalls at the last token before its image because the encoder budget is exhausted (test_unaffordable_image_truncates_the_chunk); a step that schedules a 100-token encode for a chunk consuming only 40 image positions, because the encoder is all-or-nothing even when the decoder is incremental (test_entering_an_image_schedules_its_encoder); and a multi-image prompt that prefills image A this step and stops dead before image B (test_budget_splits_across_two_images).

The design lesson is the one the course keeps circling: vLLM did not forbid chunk boundaries inside images (which would couple the text scheduler to image geometry). It added a cache between the two engines — encode once, whole; consume incrementally, cached — so each side keeps its natural granularity. When two subsystems disagree about granularity, a cache at the boundary is usually the answer; this lab is the cleanest instance you'll ever implement.

Background: three resources now, not two

Phase 3's scheduler balanced the token budget and KV memory. Multimodal adds a third ledger, with its own units and its own cache:

Encoder budget (per step, in encoder tokens): the vision tower is real compute outside the LM's token budget — a step that encodes a 576-token image while also prefilling text is doing two models' work. Capping encoder work per step protects ITL exactly the way the token budget does (Phase 3 lab-05's argument, new actor).
Encoder cache (in encoder tokens of storage): outputs wait here between the encode and the chunks that consume them — and entries are freed once fully consumed. It's a third memory pool alongside KV blocks and LoRA slots (Phase 11 lab-04), with the same admission-pressure character.

The rule your plan_chunk implements, per placeholder the chunk would enter: cached → free; affordable → schedule the whole encode now (even for partial consumption); unaffordable → truncate the chunk to the placeholder's offset. And the invariant the truncation preserves is Phase 3's invariant, extended: a position is computed only when everything it needs exists — text positions need prior KV; image positions need their encoding. Same race of counters, one more prerequisite.
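The three cases can be sketched in a few lines. This is a minimal illustration, not the lab's reference solution: the name plan_chunk_sketch, its argument list, the Placeholder dataclass, and the example prompt layout (50 text, 100 image, 50 text, 100 image, 50 text) are all assumptions made for the example; starter.py's docstring defines the real contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placeholder:
    """Hypothetical stand-in for lab-01's PlaceholderRange."""
    offset: int  # first prompt position the image occupies
    length: int  # number of positions, equal to the image's encoder tokens


def plan_chunk_sketch(num_computed, token_budget, prompt_len,
                      placeholders, cached, encoder_budget):
    """Return (chunk_len, indices of placeholders to encode this step).

    Assumes `placeholders` is sorted by offset and `cached` is a set of
    placeholder indices whose encoder output already exists.
    """
    start = num_computed
    end = min(prompt_len, start + token_budget)  # Phase 3's clamp
    to_encode = []
    for i, ph in enumerate(placeholders):
        if ph.offset + ph.length <= start or ph.offset >= end:
            continue  # the chunk does not touch this image
        if i in cached:
            continue  # cached: consuming it costs no encoder budget
        if ph.length <= encoder_budget:
            # Affordable: schedule the whole encode, even if the chunk
            # only consumes part of the placeholder range.
            to_encode.append(i)
            encoder_budget -= ph.length
        else:
            # Unaffordable: truncate the chunk at the image's doorstep.
            end = max(start, ph.offset)
            break
    return end - start, to_encode


# Assumed layout: text[0,50) image A[50,150) text[150,200) image B[200,300) text[300,350)
PHS = [Placeholder(50, 100), Placeholder(200, 100)]

print(plan_chunk_sketch(0, 90, 350, PHS, set(), 100))  # (90, [0])
print(plan_chunk_sketch(0, 90, 350, PHS, set(), 99))   # (50, [])
print(plan_chunk_sketch(90, 40, 350, PHS, {0}, 0))     # (40, [])
```

The first call touches 40 of image A's 100 positions and still schedules the full encode; the second truncates at offset 50; the third continues mid-placeholder with zero encoder budget because A is cached.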

Files

starter.py — plan_chunk with the full rules in the docstring. Your work.
solution.py — reference (~25 lines; the thinking is in the cases).
test_lab.py — seven scenarios over a text/image/text/image/text prompt, from pure-text freedom to the zero-budget starvation edge.

Run

LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-03-encoder-scheduling -q
pytest phase-13-multimodal-models/labs/lab-03-encoder-scheduling -q   # reference

What the tests prove

| Test | What it pins |
| --- | --- |
| `test_pure_text_chunk_is_unconstrained` | Phase 3 behavior survives where no image is touched |
| `test_entering_an_image_schedules_its_encoder` | All-or-nothing encoding: touching 40 of 100 image positions schedules the full 100-token encode (the granularity mismatch, faced) |
| `test_unaffordable_image_truncates_the_chunk` | The doorstep rule: the chunk ends at `offset` and no encode is scheduled. The mysterious stall, explained |
| `test_cached_image_costs_nothing` / `test_continuation_mid_placeholder_needs_no_new_encode` | The cache restores chunk freedom: mid-placeholder continuation with zero encoder budget (the design's whole payoff) |
| `test_budget_splits_across_two_images` | Budget as a per-step ledger across images: A scheduled, B deferred, chunk truncated between them; with enough budget, a one-step prefill with two encodes |
| `test_progress_is_always_possible` | The honest edge: image-at-position-0 with zero budget yields a 0-token chunk, so progress waits for a step with budget. (Per-step budgets reset, so this is a delay, not a deadlock; but a scheduler that forgot to give encoder budget would starve VLM requests forever, and the test documents the dependency) |

Hitchhiker's notes

Map to upstream: Scheduler._try_schedule_encoder_inputs in upstream/vllm/v1/core/sched/scheduler.py is your function with the encoder-cache space check added (the cache has finite storage, so an encode can also be deferred because its output wouldn't fit) and with encoder_budget flowing from scheduler_config. The encoder cache itself lives in upstream/vllm/v1/core/encoder_cache_manager.py: allocation, reference, and free-on-consumption; recognizably a tiny sibling of Phase 2's machinery.
Why encode-whole-but-consume-partial is safe: the encoder is not autoregressive — its output for an image is a pure function of pixels, independent of the text around it. That's what makes caching trivially correct (no chained hashes needed — contrast Phase 2 lab-05's ancestry chains) and what makes the all-or-nothing constraint tolerable: you never re-encode, ever, within a request.
Where the embeddings actually flow: encoder output → encoder cache → the model runner gathers the scheduled slice of cached embeddings each step and scatters them over lab-01's placeholder positions (get_input_embeddings). The PlaceholderRange is the shared coordinate system of all three labs — compile-time (lab-01), schedule-time (this lab), runtime (the scatter).
Capacity interaction worth knowing: encoder budget and token budget compete for the same wall-clock step. A VLM fleet tuned with Phase 3 lab-05's threshold analysis but ignoring encoder spikes still gets ITL spikes — from the vision tower. vLLM's disable_chunked_mm_input and encoder-budget knobs exist for exactly this tuning; you now know what they gate.
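The gather-and-scatter step from the embeddings note can be sketched without any tensor library. Everything here is illustrative: consumed_slice and scatter_into_chunk are hypothetical helper names, and rows are plain strings standing in for embedding vectors; the upstream model runner works on tensors and is organized differently.

```python
def consumed_slice(chunk_start, chunk_end, ph_offset, ph_length):
    """Return (lo, hi) indices into one image's cached embedding rows that
    the chunk [chunk_start, chunk_end) consumes, or None if no overlap."""
    lo = max(chunk_start, ph_offset)
    hi = min(chunk_end, ph_offset + ph_length)
    if lo >= hi:
        return None
    # Re-base prompt positions to positions inside the placeholder.
    return lo - ph_offset, hi - ph_offset


def scatter_into_chunk(chunk_start, chunk_end, text_rows, ph_offset, image_rows):
    """Overwrite the chunk's placeholder positions with cached image rows.

    `text_rows` holds one row per chunk position; `image_rows` is the full
    cached encoder output for one image (encoded once, whole).
    """
    rows = list(text_rows)
    span = consumed_slice(chunk_start, chunk_end, ph_offset, len(image_rows))
    if span is not None:
        lo, hi = span
        dst = ph_offset + lo - chunk_start  # where the slice lands in the chunk
        rows[dst:dst + (hi - lo)] = image_rows[lo:hi]
    return rows


image = [f"i{k}" for k in range(100)]          # cached output for a 100-token image
chunk = ["t48", "t49", "t50", "t51", "t52"]    # chunk covering prompt positions [48, 53)
print(scatter_into_chunk(48, 53, chunk, 50, image))
# ['t48', 't49', 'i0', 'i1', 'i2']
```

The same placeholder offset and length drive both the scheduler's truncation decision and this runtime slice, which is the sense in which the range acts as a shared coordinate system.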

Going further

Add the encoder-cache space dimension: plan_chunk also receives cache_free_tokens, and an encode needs both budget and space; consumed entries free space for later steps. You've now matched upstream's full predicate — and created the three-pool admission dance (KV + encoder cache + budget) that real VLM scheduling is.
Simulate a step sequence: one request, the lab's two-image prompt, budget 150/step — emit the chunk plan per step until prefill completes. The trace (where chunks stall, when encodes fire, when the cache carries) is Phase 1 lab-04's probe, multimodal edition.
Model the ITL spike from an encode (Phase 3 lab-05's method): give encoder tokens a cost weight and plot a decode stream's step costs when a VLM prefill with a 576-token image lands beside it, with and without an encoder budget. The conclusion writes the config recommendation.
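For the first extension, the two-condition predicate (budget and space) plus free-on-consumption can be prototyped as a small ledger. This is a sketch under assumptions: EncoderCacheSketch and its method names are invented for illustration and are not the API of upstream's encoder cache manager.

```python
class EncoderCacheSketch:
    """Toy ledger for encoder-cache storage, measured in encoder tokens."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # image id -> encoder tokens held

    @property
    def free_tokens(self):
        return self.capacity - sum(self.entries.values())

    def can_encode(self, num_tokens, encoder_budget):
        # An encode needs both compute budget this step and storage space.
        return num_tokens <= encoder_budget and num_tokens <= self.free_tokens

    def store(self, image_id, num_tokens):
        if num_tokens > self.free_tokens:
            raise ValueError("encoder output would not fit in the cache")
        self.entries[image_id] = num_tokens

    def release(self, image_id):
        # Called once every position of the image has been consumed.
        self.entries.pop(image_id, None)


cache = EncoderCacheSketch(capacity=150)
print(cache.can_encode(100, encoder_budget=100))  # True
cache.store("A", 100)
print(cache.can_encode(100, encoder_budget=200))  # False: only 50 tokens of space left
cache.release("A")
print(cache.can_encode(100, encoder_budget=100))  # True again
```

With this in place, a deferred encode can have two distinct causes, and a trace that records which one fired makes the three-pool interaction much easier to read.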

References

upstream/vllm/v1/core/sched/scheduler.py — _try_schedule_encoder_inputs: this lab, in production (with the cache-space check).
upstream/vllm/v1/core/encoder_cache_manager.py — the third memory pool.
vLLM blog, vLLM V1 — the encoder-cache design rationale: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
Phase 3 labs 01/02/05 — the chunking machinery this lab constrains; lab-01 — the ranges it navigates by.

vLLM Mastery — From Zero to Maintainer