Phase 03 Labs — Continuous Batching & the Scheduler
Six labs around the engine's brain. The arc: build the scheduling loop (lab-01), prove chunked prefill safe (lab-02), measure why it exists (lab-05), survive memory pressure with preemption (lab-04), then account for prefix caching exactly (lab-06) and on real hardware (lab-03).
Recommended order: 01 → 02 → 05 → 04 → 06 → 03. Directory numbers predate labs 05–06, so
the reading order differs from the numbering: mechanism, then safety, then motive, then
the emergency path, then the cache economics.
CPU labs follow the standard contract: starter.py (your work), solution.py
(reference), test_lab.py (the spec). By default the tests run the solution;
LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-03-continuous-batching-scheduler/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-01-scheduler-step -q
Contents
- lab-01-scheduler-step [CPU-OK]
- lab-02-chunked-prefill [CPU-OK]
- lab-03-prefix-cache-hitrate [GPU-OPT]
- lab-04-preemption [CPU-OK]
- lab-05-decode-latency-spikes [CPU-OK]
- lab-06-prefix-cache-savings [CPU-OK]
- What you can do after this phase
Labs
lab-01-scheduler-step [CPU-OK]
Implement the two-phase loop at the heart of continuous batching: serve RUNNING first
(decode-first is a policy, and iteration order is the policy), then admit WAITING — all
under three independent scarcities (token budget, sequence slots, KV memory) enforced at
three different points. Your 30 lines are, shape for shape, the core of
scheduler.py:329 upstream. Skills: budget/slot/memory enforcement; running-first;
head-of-line blocking as a fairness choice; one code path for prefill and decode.
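To make the shape of that loop concrete, here is a minimal sketch under stated assumptions. It is not the lab's solution.py and not the upstream scheduler; `ToyScheduler`, `Request`, `advance`, and every field name are hypothetical. Where a real scheduler would preempt on a memory shortfall, this sketch simply declines to schedule the request.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    num_tokens: int        # tokens that exist so far (prompt + generated)
    num_computed: int = 0  # tokens whose KV entries are already computed
    blocks: int = 0        # KV blocks currently held


class ToyScheduler:
    """Hypothetical toy: running-first, then admit waiting, under three limits."""

    def __init__(self, token_budget, max_seqs, num_blocks, block_size):
        self.token_budget = token_budget
        self.max_seqs = max_seqs
        self.free_blocks = num_blocks
        self.block_size = block_size
        self.running = []
        self.waiting = deque()

    def _try_schedule(self, req, budget):
        # Scarcity 1 (token budget): clip the request to what is left this step.
        n = min(req.num_tokens - req.num_computed, budget)
        if n <= 0:
            return 0
        # Scarcity 3 (KV memory): blocks needed to hold computed + new tokens.
        total_blocks = -(-(req.num_computed + n) // self.block_size)  # ceil
        need = total_blocks - req.blocks
        if need > self.free_blocks:
            return 0  # assumption: no preemption in this sketch
        self.free_blocks -= need
        req.blocks = total_blocks
        return n

    def step(self):
        budget = self.token_budget
        scheduled = {}
        # Phase 1: RUNNING first. Prefill chunks and decode share this path.
        for req in self.running:
            if budget == 0:
                break
            n = self._try_schedule(req, budget)
            if n:
                scheduled[req.rid] = n
                budget -= n
        # Phase 2: admit WAITING in FIFO order.
        # Scarcity 2 (sequence slots) is enforced in the loop condition.
        while self.waiting and budget > 0 and len(self.running) < self.max_seqs:
            head = self.waiting[0]
            n = self._try_schedule(head, budget)
            if n == 0:
                break  # head-of-line blocking: never skip past the head
            self.waiting.popleft()
            self.running.append(head)
            scheduled[head.rid] = n
            budget -= n
        return scheduled


def advance(sched, scheduled):
    """Pretend the model ran: mark tokens computed, sample one when caught up."""
    for req in sched.running:
        req.num_computed += scheduled.get(req.rid, 0)
        if req.num_computed == req.num_tokens:
            req.num_tokens += 1
```

With a budget of 8 tokens, 2 slots, and 4 blocks of 4 tokens, three waiting requests of 6, 6, and 3 tokens schedule as 6 + 2 in the first step (the second prompt is chunked by the budget) and 1 + 4 in the second (a decode token and a prefill chunk in the same batch), while the third request waits for a slot.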
lab-02-chunked-prefill [CPU-OK]
Prove the engine's most important safety property — chunking changes when tokens are
computed, never what tokens come out — by running the same deterministic workload under
both schedules and diffing token ids. Plus the timing side: predict prefill steps with
ceil(prompt/chunk) and know every boundary case. Skills: the causality + sampling-guard
argument; output-invariance as a CI-enforceable equality; the chunk-size trade-off.
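The ceil(prompt/chunk) prediction and its boundary cases can be checked in a few lines. This is a sketch, assuming the prompt has the whole chunk to itself on every step (no competing requests eating the budget); `prefill_steps` and `chunk_sizes` are hypothetical helper names, not part of the lab.

```python
def prefill_steps(prompt_len: int, chunk: int) -> int:
    """Steps needed to prefill prompt_len tokens at most `chunk` at a time."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    # Integer ceiling division avoids float rounding on large prompts.
    return -(-prompt_len // chunk)


def chunk_sizes(prompt_len: int, chunk: int) -> list[int]:
    """The token count of each prefill step, in order."""
    steps = prefill_steps(prompt_len, chunk)
    return [min(chunk, prompt_len - i * chunk) for i in range(steps)]
```

The boundaries worth knowing: an empty prompt takes 0 steps, a prompt of exactly `chunk` tokens takes 1, and `chunk + 1` tokens takes 2, with a final step of a single token.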
lab-03-prefix-cache-hitrate [GPU-OPT]
Run the real engine on the canonical workload (long shared system prompt, unique tails) with prefix caching off and on, and read three independent meters that must agree: hit rate (0% → 93.7%), prompt throughput (4–5×), KV usage (~1× the prefix). Annotated capture included for the GPU-less; lab-06 is the exact-arithmetic CPU twin. Skills: constructing sharing-known workloads; reading hit-rate denominators; when caching buys nothing.
lab-04-preemption [CPU-OK]
Force the scheduler's emergency path in a pool where two requests cannot both fit: watch it evict the most-recent admission, let the survivor finish, then replay the victim — and prove the final outputs identical to a roomy pool's. Recovery is just prefill: the two-counters model makes eviction, chunking, and cache hits one code path. Skills: the allocate-or-preempt dance; victim policy as forward-progress argument; the deadlock invariant; pairing "survives Y" tests with "Y actually happened" probes.
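The allocate-or-preempt dance can be sketched as below. This is an illustration under assumptions, not the lab's or the upstream implementation: `Req`, `allocate_or_preempt`, and the field names are hypothetical, and the sketch assumes that `running` is kept in admission order so its last element is the most recent admission.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    rid: str
    num_tokens: int        # tokens that exist so far (prompt + generated)
    num_computed: int = 0  # tokens whose KV entries are already computed
    blocks: int = 0        # KV blocks currently held


def allocate_or_preempt(req, need, running, waiting, free_blocks):
    """Try to give `req` `need` more blocks, evicting newest admissions if short.

    Returns (ok, free_blocks). ok is False when req itself was the victim.
    """
    while need > free_blocks:
        victim = running.pop()      # most recent admission goes first
        free_blocks += victim.blocks
        victim.blocks = 0
        victim.num_computed = 0     # recovery is just prefill: replay everything
        waiting.appendleft(victim)  # front of the queue, so it resumes first
        if victim is req:
            return False, free_blocks
    req.blocks += need
    return True, free_blocks - need
```

Because the victim keeps `num_tokens` (including anything it already generated) and only loses `num_computed`, replaying it is the same gap-closing work as an ordinary chunked prefill. Evicting the newest admission means the oldest request always keeps its memory, which is the forward-progress argument; it holds only if the pool is large enough for a single request on its own.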
lab-05-decode-latency-spikes [CPU-OK]
The motive for chunked prefill, measured: a decode stream's per-step cost profile when a
256-token prompt lands: [257, 2, 1, 1, ...] unchunked vs [33 ×8, 2, 1, ...] at
threshold 32. Same total work, radically different tail latency; nothing free — the spike
spreads into the long prompt's TTFT. Skills: per-victim latency measurement; p99 vs mean;
the threshold/budget dial; why aggregate meters hide interference.
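The two profiles above can be reproduced with a small cost model. This is a sketch under assumptions, not the lab's measurement code: one decode stream (the victim) contributes 1 token per step, the long prompt contributes one chunk per step until prefilled and then decodes for one further step, and cost is counted in tokens per step. `step_costs` and its parameters are hypothetical names.

```python
def step_costs(prompt_len, chunk, num_steps, intruder_decode_steps=1):
    """Tokens processed in each of num_steps steps, as seen by the victim."""
    costs = []
    remaining = prompt_len
    while remaining > 0:
        take = min(chunk, remaining)
        costs.append(take + 1)  # prefill chunk + the victim's decode token
        remaining -= take
    costs += [2] * intruder_decode_steps  # both requests decoding
    if len(costs) > num_steps:
        raise ValueError("num_steps is too small to cover the prefill")
    costs += [1] * (num_steps - len(costs))  # victim alone again
    return costs


unchunked = step_costs(256, chunk=256, num_steps=11)
chunked = step_costs(256, chunk=32, num_steps=11)

# Same total work over the same window, very different worst step.
total_unchunked, worst_unchunked = sum(unchunked), max(unchunked)
total_chunked, worst_chunked = sum(chunked), max(chunked)
```

Over the same 11-step window both schedules process the same number of tokens, so the mean is identical; only the maximum (257 vs 33) separates them, which is why an aggregate meter hides the interference. The price shows up elsewhere: the long prompt's first token arrives after 8 steps instead of 1.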
lab-06-prefix-cache-savings [CPU-OK]
Account for prefix caching to the exact token: 544 scheduled tokens uncached vs 96 cached, savings ≡ (N−1) × shared full blocks = 448, outputs bit-identical, and a share-nothing control arm that saves almost nothing. Includes the one-token prefill that immediately samples — three phases of rules colliding in a single scheduled token. Skills: the compute odometer; predicting cache value with integer arithmetic; eager caching at allocation time; validating noisy GPU meters against an exact model.
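The integer arithmetic behind those totals can be sketched as follows. The parameters are an assumption: 8 requests, 68-token prompts, a 64-token shared prefix, and 16-token blocks is one hypothetical set that happens to reproduce 544, 96, and 448, and the lab's actual workload may differ. The model counts prompt tokens only, assumes every request after the first finds the shared blocks already cached, and `scheduled_prompt_tokens` is a hypothetical name.

```python
def scheduled_prompt_tokens(n_requests, prompt_len, shared_len, block_size, cached):
    """Prompt tokens the scheduler must compute across all requests."""
    if not 0 <= shared_len < prompt_len:
        raise ValueError("sketch assumes a unique tail after the shared prefix")
    if not cached:
        return n_requests * prompt_len
    # Only whole blocks can be reused; a partial trailing block is recomputed.
    shared_full = (shared_len // block_size) * block_size
    # The first request pays in full; the rest skip the shared full blocks.
    return prompt_len + (n_requests - 1) * (prompt_len - shared_full)


# Hypothetical workload chosen to match the totals quoted above.
N, PROMPT, SHARED, BLOCK = 8, 68, 64, 16
uncached = scheduled_prompt_tokens(N, PROMPT, SHARED, BLOCK, cached=False)
cached = scheduled_prompt_tokens(N, PROMPT, SHARED, BLOCK, cached=True)
savings = uncached - cached

# Control arm: nothing shared, so caching has nothing to reuse.
control_savings = (
    scheduled_prompt_tokens(N, PROMPT, 0, BLOCK, cached=False)
    - scheduled_prompt_tokens(N, PROMPT, 0, BLOCK, cached=True)
)
```

In this idealized model the control arm saves exactly zero, and a shared prefix shorter than one block also saves zero, since no full block is shared. Because the prediction is pure integer arithmetic, it can serve as the exact reference when checking a noisy GPU hit-rate meter.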
What you can do after this phase
Implement and modify vLLM's scheduling policy with the confidence of someone who has built
the loop, proven its invariants, and measured its trade-offs: explain why chunked prefill
is default-on (and what threshold to set, from data); predict prefix-cache savings for any
workload before enabling it; diagnose a preemption storm from the metrics and name the
right knob; and read vllm/v1/core/sched/scheduler.py end to end as a peer. Combined with
Phase 2, you now hold the complete control plane — Phase 4 descends into the kernels it
commands.