Phase 03 — Mini-Build: the continuous-batching scheduler
You'll build the scheduler that drives mini_vllm. The reference is already in the repo —
mini_vllm/scheduler.py — but write it yourself first against lab-01's stub + tests, then
diff.
This phase's mini-build depends on Phase 02's KV manager (mini_vllm/kv_cache.py), because the
scheduler's whole interaction with memory is allocate_slots / get_computed_blocks / free.
That dependency is the point: scheduling and paging are two halves of one machine.
Contents
The build, in order
1. SchedulerOutput
A small dataclass: num_scheduled_tokens: dict[str,int], scheduled_requests: list[Request],
preempted_req_ids: list[str], and a total_num_scheduled_tokens property. (Real:
output.py:181, much richer.)
2. Scheduler.__init__
Hold the KVCacheManager, max_num_seqs, max_num_batched_tokens,
long_prefill_token_threshold, and two queues: waiting: deque[Request], running: list[Request].
3. _clamp_new_tokens(num_new_tokens, token_budget)
Apply the long-prefill chunk cap (if 0 < threshold < num_new) then min(num_new, budget),
floored at 0. This single helper is chunked prefill. (Real: scheduler.py:390–392.)
4. schedule() — the two-phase loop
- Phase A (running): for each running request,
n = _clamp_new_tokens(num_tokens − num_computed, budget);allocate_slots; onNone, pop a victim fromrunning,_preemptit, record it, retry; commit andbudget -= n. - Phase B (waiting): while
waiting and budget>0 and len(running)<max_num_seqs and not preempted: peek the front request;get_computed_blocks→ setnum_computed_tokens;n = _clamp_new_tokens(...);allocate_slots(req, n, computed_blocks); onNonebreak; else move waiting→running, commit.
5. _preempt(request)
kv.free(request); num_computed_tokens = 0; num_preemptions += 1; status PREEMPTED;
waiting.appendleft(request) (re-admit ASAP).
6. update_from_output(output, sampled)
For each scheduled request: num_computed_tokens += n; if it sampled, append the token and
maybe_finish(). Reap finished requests: remove from running, kv.free.
7. needs_sample(request, num_scheduled) (static)
return request.num_computed_tokens + num_scheduled == request.num_tokens. Only fully-caught-up
requests emit a token (mid-prefill chunks don't).
Definition of done
pytest mini_vllm/test_scheduler.py -q # the reference suite (token budget, chunking,
# preemption, prefix head-start)
pytest phase-03-continuous-batching-scheduler/labs -q
Then run the full engine and confirm the scheduler's correctness invariants hold end to end:
pytest mini_vllm/test_engine.py -q
# test_chunked_prefill_matches_unchunked_output and
# test_prefix_caching_matches_no_caching_output PROVE that these optimizations change
# *timing/memory*, never *output*. That property is the whole game.
Stretch (sets up later phases)
- PRIORITY policy — add a
priorityfield toRequestand a PRIORITY branch to the preemption victim selection (maxby(priority, arrival_time)), mirroringscheduler.py:456. - Swapping vs recompute — instead of resetting
num_computed_tokens=0on preempt, "swap" the blocks to a CPU pool and restore them on re-admit. Compare the cost model (recompute is compute, swap is bandwidth) — a real vLLM design axis. - Stats — count preemptions, average batch size, and KV usage per step; you'll need these in Phase 18.