Phase 03 — Mini-Build: the continuous-batching scheduler

You'll build the scheduler that drives mini_vllm. The reference is already in the repo — mini_vllm/scheduler.py — but write it yourself first against lab-01's stub + tests, then diff.

This phase's mini-build depends on Phase 02's KV manager (mini_vllm/kv_cache.py), because the scheduler's whole interaction with memory is allocate_slots / get_computed_blocks / free. That dependency is the point: scheduling and paging are two halves of one machine.

The build, in order
Definition of done
Stretch (sets up later phases)

The build, in order

1. `SchedulerOutput`

A small dataclass: num_scheduled_tokens: dict[str,int], scheduled_requests: list[Request], preempted_req_ids: list[str], and a total_num_scheduled_tokens property. (Real: output.py:181, much richer.)

2. `Scheduler.init`

Hold the KVCacheManager, max_num_seqs, max_num_batched_tokens, long_prefill_token_threshold, and two queues: waiting: deque[Request], running: list[Request].

3. `_clamp_new_tokens(num_new_tokens, token_budget)`

Apply the long-prefill chunk cap (if 0 < threshold < num_new) then min(num_new, budget), floored at 0. This single helper is chunked prefill. (Real: scheduler.py:390–392.)

4. `schedule()` — the two-phase loop

Phase A (running): for each running request, n = _clamp_new_tokens(num_tokens − num_computed, budget); allocate_slots; on None, pop a victim from running, _preempt it, record it, retry; commit and budget -= n.
Phase B (waiting): while waiting and budget>0 and len(running)<max_num_seqs and not preempted: peek the front request; get_computed_blocks → set num_computed_tokens; n = _clamp_new_tokens(...); allocate_slots(req, n, computed_blocks); on None break; else move waiting→running, commit.

pytest mini_vllm/test_scheduler.py -q          # the reference suite (token budget, chunking,
                                               # preemption, prefix head-start)
pytest phase-03-continuous-batching-scheduler/labs -q

Then run the full engine and confirm the scheduler's correctness invariants hold end to end:

pytest mini_vllm/test_engine.py -q
# test_chunked_prefill_matches_unchunked_output and
# test_prefix_caching_matches_no_caching_output PROVE that these optimizations change
# *timing/memory*, never *output*. That property is the whole game.

Stretch (sets up later phases)

PRIORITY policy — add a priority field to Request and a PRIORITY branch to the preemption victim selection (max by (priority, arrival_time)), mirroring scheduler.py:456.
Swapping vs recompute — instead of resetting num_computed_tokens=0 on preempt, "swap" the blocks to a CPU pool and restore them on re-admit. Compare the cost model (recompute is compute, swap is bandwidth) — a real vLLM design axis.
Stats — count preemptions, average batch size, and KV usage per step; you'll need these in Phase 18.

vLLM Mastery — From Zero to Maintainer

Phase 03 — Mini-Build: the continuous-batching scheduler

Contents

The build, in order

1. `SchedulerOutput`

2. `Scheduler.init`

3. `_clamp_new_tokens(num_new_tokens, token_budget)`

4. `schedule()` — the two-phase loop

5. `_preempt(request)`

6. `update_from_output(output, sampled)`

7. `needs_sample(request, num_scheduled)` (static)

Definition of done

Stretch (sets up later phases)

Keyboard shortcuts

vLLM Mastery — From Zero to Maintainer