Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 03 — Mini-Build: the continuous-batching scheduler

You'll build the scheduler that drives mini_vllm. The reference is already in the repo — mini_vllm/scheduler.py — but write it yourself first against lab-01's stub + tests, then diff.

This phase's mini-build depends on Phase 02's KV manager (mini_vllm/kv_cache.py), because the scheduler's whole interaction with memory is allocate_slots / get_computed_blocks / free. That dependency is the point: scheduling and paging are two halves of one machine.

Contents


The build, in order

1. SchedulerOutput

A small dataclass: num_scheduled_tokens: dict[str,int], scheduled_requests: list[Request], preempted_req_ids: list[str], and a total_num_scheduled_tokens property. (Real: output.py:181, much richer.)

2. Scheduler.__init__

Hold the KVCacheManager, max_num_seqs, max_num_batched_tokens, long_prefill_token_threshold, and two queues: waiting: deque[Request], running: list[Request].

3. _clamp_new_tokens(num_new_tokens, token_budget)

Apply the long-prefill chunk cap (if 0 < threshold < num_new) then min(num_new, budget), floored at 0. This single helper is chunked prefill. (Real: scheduler.py:390–392.)

4. schedule() — the two-phase loop

  • Phase A (running): for each running request, n = _clamp_new_tokens(num_tokens − num_computed, budget); allocate_slots; on None, pop a victim from running, _preempt it, record it, retry; commit and budget -= n.
  • Phase B (waiting): while waiting and budget>0 and len(running)<max_num_seqs and not preempted: peek the front request; get_computed_blocks → set num_computed_tokens; n = _clamp_new_tokens(...); allocate_slots(req, n, computed_blocks); on None break; else move waiting→running, commit.

5. _preempt(request)

kv.free(request); num_computed_tokens = 0; num_preemptions += 1; status PREEMPTED; waiting.appendleft(request) (re-admit ASAP).

6. update_from_output(output, sampled)

For each scheduled request: num_computed_tokens += n; if it sampled, append the token and maybe_finish(). Reap finished requests: remove from running, kv.free.

7. needs_sample(request, num_scheduled) (static)

return request.num_computed_tokens + num_scheduled == request.num_tokens. Only fully-caught-up requests emit a token (mid-prefill chunks don't).

Definition of done

pytest mini_vllm/test_scheduler.py -q          # the reference suite (token budget, chunking,
                                               # preemption, prefix head-start)
pytest phase-03-continuous-batching-scheduler/labs -q

Then run the full engine and confirm the scheduler's correctness invariants hold end to end:

pytest mini_vllm/test_engine.py -q
# test_chunked_prefill_matches_unchunked_output and
# test_prefix_caching_matches_no_caching_output PROVE that these optimizations change
# *timing/memory*, never *output*. That property is the whole game.

Stretch (sets up later phases)

  1. PRIORITY policy — add a priority field to Request and a PRIORITY branch to the preemption victim selection (max by (priority, arrival_time)), mirroring scheduler.py:456.
  2. Swapping vs recompute — instead of resetting num_computed_tokens=0 on preempt, "swap" the blocks to a CPU pool and restore them on re-admit. Compare the cost model (recompute is compute, swap is bandwidth) — a real vLLM design axis.
  3. Stats — count preemptions, average batch size, and KV usage per step; you'll need these in Phase 18.