Phase 03 — Cheatsheet: Continuous Batching & Scheduler
Contents
- The one-liner
- The master model
- schedule() shape
- The four/five invariants
- Knobs (→ Phase 18)
- The Phase 02 ↔ 03 seam
- Key upstream
- Gotchas
The one-liner
Every token step, re-decide the batch: schedule RUNNING first, admit WAITING, under a token
budget + seq-slot cap. Continuous batching, chunked prefill, prefix caching, preemption all fall
out of "make num_computed_tokens catch up to num_tokens."
The master model
No prefill/decode phase. Request = (num_computed_tokens racing num_tokens).
Prefill = far behind. Decode = behind by one. (scheduler.py:330)
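A minimal sketch of this model, using a hypothetical `Req` class (an illustration, not vLLM's `Request`). The point it tries to show is that "prefill" and "decode" are never stored; they are just the size of the gap between two counters.

```python
from dataclasses import dataclass


@dataclass
class Req:
    """Toy stand-in for a request: two counters, no phase flag."""
    num_tokens: int               # prompt tokens + tokens generated so far
    num_computed_tokens: int = 0  # tokens already processed by the model

    @property
    def gap(self) -> int:
        # Large gap ~ "prefill", gap of one ~ "decode".
        return self.num_tokens - self.num_computed_tokens

    def advance(self, n: int) -> bool:
        """Process n more tokens; return True if a new token was sampled."""
        assert 0 < n <= self.gap
        self.num_computed_tokens += n
        if self.gap == 0:
            self.num_tokens += 1  # the sampled token reopens a gap of one
            return True
        return False


r = Req(num_tokens=10)  # a 10-token prompt
emitted = [r.advance(4), r.advance(4), r.advance(2), r.advance(1)]
```

The first two calls are partial prefill chunks and emit nothing; the third closes the gap and samples a token; the fourth is an ordinary decode step.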
schedule() shape
budget = max_num_batched_tokens
# A) RUNNING: n = clamp(num_tokens - num_computed, budget, threshold);
# allocate_slots; None -> preempt running.pop(); retry; commit; budget -= n
# B) WAITING: while budget>0 and len(running)<max_num_seqs and not preempted:
# get_computed_blocks (prefix cache) -> num_computed; clamp; allocate; None -> break; admit
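The shape above can be sketched as a toy, runnable step function. Everything here is an assumption made for illustration: `Req`, the one-block-per-token accounting, and a plain integer `free_blocks` standing in for `allocate_slots`. Prefix caching is left out, and the real `schedule()` handles far more cases.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    rid: str
    num_tokens: int
    num_computed: int = 0
    blocks: int = 0  # toy KV accounting: one block per token


def schedule(running, waiting, free_blocks, budget, max_seqs, threshold):
    """Run one scheduling step. Returns ({rid: tokens}, free_blocks left)."""
    scheduled = {}
    preempted = False

    # A) RUNNING requests get first claim on the token budget.
    i = 0
    while i < len(running) and budget > 0:
        req = running[i]
        n = min(req.num_tokens - req.num_computed, budget, threshold)
        evicted_self = False
        while free_blocks < n:            # toy "allocate_slots returned None"
            victim = running.pop()        # evict the most recently admitted
            free_blocks += victim.blocks
            victim.blocks = victim.num_computed = 0  # recompute on re-admit
            waiting.appendleft(victim)
            preempted = True
            if victim is req:             # nothing left to evict
                evicted_self = True
                break
        if evicted_self:
            break
        free_blocks -= n
        req.blocks += n
        req.num_computed += n
        scheduled[req.rid] = n
        budget -= n
        i += 1

    # B) WAITING requests are admitted only if nothing was preempted.
    while waiting and budget > 0 and len(running) < max_seqs and not preempted:
        req = waiting[0]
        n = min(req.num_tokens - req.num_computed, budget, threshold)
        if free_blocks < n:
            break                         # first OOM stops admission
        waiting.popleft()
        free_blocks -= n
        req.blocks += n
        req.num_computed += n
        scheduled[req.rid] = n
        budget -= n
        running.append(req)

    return scheduled, free_blocks


running = [Req("a", num_tokens=101, num_computed=100, blocks=100)]
waiting = deque([Req("b", num_tokens=50), Req("c", num_tokens=50)])
out, free = schedule(running, waiting, free_blocks=1000,
                     budget=32, max_seqs=8, threshold=16)
```

In this run the decoding request `a` takes one token, `b` is capped at the threshold (16), and `c` receives the remaining 15 tokens of the budget, so one step mixes a decode with two prefill chunks.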
The four/five invariants
- a request is in exactly one of {waiting, running} while unfinished
- `sum(num_scheduled_tokens) <= max_num_batched_tokens`
- `len(running) <= max_num_seqs`
- emits a token iff `num_computed + num_scheduled == num_tokens`
- preempt frees KV + resets `num_computed = 0` (recompute on re-admit)
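These invariants are easy to turn into assertions. The helper names below (`check_step`, `emits_token`) are hypothetical and operate on plain request ids and counts rather than on vLLM objects; treat this as a sketch of what a test might check after each step.

```python
def check_step(waiting, running, scheduled, max_num_batched_tokens, max_num_seqs):
    """Raise AssertionError if a step breaks the queue or budget invariants.

    waiting, running: collections of request ids.
    scheduled: dict mapping request id -> tokens scheduled this step.
    """
    assert not set(waiting) & set(running), "request in both queues"
    assert sum(scheduled.values()) <= max_num_batched_tokens, "token budget exceeded"
    assert len(running) <= max_num_seqs, "too many running requests"


def emits_token(num_computed, num_scheduled, num_tokens):
    """A step samples a token only when it clears the request's whole backlog."""
    return num_computed + num_scheduled == num_tokens


check_step(waiting=["c"], running=["a", "b"], scheduled={"a": 1, "b": 16},
           max_num_batched_tokens=32, max_num_seqs=8)
decode_emits = emits_token(num_computed=100, num_scheduled=1, num_tokens=101)
chunk_emits = emits_token(num_computed=0, num_scheduled=16, num_tokens=50)
```

A decode step always emits; a prefill chunk emits only if it is the last one.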
Knobs (→ Phase 18)
- `max_num_batched_tokens`: per-step token budget (chunked prefill granularity)
- `max_num_seqs`: max concurrent running requests
- `long_prefill_token_threshold`: per-request prefill chunk cap
- `enable_prefix_caching`: share prefix KV across requests
- scheduling policy: FCFS vs PRIORITY (preemption victim choice)
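A rough back-of-the-envelope for how the first and third knobs interact. The function `prefill_steps` is hypothetical, and it assumes chunked prefill is on, the prompt has the whole budget to itself, no prefix-cache hit, and that a threshold of 0 means "no per-request cap".

```python
import math


def prefill_steps(prompt_len, max_num_batched_tokens, long_prefill_token_threshold=0):
    """Steps needed to prefill one prompt under the assumptions stated above."""
    chunk = max_num_batched_tokens
    if long_prefill_token_threshold > 0:
        # The per-request cap can only shrink the chunk, never grow it.
        chunk = min(chunk, long_prefill_token_threshold)
    return math.ceil(prompt_len / chunk)


uncapped = prefill_steps(10_000, max_num_batched_tokens=2048)
capped = prefill_steps(10_000, max_num_batched_tokens=2048,
                       long_prefill_token_threshold=512)
```

With these numbers the uncapped prompt finishes in 5 steps and the capped one in 20. The cap trades a slower prefill for that request against leaving budget for other requests in every step.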
The Phase 02 ↔ 03 seam
Scheduler decides policy; KVCacheManager is the source of truth for KV blocks. allocate_slots returns None on OOM
→ scheduler preempts + retries. Scheduler never touches blocks; KV manager never sets policy.
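The division of labour can be sketched with a toy manager. `ToyKVManager` and `allocate_or_preempt` are hypothetical names, the block arithmetic is simplified, and only the "return `None`, let the caller decide" contract is taken from the text above.

```python
class ToyKVManager:
    """Owns block accounting. Knows nothing about queues or victim choice."""

    def __init__(self, num_blocks, block_size=16):
        self.free_blocks = num_blocks
        self.block_size = block_size
        self.held = {}    # request id -> blocks held
        self.tokens = {}  # request id -> token slots allocated so far

    def allocate_slots(self, rid, num_new_tokens):
        total = self.tokens.get(rid, 0) + num_new_tokens
        need = -(-total // self.block_size) - self.held.get(rid, 0)  # ceil div
        if need > self.free_blocks:
            return None   # out of memory: the caller decides what to do
        self.free_blocks -= need
        self.held[rid] = self.held.get(rid, 0) + need
        self.tokens[rid] = total
        return need       # may be 0, which is still a success

    def free(self, rid):
        self.free_blocks += self.held.pop(rid, 0)
        self.tokens.pop(rid, None)


def allocate_or_preempt(kv, running, rid, num_new_tokens):
    """Scheduler side: choose victims until the allocation succeeds.

    rid must be in running. Returns (success, list of preempted ids).
    """
    preempted = []
    while kv.allocate_slots(rid, num_new_tokens) is None:
        victim = running.pop()   # policy choice: newest running request
        kv.free(victim)
        preempted.append(victim)
        if victim == rid:
            return False, preempted  # evicted itself, nothing left to free
    return True, preempted


kv = ToyKVManager(num_blocks=4, block_size=16)
running = ["a", "b"]
kv.allocate_slots("a", 32)  # 2 blocks
kv.allocate_slots("b", 32)  # 2 blocks, pool now empty
ok, victims = allocate_or_preempt(kv, running, "a", 1)  # needs a 3rd block
```

Note the `is None` check: a successful allocation that needs zero new blocks returns `0`, so a truthiness test would misread it as a failure.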
Key upstream
- `vllm/v1/core/sched/scheduler.py:329`: `schedule()` · `:443`: preemption loop · `:591`: prefix-cache head start
- `scheduler.py:1283`: `update_from_output`
- `vllm/v1/core/sched/output.py:181`: `SchedulerOutput` (New vs Cached request data)
- `vllm/v1/request.py:315`: `RequestStatus`
Gotchas
- `allocate_slots == None` is normal control flow (drives preemption), not an error.
- Admission stops on first OOM (`break`); running phase retries after preempting.
- No admission in a step that preempted (avoid thrashing).
- A request longer than the whole KV cache can never fit → ignored/aborted, not deadlock.
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md