Phase 03 — Cheatsheet: Continuous Batching & Scheduler

The one-liner
The master model
schedule() shape
The four/five invariants
Knobs (→ Phase 18)
The Phase 02 ↔ 03 seam
Key upstream
Gotchas

The one-liner

Every step, re-decide the batch from scratch: schedule RUNNING requests first, then admit WAITING ones, all under a per-step token budget and a cap on concurrent sequences. Continuous batching, chunked prefill, prefix caching, and preemption all fall out of one rule: "make num_computed_tokens catch up to num_tokens."

The master model

The scheduler has no separate prefill or decode phase. A request is just num_computed_tokens racing num_tokens: prefill = far behind, decode = behind by one. (scheduler.py:330)
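A minimal sketch of that model, assuming a hypothetical ToyRequest class (not the upstream Request type). It only illustrates that "prefill" and "decode" are labels for how large the lag is, not states the scheduler tracks.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request; field names mirror the cheatsheet."""
    num_tokens: int                # prompt tokens + tokens generated so far
    num_computed_tokens: int = 0   # tokens whose KV has already been computed

    @property
    def lag(self) -> int:
        # The only quantity the scheduling rule cares about.
        return self.num_tokens - self.num_computed_tokens


def describe(req: ToyRequest) -> str:
    # Labels are for humans; the scheduler just schedules up to `lag` tokens.
    if req.lag > 1:
        return "prefill-like"
    if req.lag == 1:
        return "decode-like"
    return "caught up"


fresh = ToyRequest(num_tokens=512)
decoding = ToyRequest(num_tokens=513, num_computed_tokens=512)
print(describe(fresh), describe(decoding))
```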

schedule() shape

budget = max_num_batched_tokens
# A) RUNNING: n = min(num_tokens - num_computed, budget, threshold)  # threshold applies only if > 0
#    allocate_slots(req, n); None -> preempt running.pop(), then retry;
#    on success: commit; budget -= n
# B) WAITING: while budget > 0 and len(running) < max_num_seqs and not preempted:
#    get_computed_blocks (prefix cache) -> num_computed; n = min(...) as in A;
#    allocate_slots(req, n); None -> break; else admit to running; budget -= n
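The shape above can be exercised with a toy, runnable sketch. Everything here is an assumption made for illustration: Req, the allocate/free callables, and the token-counting pool are hypothetical stand-ins, the prefix-cache lookup is omitted, and the victim is always the tail of the running list. It is not the upstream implementation.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List, Optional


@dataclass
class Req:
    rid: str
    num_tokens: int
    num_computed: int = 0


def schedule(running: List[Req], waiting: Deque[Req], *, max_num_batched_tokens: int,
             max_num_seqs: int, threshold: int,
             allocate: Callable[[Req, int], Optional[bool]],
             free: Callable[[Req], None]) -> Dict[str, int]:
    budget = max_num_batched_tokens
    scheduled: Dict[str, int] = {}
    preempted = False

    def clamp(req: Req) -> int:
        n = min(req.num_tokens - req.num_computed, budget)
        return min(n, threshold) if threshold > 0 else n

    # A) RUNNING first: requests already holding KV are served before newcomers.
    i = 0
    while i < len(running) and budget > 0:
        req, n = running[i], clamp(running[i])
        evicted_self = False
        while allocate(req, n) is None:
            victim = running.pop()        # tail of the list is the victim
            free(victim)
            victim.num_computed = 0       # recompute from scratch on re-admit
            waiting.appendleft(victim)
            preempted = True
            if victim is req:
                evicted_self = True
                break
        if evicted_self:
            break
        scheduled[req.rid] = n
        budget -= n
        i += 1

    # B) WAITING: admit only if nothing was preempted in this step.
    while waiting and budget > 0 and len(running) < max_num_seqs and not preempted:
        req = waiting[0]
        n = clamp(req)                    # prefix-cache lookup omitted in this toy
        if allocate(req, n) is None:
            break
        running.append(waiting.popleft())
        scheduled[req.rid] = n
        budget -= n
    return scheduled


# Toy KV pool that counts free token slots instead of blocks.
pool = {"free": 8}
held: Dict[str, int] = {}


def allocate(req: Req, n: int) -> Optional[bool]:
    if n > pool["free"]:
        return None
    pool["free"] -= n
    held[req.rid] = held.get(req.rid, 0) + n
    return True


def free(req: Req) -> None:
    pool["free"] += held.pop(req.rid, 0)


running: List[Req] = []
waiting: Deque[Req] = deque([Req("a", 6), Req("b", 6)])
knobs = dict(max_num_batched_tokens=8, max_num_seqs=4, threshold=0,
             allocate=allocate, free=free)

step1 = schedule(running, waiting, **knobs)
for r in running:                         # commit step 1
    r.num_computed += step1[r.rid]
running[0].num_tokens += 1                # "a" caught up and sampled one token
step2 = schedule(running, waiting, **knobs)
print(step1, step2)
```

In this toy run, step 1 gives "a" its full 6-token prompt and "b" a 2-token chunk of its prefill. In step 2 the pool is full, so scheduling one decode token for "a" preempts "b", which returns to the head of the waiting queue with num_computed reset to 0, and no admission happens in that step.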

The four/five invariants

a request is in exactly one of {waiting, running} while unfinished
sum(num_scheduled_tokens) <= max_num_batched_tokens
len(running) <= max_num_seqs
emits a token iff num_computed + num_scheduled == num_tokens
preempt frees KV + resets num_computed = 0 (recompute on re-admit)
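The first four invariants are cheap to check after every step. The helper below is a sketch with hypothetical names (invariant_violations, emits_token); it works on plain request ids and counts, not on upstream objects. The preemption invariant is left out because it describes a state transition, not a per-step snapshot.

```python
from typing import Dict, Iterable, List


def invariant_violations(unfinished: Iterable[str], waiting: Iterable[str],
                         running: Iterable[str], num_scheduled: Dict[str, int],
                         *, max_num_batched_tokens: int,
                         max_num_seqs: int) -> List[str]:
    """Return a human-readable list of broken invariants (empty means OK)."""
    waiting, running = list(waiting), list(running)
    problems: List[str] = []
    for rid in unfinished:
        # Exactly one of {waiting, running}, and only once in that queue.
        if waiting.count(rid) + running.count(rid) != 1:
            problems.append(f"{rid}: not in exactly one queue")
    if sum(num_scheduled.values()) > max_num_batched_tokens:
        problems.append("token budget exceeded")
    if len(running) > max_num_seqs:
        problems.append("too many running requests")
    return problems


def emits_token(num_computed: int, num_scheduled: int, num_tokens: int) -> bool:
    # A step samples a token only when it closes the gap completely.
    return num_computed + num_scheduled == num_tokens


ok = invariant_violations(["a", "b"], waiting=["b"], running=["a"],
                          num_scheduled={"a": 6},
                          max_num_batched_tokens=8, max_num_seqs=4)
print(ok, emits_token(0, 4, 6), emits_token(0, 6, 6))
```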

Knobs (→ Phase 18)

max_num_batched_tokens — per-step token budget (chunked prefill granularity)
max_num_seqs — max concurrent running requests
long_prefill_token_threshold — per-request prefill chunk cap
enable_prefix_caching — share prefix KV across requests
scheduling policy — FCFS vs PRIORITY (preemption victim choice)
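A rough way to reason about the first, third, and fourth knobs together is to count how many steps a prompt needs before it emits its first token. The helper below is a back-of-the-envelope sketch with a hypothetical name (prefill_steps). It assumes the request gets the whole token budget in every step and that cached_prefix is smaller than the prompt, so it gives a best-case lower bound.

```python
import math


def prefill_steps(prompt_len: int, max_num_batched_tokens: int,
                  long_prefill_token_threshold: int = 0,
                  cached_prefix: int = 0) -> int:
    """Best-case number of scheduler steps to finish prefill for one request."""
    per_step = max_num_batched_tokens
    if long_prefill_token_threshold > 0:
        # The threshold caps how much of one prompt a single step may take.
        per_step = min(per_step, long_prefill_token_threshold)
    remaining = prompt_len - cached_prefix   # prefix-cache hit is a head start
    return math.ceil(remaining / per_step)


print(prefill_steps(10_000, 8192))                      # 2
print(prefill_steps(10_000, 8192, 2048))                # 5
print(prefill_steps(10_000, 8192, 2048, cached_prefix=4096))  # 3
```

A smaller threshold spreads a long prompt over more steps, which leaves more of each step's budget for other requests; a prefix-cache hit shortens the race by starting num_computed above zero.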

The Phase 02 ↔ 03 seam

Scheduler decides policy; KVCacheManager is the source of truth for block state. allocate_slots returns None when it cannot find enough free KV blocks → scheduler preempts + retries. Scheduler never touches blocks; KV manager never sets policy.
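The KV side of the seam can be sketched as a small block accountant. ToyKVManager below is hypothetical: it keeps only counts per request id, and its return value (number of newly allocated blocks) is an assumption chosen for this toy, not the upstream return type. The caller plays the scheduler: it sees None, chooses a victim, frees it, and retries.

```python
import math
from typing import Dict, Optional


class ToyKVManager:
    """Hypothetical block accountant; knows nothing about scheduling policy."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = num_blocks
        self.blocks_held: Dict[str, int] = {}
        self.tokens_held: Dict[str, int] = {}

    def allocate_slots(self, rid: str, num_new_tokens: int) -> Optional[int]:
        tokens = self.tokens_held.get(rid, 0) + num_new_tokens
        needed = math.ceil(tokens / self.block_size) - self.blocks_held.get(rid, 0)
        if needed > self.free_blocks:
            return None                   # signal only; the caller decides what to do
        self.free_blocks -= needed
        self.blocks_held[rid] = self.blocks_held.get(rid, 0) + needed
        self.tokens_held[rid] = tokens
        return needed                     # may be 0 when the last block has room

    def free(self, rid: str) -> None:
        self.free_blocks += self.blocks_held.pop(rid, 0)
        self.tokens_held.pop(rid, None)


kv = ToyKVManager(num_blocks=2, block_size=4)
kv.allocate_slots("a", 4)                 # "a" fills one block
kv.allocate_slots("b", 4)                 # "b" fills the other
first_try = kv.allocate_slots("a", 1)     # needs a new block, none free

# Scheduler side of the seam: the policy decision (who to evict) lives here.
if first_try is None:
    kv.free("b")
    retry = kv.allocate_slots("a", 1)
same_block = kv.allocate_slots("a", 1)    # fits in the block just allocated
print(first_try, retry, same_block)
```

In this toy, a successful allocation can return 0, which is falsy, so the caller compares with `is None` and does not rely on truthiness.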

Key upstream

vllm/v1/core/sched/scheduler.py:329 — schedule()
:443 — preemption loop · :591 — prefix-cache head start
scheduler.py:1283 — update_from_output
vllm/v1/core/sched/output.py:181 — SchedulerOutput (New vs Cached request data)
vllm/v1/request.py:315 — RequestStatus

Gotchas

allocate_slots returning None is normal control flow (drives preemption), not an error.
Admission stops at the first failed allocation (break); the running phase instead preempts and retries.
No admission in a step that preempted (avoid thrashing).
A request longer than the whole KV cache can never fit → ignored/aborted, not deadlock.

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

vLLM Mastery — From Zero to Maintainer