Phase 03 — Cheatsheet: Continuous Batching & Scheduler

The one-liner

Every token step, re-decide the batch: schedule RUNNING first, admit WAITING, under a token budget + seq-slot cap. Continuous batching, chunked prefill, prefix caching, preemption all fall out of "make num_computed_tokens catch up to num_tokens."

The master model

No prefill/decode phase. Request = (num_computed_tokens racing num_tokens). Prefill = far behind. Decode = behind by one. (scheduler.py:330)
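A minimal sketch of that model, not the upstream `Request` class: `ToyRequest`, `lag`, `describe`, and `advance` are hypothetical names used only for illustration. It shows that "prefill" and "decode" are just labels for how far `num_computed_tokens` trails `num_tokens`.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request: two counters, no phase flag."""
    num_tokens: int                # prompt tokens + tokens generated so far
    num_computed_tokens: int = 0   # tokens whose KV has been computed

    @property
    def lag(self) -> int:
        return self.num_tokens - self.num_computed_tokens

    def describe(self) -> str:
        # The "phase" is derived from the lag, never stored.
        if self.lag > 1:
            return "prefill-like"
        if self.lag == 1:
            return "decode-like"
        return "caught up"


def advance(req: ToyRequest, n: int) -> bool:
    """Apply one step of n scheduled tokens; return True if a token is sampled."""
    assert 0 < n <= req.lag
    req.num_computed_tokens += n
    if req.num_computed_tokens == req.num_tokens:
        req.num_tokens += 1  # the sampled token is the next one to compute
        return True
    return False
```

A 10-token prompt scheduled as 6 + 4 tokens emits nothing on the first step and one token on the second, after which it is behind by exactly one.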

schedule() shape

budget = max_num_batched_tokens
# A) RUNNING: n = min(num_tokens - num_computed, budget, threshold);
#    allocate_slots; None -> preempt running.pop(); retry; commit; budget -= n
# B) WAITING: while budget>0 and len(running)<max_num_seqs and not preempted:
#    get_computed_blocks (prefix cache) -> num_computed; n = min(...) as in A; allocate; None -> break; admit
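The shape above can be fleshed out as a runnable toy. This is a sketch under stated assumptions, not the upstream scheduler: `Req` and `ToyKV` are hypothetical, the KV manager counts token slots rather than blocks, `threshold` is assumed to be a positive integer, the prefix-cache lookup is omitted, and every running request is assumed to be at least one token behind.

```python
from dataclasses import dataclass


@dataclass
class Req:
    rid: str
    num_tokens: int
    num_computed_tokens: int = 0


class ToyKV:
    """Hypothetical KV manager: tracks token slots held per request."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.held: dict[str, int] = {}

    def allocate_slots(self, req: Req, n: int):
        if sum(self.held.values()) + n > self.capacity:
            return None  # out of space: normal control flow, not an error
        self.held[req.rid] = self.held.get(req.rid, 0) + n
        return n

    def free(self, req: Req) -> None:
        self.held.pop(req.rid, None)


def schedule(running, waiting, kv, max_num_batched_tokens, max_num_seqs, threshold):
    budget = max_num_batched_tokens
    scheduled: dict[str, int] = {}
    preempted = False

    # A) RUNNING first: let each running request catch up as far as allowed.
    i = 0
    while i < len(running) and budget > 0:
        req = running[i]
        n = min(req.num_tokens - req.num_computed_tokens, budget, threshold)
        ok = kv.allocate_slots(req, n) is not None
        while not ok:
            victim = running.pop()          # preempt the last running request
            kv.free(victim)
            victim.num_computed_tokens = 0  # recompute on re-admit
            waiting.insert(0, victim)
            preempted = True
            if victim is req:               # preempted itself: nothing left to retry
                break
            ok = kv.allocate_slots(req, n) is not None
        if not ok:
            break
        scheduled[req.rid] = n
        budget -= n
        i += 1

    # B) WAITING: admit only if nothing was preempted in this step.
    while waiting and budget > 0 and len(running) < max_num_seqs and not preempted:
        req = waiting[0]
        n = min(req.num_tokens - req.num_computed_tokens, budget, threshold)
        if kv.allocate_slots(req, n) is None:
            break                           # first OOM stops admission
        running.append(waiting.pop(0))
        scheduled[req.rid] = n
        budget -= n
    return scheduled
```

The caller is expected to add each scheduled count to `num_computed_tokens` between steps (the job of `update_from_output` upstream). Because victims are popped from the tail while the loop walks from the head, a victim in this toy has never been scheduled in the current step.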

The five invariants

  1. a request is in exactly one of {waiting, running} while unfinished
  2. sum(num_scheduled_tokens) <= max_num_batched_tokens
  3. len(running) <= max_num_seqs
  4. emits a token iff num_computed + num_scheduled == num_tokens
  5. preempt frees KV + resets num_computed = 0 (recompute on re-admit)
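Invariants 1 to 4 are cheap to check after every step. The helper names below are hypothetical, and the queue check only catches a request sitting in both queues, not one missing from both; invariant 5 is not covered. Treat it as a debugging sketch, not upstream code.

```python
def violated_invariants(running, waiting, num_scheduled_tokens,
                        max_num_batched_tokens, max_num_seqs):
    """Return a list of human-readable violations (empty list means OK).

    running / waiting: lists of request ids.
    num_scheduled_tokens: dict of request id -> tokens scheduled this step.
    """
    problems = []
    if set(running) & set(waiting):                      # invariant 1 (partial)
        problems.append("request in both waiting and running")
    if sum(num_scheduled_tokens.values()) > max_num_batched_tokens:  # invariant 2
        problems.append("token budget exceeded")
    if len(running) > max_num_seqs:                      # invariant 3
        problems.append("too many running requests")
    return problems


def emits_token(num_tokens, num_computed, num_scheduled):
    """Invariant 4: a token is sampled only when the step closes the gap."""
    return num_computed + num_scheduled == num_tokens
```

For example, `emits_token(10, 6, 4)` is true (the final prefill chunk samples the first output token), while `emits_token(10, 0, 6)` is false (a mid-prefill chunk produces nothing).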

Knobs (→ Phase 18)

  • max_num_batched_tokens — per-step token budget (chunked prefill granularity)
  • max_num_seqs — max concurrent running requests
  • long_prefill_token_threshold — per-request prefill chunk cap
  • enable_prefix_caching — share prefix KV across requests
  • scheduling policy — FCFS vs PRIORITY (preemption victim choice)
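A back-of-envelope sketch of how the first three knobs interact. `prefill_steps` is a hypothetical helper; it assumes the request gets the whole token budget every step (no competing requests) and that a threshold of 0 means "no per-request cap".

```python
import math


def prefill_steps(prompt_len, max_num_batched_tokens,
                  long_prefill_token_threshold=0, cached_prefix=0):
    """Steps needed to prefill one prompt when it has the batch to itself."""
    assert 0 <= cached_prefix < prompt_len
    chunk = max_num_batched_tokens
    if long_prefill_token_threshold > 0:   # assumed: 0 disables the cap
        chunk = min(chunk, long_prefill_token_threshold)
    remaining = prompt_len - cached_prefix  # prefix-cache hit skips these tokens
    return math.ceil(remaining / chunk)
```

With a 10,000-token prompt and an 8,192-token budget this gives 2 steps; adding a 2,048-token threshold raises it to 5; a 4,096-token prefix-cache hit on top of that brings it back down to 3.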

The Phase 02 ↔ 03 seam

Scheduler decides policy; KVCacheManager is truth. allocate_slots returns None on OOM → scheduler preempts + retries. Scheduler never touches blocks; KV manager never sets policy.

Key upstream

  • vllm/v1/core/sched/scheduler.py:329 schedule()
  • :443 — preemption loop · :591 — prefix-cache head start
  • scheduler.py:1283 update_from_output
  • vllm/v1/core/sched/output.py:181 SchedulerOutput (New vs Cached request data)
  • vllm/v1/request.py:315 RequestStatus

Gotchas

  • allocate_slots returning None is normal control flow (drives preemption), not an error.
  • Admission stops on first OOM (break); running phase retries after preempting.
  • No admission in a step that preempted (avoid thrashing).
  • A request longer than the whole KV cache can never fit → ignored/aborted, not deadlock.

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md