Phase 03 — The Hitchhiker's Guide to Continuous Batching & the Scheduler ⭐
← Phase 02 · Course home · Phase 04 →
Flagship phase — written in full. Phase 02 gave you the memory. This phase gives you the brain that decides who runs each step.
Contents
- Don't Panic
- Step 1: The big idea — there is no "prefill phase" or "decode phase"
- Step 2: Static batching (the bad old way) vs continuous batching
- Step 3: The token budget and chunked prefill
- Step 4: Prefix caching — the free head start
- Step 5: Preemption — the safety valve
- Step 6: The schedule, in order
- The invariants to memorize
- What you'll do in this phase
Don't Panic
The scheduler's whole job, once per token step, is to answer one question:
Given everyone who wants compute right now, and the memory I have, who runs this step and how many tokens does each get?
That's it. Everything famous about vLLM's throughput — continuous batching, chunked
prefill, prefix caching, preemption — is just a good answer to that one question,
computed fast, every single step. By the end of this phase you'll have written a working
continuous-batching scheduler (mini_vllm/scheduler.py) and read the real 2,300-line one
(upstream/vllm/v1/core/sched/scheduler.py).
Step 1: The big idea — there is no "prefill phase" or "decode phase"
This is the mental model the entire engine is built on. Read the real comment at the top of
Scheduler.schedule() (scheduler.py:330):
"There's no 'decoding phase' nor 'prefill phase' in the scheduler. Each request just has
num_computed_tokensandnum_tokens_with_spec. … At each step, the scheduler tries to assign tokens to the requests so that each request'snum_computed_tokenscan catch up to itsnum_tokens."
So every request is just a pair of numbers racing each other:
prompt = "Tell me a joke" (4 tokens, say)
num_tokens = 4, num_computed_tokens = 0
step: schedule 4 tokens ──► num_computed = 4 == num_tokens ─► emit 1 token ("Why")
num_tokens = 5, num_computed = 4
step: schedule 1 token ──► num_computed = 5 == num_tokens ─► emit 1 token ("did")
num_tokens = 6, num_computed = 5
...
"Prefill" is just "num_computed is far behind num_tokens." "Decode" is just "it's behind by
one, add one more." One uniform rule covers both. This is why chunked prefill, prefix
caching, and speculative decoding all fall out naturally instead of needing special cases.
Your mini_vllm/request.py is built around exactly this num_computed_tokens vs num_tokens
pair — go look.
Step 2: Static batching (the bad old way) vs continuous batching
Static batching: pick a batch of requests, run them together until they all finish, then start the next batch.
time ─►
req A (short): [#### done...................... idle ..............]
req B (med): [############# done............. idle ..............]
req C (long): [############################################ done ]
^ A finished here but its slot sits IDLE until C finishes.
The GPU runs at the speed of the slowest request in the batch, and finished requests waste their slot. Terrible utilization for mixed-length traffic (which is all real traffic).
Continuous batching: re-decide the batch every single step. The instant A finishes, its slot is freed and a waiting request D joins mid-flight.
time ─►
req A: [#### done]
req B: [#############done]
req C: [############################################done]
req D: [############### done] ← D joined the moment A left
req E: [######### done] ← E joined when B left
^ no idle slots; the GPU is always full.
This is the single biggest throughput win in modern LLM serving, and it's entirely a scheduling decision — same kernels, same model, just smarter batching. vLLM does this by default.
Step 3: The token budget and chunked prefill
If you let a brand-new 8,000-token prompt do its entire prefill in one step, every decode in flight stalls for that whole step → everyone's inter-token latency spikes. Bad.
The fix is a token budget per step: max_num_batched_tokens. The scheduler hands out at
most that many tokens total each step. A long prefill gets chunked — split across several
steps — so it shares each step with ongoing decodes instead of monopolizing one.
budget = 2048 tokens/step
step 1: [decode A:1][decode B:1] ... [prefill of new req: 2046 of its 8000 tokens]
step 2: [decode A:1][decode B:1] ... [prefill: next 2046 tokens]
... long prefill drips through the budget while decodes keep flowing.
In your mini_vllm/scheduler.py, this is _clamp_new_tokens (caps each request by the
remaining budget and by long_prefill_token_threshold) and the token_budget -= num_new_tokens
bookkeeping. The real code is the same idea at scheduler.py:348 (token_budget = self.max_num_scheduled_tokens) and :390 (long_prefill_token_threshold).
Step 4: Prefix caching — the free head start
Remember from Phase 02 that requests can share physical KV blocks. The scheduler exploits this
when admitting a waiting request: before allocating, it asks the KV manager
"how much of this prompt is already computed?" (get_computed_blocks). If a shared prefix is
cached, those tokens are already done — the request starts with num_computed_tokens > 0
for free.
Request A ran earlier with prompt "You are a helpful assistant. <Q1>" → its prefix blocks cached
Request B arrives with prompt "You are a helpful assistant. <Q2>"
scheduler: get_computed_blocks(B) → 6 blocks hit (the shared system prompt)
B starts with num_computed_tokens = 96, only needs to prefill <Q2>.
For a shared 2k-token system prompt across thousands of users, this is enormous — it's the
structural cost advantage behind multi-tenant serving. In mini_vllm, the WAITING loop calls
self.kv.get_computed_blocks(request) and sets request.num_computed_tokens = num_cached.
Real code: scheduler.py:591.
Step 5: Preemption — the safety valve
Continuous batching admits requests aggressively to keep the GPU full. Sometimes that means a running request needs another KV block and there are none left. What then?
The scheduler preempts: it evicts a running request (frees its KV blocks back to the pool), puts it back on the waiting queue, and gives its memory to someone who can make progress now. The preempted request will be recomputed later when memory frees up (its prompt + generated tokens are replayed; this is cheaper than it sounds thanks to prefill efficiency).
running: [A][B][C] free blocks: 0
C needs 1 more block → allocate_slots(C) returns None (OOM, Phase 02!)
→ preempt the most-recently-added running request (say C, or the lowest priority)
→ free its blocks, push it back to WAITING, retry
This is the None → preempt → retry handshake from Phase 02, seen from the scheduler's
side. In mini_vllm/scheduler.py it's the while True: loop around allocate_slots that pops
a victim from self.running and calls _preempt. Real code: scheduler.py:443–491.
Step 6: The schedule, in order
Putting it together, here's the shape of schedule() (yours and theirs):
token_budget = max_num_batched_tokens
# 1) RUNNING first — keep decodes flowing
for request in running:
n = clamp(request.num_tokens - request.num_computed_tokens, budget, prefill_threshold)
blocks = allocate_slots(request, n)
while blocks is None: # OOM
preempt(running.pop()); retry
schedule it; budget -= n
# 2) WAITING next — admit new work if budget + memory + seq-slots remain
while waiting and budget > 0 and len(running) < max_num_seqs and not preempted_this_step:
request = waiting[0]
computed_blocks, num_cached = get_computed_blocks(request) # prefix cache
request.num_computed_tokens = num_cached
n = clamp(request.num_tokens - num_cached, budget, prefill_threshold)
blocks = allocate_slots(request, n, computed_blocks)
if blocks is None: break # no memory to admit anyone else
move request waiting → running; budget -= n
Two subtleties worth noting now (and seeing in the real code):
- Running before waiting: progress for in-flight requests beats starting new ones (latency).
- No admit after a preemption this step: if we just had to preempt, the system is under
memory pressure — don't pour more in. (
mini_vllm:and not out.preempted_req_ids; real:scheduler.py:545if not preempted_reqs ...).
The invariants to memorize
- A request is always either in
waitingorrunning(never both, never neither while unfinished). sum(num_scheduled_tokens) ≤ max_num_batched_tokensevery step (the budget holds).len(running) ≤ max_num_seqsevery step.- A scheduled request emits a token this step iff
num_computed_tokens + num_scheduled == num_tokens(prefill fully caught up). Mid-prefill chunks emit nothing. - Preemption frees KV and resets
num_computed_tokens = 0(full recompute on re-admit).
What you'll do in this phase
- Read: 01-deep-dive.md walks the real
schedule()line by line. - Build: 02-mini-build.md — write the scheduler (reference:
mini_vllm/scheduler.py). - Labs (see labs/README.md; recommended order 01 → 02 → 05 → 04 → 06 → 03):
lab-01-scheduler-step[CPU-OK]— implement the budget + running/waiting loop; pass the tests.lab-02-chunked-prefill[CPU-OK]— prove chunking changes timing, not output, and predict step counts.lab-03-prefix-cache-hitrate[GPU-OPT]— measure real prefix-cache hit rate and its memory effect.lab-04-preemption[CPU-OK]— force a preemption and prove the request still completes correctly.lab-05-decode-latency-spikes[CPU-OK]— measure the ITL spike a long prefill inflicts on a decode stream ([257, 2, 1, ...]) and how the chunk threshold caps it ([33,×8, ...]).lab-06-prefix-cache-savings[CPU-OK]— account for prefix caching to the exact token (544 vs 96 scheduled tokens; savings ≡ followers × shared full blocks), outputs identical.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
When you can whiteboard schedule() and explain the budget, chunking, prefix head-start, and
preemption handshake from memory, you understand the component that defines vLLM's throughput.
← Phase 02 · Course home · Phase 04 →