Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 03 — The Hitchhiker's Guide to Continuous Batching & the Scheduler ⭐

Phase 02 · Course home · Phase 04

Flagship phase — written in full. Phase 02 gave you the memory. This phase gives you the brain that decides who runs each step.

Contents


Don't Panic

The scheduler's whole job, once per token step, is to answer one question:

Given everyone who wants compute right now, and the memory I have, who runs this step and how many tokens does each get?

That's it. Everything famous about vLLM's throughput — continuous batching, chunked prefill, prefix caching, preemption — is just a good answer to that one question, computed fast, every single step. By the end of this phase you'll have written a working continuous-batching scheduler (mini_vllm/scheduler.py) and read the real 2,300-line one (upstream/vllm/v1/core/sched/scheduler.py).


Step 1: The big idea — there is no "prefill phase" or "decode phase"

This is the mental model the entire engine is built on. Read the real comment at the top of Scheduler.schedule() (scheduler.py:330):

"There's no 'decoding phase' nor 'prefill phase' in the scheduler. Each request just has num_computed_tokens and num_tokens_with_spec. … At each step, the scheduler tries to assign tokens to the requests so that each request's num_computed_tokens can catch up to its num_tokens."

So every request is just a pair of numbers racing each other:

prompt = "Tell me a joke"          (4 tokens, say)
                                    num_tokens = 4,  num_computed_tokens = 0

step: schedule 4 tokens  ──►       num_computed = 4  == num_tokens  ─► emit 1 token ("Why")
                                    num_tokens = 5,  num_computed = 4
step: schedule 1 token   ──►       num_computed = 5  == num_tokens  ─► emit 1 token ("did")
                                    num_tokens = 6,  num_computed = 5
...

"Prefill" is just "num_computed is far behind num_tokens." "Decode" is just "it's behind by one, add one more." One uniform rule covers both. This is why chunked prefill, prefix caching, and speculative decoding all fall out naturally instead of needing special cases. Your mini_vllm/request.py is built around exactly this num_computed_tokens vs num_tokens pair — go look.


Step 2: Static batching (the bad old way) vs continuous batching

Static batching: pick a batch of requests, run them together until they all finish, then start the next batch.

time ─►
req A (short):  [#### done...................... idle ..............]
req B (med):    [############# done............. idle ..............]
req C (long):   [############################################ done ]
                 ^ A finished here but its slot sits IDLE until C finishes.

The GPU runs at the speed of the slowest request in the batch, and finished requests waste their slot. Terrible utilization for mixed-length traffic (which is all real traffic).

Continuous batching: re-decide the batch every single step. The instant A finishes, its slot is freed and a waiting request D joins mid-flight.

time ─►
req A:  [#### done]
req B:  [#############done]
req C:  [############################################done]
req D:       [############### done]      ← D joined the moment A left
req E:              [######### done]      ← E joined when B left
                 ^ no idle slots; the GPU is always full.

This is the single biggest throughput win in modern LLM serving, and it's entirely a scheduling decision — same kernels, same model, just smarter batching. vLLM does this by default.


Step 3: The token budget and chunked prefill

If you let a brand-new 8,000-token prompt do its entire prefill in one step, every decode in flight stalls for that whole step → everyone's inter-token latency spikes. Bad.

The fix is a token budget per step: max_num_batched_tokens. The scheduler hands out at most that many tokens total each step. A long prefill gets chunked — split across several steps — so it shares each step with ongoing decodes instead of monopolizing one.

budget = 2048 tokens/step

step 1: [decode A:1][decode B:1] ... [prefill of new req: 2046 of its 8000 tokens]
step 2: [decode A:1][decode B:1] ... [prefill: next 2046 tokens]
...     long prefill drips through the budget while decodes keep flowing.

In your mini_vllm/scheduler.py, this is _clamp_new_tokens (caps each request by the remaining budget and by long_prefill_token_threshold) and the token_budget -= num_new_tokens bookkeeping. The real code is the same idea at scheduler.py:348 (token_budget = self.max_num_scheduled_tokens) and :390 (long_prefill_token_threshold).


Step 4: Prefix caching — the free head start

Remember from Phase 02 that requests can share physical KV blocks. The scheduler exploits this when admitting a waiting request: before allocating, it asks the KV manager "how much of this prompt is already computed?" (get_computed_blocks). If a shared prefix is cached, those tokens are already done — the request starts with num_computed_tokens > 0 for free.

Request A ran earlier with prompt "You are a helpful assistant. <Q1>"  → its prefix blocks cached
Request B arrives with prompt   "You are a helpful assistant. <Q2>"
  scheduler: get_computed_blocks(B) → 6 blocks hit (the shared system prompt)
  B starts with num_computed_tokens = 96, only needs to prefill <Q2>.

For a shared 2k-token system prompt across thousands of users, this is enormous — it's the structural cost advantage behind multi-tenant serving. In mini_vllm, the WAITING loop calls self.kv.get_computed_blocks(request) and sets request.num_computed_tokens = num_cached. Real code: scheduler.py:591.


Step 5: Preemption — the safety valve

Continuous batching admits requests aggressively to keep the GPU full. Sometimes that means a running request needs another KV block and there are none left. What then?

The scheduler preempts: it evicts a running request (frees its KV blocks back to the pool), puts it back on the waiting queue, and gives its memory to someone who can make progress now. The preempted request will be recomputed later when memory frees up (its prompt + generated tokens are replayed; this is cheaper than it sounds thanks to prefill efficiency).

running: [A][B][C]   free blocks: 0
C needs 1 more block → allocate_slots(C) returns None (OOM, Phase 02!)
  → preempt the most-recently-added running request (say C, or the lowest priority)
  → free its blocks, push it back to WAITING, retry

This is the None → preempt → retry handshake from Phase 02, seen from the scheduler's side. In mini_vllm/scheduler.py it's the while True: loop around allocate_slots that pops a victim from self.running and calls _preempt. Real code: scheduler.py:443–491.


Step 6: The schedule, in order

Putting it together, here's the shape of schedule() (yours and theirs):

token_budget = max_num_batched_tokens

# 1) RUNNING first — keep decodes flowing
for request in running:
    n = clamp(request.num_tokens - request.num_computed_tokens, budget, prefill_threshold)
    blocks = allocate_slots(request, n)
    while blocks is None:              # OOM
        preempt(running.pop()); retry
    schedule it; budget -= n

# 2) WAITING next — admit new work if budget + memory + seq-slots remain
while waiting and budget > 0 and len(running) < max_num_seqs and not preempted_this_step:
    request = waiting[0]
    computed_blocks, num_cached = get_computed_blocks(request)   # prefix cache
    request.num_computed_tokens = num_cached
    n = clamp(request.num_tokens - num_cached, budget, prefill_threshold)
    blocks = allocate_slots(request, n, computed_blocks)
    if blocks is None: break          # no memory to admit anyone else
    move request waiting → running; budget -= n

Two subtleties worth noting now (and seeing in the real code):

  • Running before waiting: progress for in-flight requests beats starting new ones (latency).
  • No admit after a preemption this step: if we just had to preempt, the system is under memory pressure — don't pour more in. (mini_vllm: and not out.preempted_req_ids; real: scheduler.py:545 if not preempted_reqs ...).

The invariants to memorize

  1. A request is always either in waiting or running (never both, never neither while unfinished).
  2. sum(num_scheduled_tokens) ≤ max_num_batched_tokens every step (the budget holds).
  3. len(running) ≤ max_num_seqs every step.
  4. A scheduled request emits a token this step iff num_computed_tokens + num_scheduled == num_tokens (prefill fully caught up). Mid-prefill chunks emit nothing.
  5. Preemption frees KV and resets num_computed_tokens = 0 (full recompute on re-admit).

What you'll do in this phase

  • Read: 01-deep-dive.md walks the real schedule() line by line.
  • Build: 02-mini-build.md — write the scheduler (reference: mini_vllm/scheduler.py).
  • Labs (see labs/README.md; recommended order 01 → 02 → 05 → 04 → 06 → 03):
    • lab-01-scheduler-step [CPU-OK] — implement the budget + running/waiting loop; pass the tests.
    • lab-02-chunked-prefill [CPU-OK] — prove chunking changes timing, not output, and predict step counts.
    • lab-03-prefix-cache-hitrate [GPU-OPT] — measure real prefix-cache hit rate and its memory effect.
    • lab-04-preemption [CPU-OK] — force a preemption and prove the request still completes correctly.
    • lab-05-decode-latency-spikes [CPU-OK] — measure the ITL spike a long prefill inflicts on a decode stream ([257, 2, 1, ...]) and how the chunk threshold caps it ([33,×8, ...]).
    • lab-06-prefix-cache-savings [CPU-OK] — account for prefix caching to the exact token (544 vs 96 scheduled tokens; savings ≡ followers × shared full blocks), outputs identical.
  • Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

When you can whiteboard schedule() and explain the budget, chunking, prefix head-start, and preemption handshake from memory, you understand the component that defines vLLM's throughput.

Phase 02 · Course home · Phase 04