Phase 03 — Interview Questions: Continuous Batching & Scheduler
Throughput questions live here. Cover the answer, attempt out loud, then compare. This and Phase 02 are the two topics to own cold.
Q1. What is continuous batching and why is it the biggest throughput win in LLM serving?
Model answer
Static batching runs a fixed batch to completion, so the batch lasts as long as its longest request and every request that finishes early leaves its slot idle until then. Continuous batching re-decides the batch every single step (every token): the instant a request finishes, its slot is freed and a waiting request joins mid-flight. With mixed-length traffic (all real traffic) this keeps the GPU saturated continuously instead of idling on finished slots. It's purely a scheduling change (same kernels, same model), which is why it's such high leverage.
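To make the idle-slot argument concrete, here is a toy step count for both policies. It is a sketch under loud assumptions: each request costs exactly `length` decode steps, a step costs the same regardless of batch size, and prefill is ignored. The function names and the traffic mix are made up for illustration.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    """A freed slot is refilled from the queue before the next step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps per in-flight request
    steps = 0
    while waiting or running:
        # Refill free slots from the waiting queue.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One step: every running request produces one token;
        # requests that reach zero remaining steps leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

lengths = [2, 10, 3, 9, 2, 8, 3, 10]  # mixed short and long requests
static_steps = static_batching_steps(lengths, slots=2)
continuous_steps = continuous_batching_steps(lengths, slots=2)
print(static_steps, continuous_steps)
```

With this traffic and two slots, the static schedule needs 37 steps and the continuous one 27, purely because short requests stop holding a slot hostage.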
Q2. Explain the scheduler's core mental model.
Model answer
There's no "prefill phase" or "decode phase." Each request is just num_computed_tokens racing
to catch up to num_tokens. Every step the scheduler hands out tokens so requests close that
gap, under a global token budget. "Prefill" = far behind; "decode" = behind by one. This single
rule covers chunked prefill (hand out part of the gap), prefix caching (start with the gap
pre-closed), and speculative decoding (the gap includes draft tokens via num_tokens_with_spec)
— all with no special cases. It's the comment at the top of Scheduler.schedule
(scheduler.py:330).
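The gap-closing rule can be sketched in a few lines. This is a simplified illustration, not the real scheduler: `Req` and `schedule_step` are hypothetical names, and KV allocation, `max_num_seqs`, and preemption are all left out.

```python
from dataclasses import dataclass

@dataclass
class Req:
    name: str
    num_tokens: int               # prompt + tokens generated so far
    num_computed_tokens: int = 0  # tokens whose KV already exists

def schedule_step(requests, token_budget):
    """Hand out tokens to close each request's gap, under a shared budget."""
    scheduled = {}
    for req in requests:
        if token_budget == 0:
            break
        gap = req.num_tokens - req.num_computed_tokens
        num_new_tokens = min(gap, token_budget)
        if num_new_tokens > 0:
            scheduled[req.name] = num_new_tokens
            token_budget -= num_new_tokens
    return scheduled

reqs = [
    Req("decode", num_tokens=101, num_computed_tokens=100),  # behind by one
    Req("cached", num_tokens=500, num_computed_tokens=480),  # gap mostly pre-closed
    Req("prefill", num_tokens=4000),                         # far behind
]
plan = schedule_step(reqs, token_budget=256)
print(plan)
```

The same loop gives the decode its 1 token, the cache-hit request its remaining 20, and the long prefill whatever budget is left (235 here), without ever asking which "phase" a request is in.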
Q3. What is chunked prefill and what problem does it solve?
Model answer
A long prompt's prefill, done in one step, would monopolize the step and stall every in-flight
decode → inter-token-latency spikes for all current users. Chunked prefill splits the prefill
across multiple steps under the per-step token budget (max_num_batched_tokens), so each step
mixes a slice of the big prefill with ongoing decodes. It trades a bit of prefill throughput
(more steps) for much better decode latency under load. Knob:
long_prefill_token_threshold + the budget (scheduler.py:390).
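A rough sketch of how one long prompt gets sliced across steps while decodes keep running. The helper `prefill_chunks` is hypothetical, and it assumes the decodes are scheduled first at one token each and that the number of decodes stays constant; the two knob names are the ones from the answer above.

```python
def prefill_chunks(prompt_len, num_decodes, max_num_batched_tokens,
                   long_prefill_token_threshold=0):
    """Return the prefill slice scheduled in each step."""
    # Running decodes go first and each consumes one token of budget.
    budget = max_num_batched_tokens - num_decodes
    if budget <= 0:
        raise ValueError("decodes already consume the whole token budget")
    chunks, computed = [], 0
    while computed < prompt_len:
        chunk = prompt_len - computed
        # Optional per-request cap on how much of a long prefill runs per step.
        if 0 < long_prefill_token_threshold < chunk:
            chunk = long_prefill_token_threshold
        chunk = min(chunk, budget)
        chunks.append(chunk)
        computed += chunk
    return chunks

budget_only = prefill_chunks(1000, num_decodes=6, max_num_batched_tokens=256)
capped = prefill_chunks(1000, num_decodes=6, max_num_batched_tokens=256,
                        long_prefill_token_threshold=128)
print(budget_only)
print(capped)
```

With the budget alone the prompt takes four steps of 250 tokens; adding the threshold of 128 stretches it to eight smaller steps, which is the throughput-for-latency trade described above.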
Q4. How does prefix caching interact with the scheduler?
Model answer
When admitting a waiting request, the scheduler calls get_computed_blocks
(scheduler.py:591), which asks the KV manager how many leading tokens are already cached
(shared physical blocks from an earlier request with the same prefix). Those tokens count as
already computed, so the request starts with num_computed_tokens > 0 and only prefills the
unique remainder. For a shared system prompt across many users this is a massive
throughput/memory win and the structural advantage behind multi-tenant serving. It rides on
Phase 02's block sharing (touch + ref_cnt).
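One way to picture the cache lookup is a chain of block hashes, where each hash covers the block and everything before it. The sketch below is an assumption-heavy toy: `block_hashes`, `num_cached_tokens`, the SHA-256 chaining, and the block size of 4 are illustrative choices, not the KV manager's actual implementation.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (toy value)

def block_hashes(token_ids):
    """Chain-hash each full block so a hash identifies the whole prefix."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes

def num_cached_tokens(token_ids, cache):
    """Count leading tokens whose blocks are already cached."""
    hits = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break  # the first miss ends the shared prefix
        hits += 1
    return hits * BLOCK_SIZE

system_prompt = list(range(10))
# An earlier request populated the cache with its three full blocks.
cache = set(block_hashes(system_prompt + [100, 101]))

new_request = system_prompt + [200, 201, 202, 203, 204]
num_computed_tokens = num_cached_tokens(new_request, cache)
remaining_prefill = len(new_request) - num_computed_tokens
print(num_computed_tokens, remaining_prefill)
```

Only whole blocks can be shared, so the 10-token shared prompt yields 8 cached tokens here and the new request prefills the remaining 7.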
Q5. Walk me through what happens when a running request needs memory and there's none.
Model answer
allocate_slots returns None (Phase 02). The scheduler enters its preemption loop
(scheduler.py:443): it picks a victim — under FCFS self.running.pop() (most recently
admitted), under PRIORITY the worst (priority, arrival_time) — calls _preempt_request to
free that request's KV blocks and send it back to waiting (to be recomputed later), then retries
the allocation. If the only request left to preempt is the one we're trying to schedule, we give
up on it this step. This None → preempt → retry handshake is what lets vLLM admit aggressively
without OOM-crashing.
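The None → preempt → retry handshake can be sketched with a toy block counter. `ToyKV` and `schedule_running` are hypothetical stand-ins; only the FCFS victim rule (`running.pop()`) and the "give up if the victim is the request itself" case follow the answer above.

```python
from collections import deque

class ToyKV:
    """Counts free blocks; allocate_slots returns None when out of memory."""
    def __init__(self, num_blocks):
        self.free = num_blocks
        self.held = {}  # request id -> blocks held

    def allocate_slots(self, req_id, n):
        if n > self.free:
            return None
        self.free -= n
        self.held[req_id] = self.held.get(req_id, 0) + n
        return n

    def free_request(self, req_id):
        self.free += self.held.pop(req_id, 0)

def schedule_running(running, waiting, kv, blocks_needed):
    """Give each running request its blocks, preempting from the tail on failure."""
    scheduled, preempted = [], []
    i = 0
    while i < len(running):
        req = running[i]
        while True:
            if kv.allocate_slots(req, blocks_needed[req]) is not None:
                scheduled.append(req)
                i += 1
                break
            victim = running.pop()      # FCFS: most recently admitted
            kv.free_request(victim)     # its KV will be recomputed later
            waiting.appendleft(victim)  # back to the front of waiting
            preempted.append(victim)
            if victim == req:
                break                   # nothing else to evict; skip req this step
    return scheduled, preempted

kv = ToyKV(num_blocks=4)
for r in ("a", "b", "c"):
    kv.allocate_slots(r, 1)             # three requests hold one block each
running, waiting = ["a", "b", "c"], deque()
scheduled, preempted = schedule_running(
    running, waiting, kv, blocks_needed={"a": 1, "b": 1, "c": 1})
print(scheduled, preempted, list(waiting))
```

Here "a" takes the last free block, "b" hits None, "c" is evicted to make room, and the retry for "b" succeeds.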
Q6. Preemption: recompute vs swap. Tradeoff?
Model answer
On preemption you can either recompute the KV later (replay prompt+generated tokens through prefill) or swap the KV blocks out to CPU memory and copy them back on resume. Recompute spends GPU compute (cheap-ish thanks to efficient prefill, no extra memory traffic off-GPU); swap spends PCIe bandwidth and CPU memory but avoids recomputation. Recompute usually wins for short sequences; swap can win for very long KV where recompute would be expensive. Either way, output is identical — preemption costs time, not correctness.
Q7. Why admit no new requests in a step where you preempted?
Model answer
A preemption means you're already out of KV memory. Admitting more work in the same step would
immediately force more preemptions — thrashing. So the scheduler gates the waiting phase on "no
preemptions this step" (scheduler.py:545; mini_vllm: not out.preempted_req_ids). It lets
the system drain pressure before taking on more.
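A minimal sketch of that gate, assuming a hypothetical `admit_waiting` helper that only checks the sequence cap (the real waiting phase also checks the token budget and KV allocation):

```python
from collections import deque

def admit_waiting(waiting, running, preempted_req_ids, max_num_seqs):
    """Admit waiting requests only if this step preempted nobody."""
    if preempted_req_ids:
        return []  # memory is already tight; let the pressure drain first
    admitted = []
    while waiting and len(running) < max_num_seqs:
        req = waiting.popleft()
        running.append(req)
        admitted.append(req)
    return admitted

waiting, running = deque(["d", "e"]), ["a", "b"]
blocked = admit_waiting(waiting, running, preempted_req_ids={"c"}, max_num_seqs=4)
admitted = admit_waiting(waiting, running, preempted_req_ids=set(), max_num_seqs=4)
print(blocked, admitted)
```

The first call admits nothing because "c" was preempted in that step; the second, with no preemptions, fills the free sequence slots.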
Q8. (Deep) How does speculative decoding ride this same scheduler with no special case?
Model answer
A request's num_tokens_with_spec includes proposed draft tokens, so the same num_new_tokens = num_tokens_with_spec - num_computed_tokens clamp naturally schedules the draft tokens to be
verified, and num_lookahead_tokens reserves KV slots for them in allocate_slots. Acceptance/
rejection is handled in update_from_output. The scheduler doesn't know or care that it's spec
decode — it's just "tokens to compute," exactly as the top-of-function comment promised. (Full
treatment: Phase 08.)
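The arithmetic is easy to check by hand. In the sketch below, `Req`, `num_new_tokens`, and `apply_verification` are hypothetical helpers; the bookkeeping in `apply_verification` is an assumption about what acceptance and rejection have to achieve (keep accepted drafts, add one token sampled by the target model, discard KV progress for rejected drafts), not a copy of update_from_output.

```python
from dataclasses import dataclass, field

@dataclass
class Req:
    num_tokens: int
    num_computed_tokens: int
    spec_token_ids: list = field(default_factory=list)

    @property
    def num_tokens_with_spec(self):
        return self.num_tokens + len(self.spec_token_ids)

def num_new_tokens(req, token_budget):
    """Same clamp for prefill, decode, and spec decode."""
    return min(req.num_tokens_with_spec - req.num_computed_tokens, token_budget)

def apply_verification(req, num_scheduled, num_accepted):
    """Assumed bookkeeping after the target model verifies the drafts."""
    num_rejected = len(req.spec_token_ids) - num_accepted
    req.num_computed_tokens += num_scheduled - num_rejected
    req.num_tokens += num_accepted + 1  # accepted drafts + one sampled token
    req.spec_token_ids = []

plain = Req(num_tokens=101, num_computed_tokens=100)
spec = Req(num_tokens=101, num_computed_tokens=100, spec_token_ids=[7, 8, 9])

plain_tokens = num_new_tokens(plain, token_budget=256)
spec_tokens = num_new_tokens(spec, token_budget=256)

apply_verification(spec, num_scheduled=spec_tokens, num_accepted=2)
gap_after = spec.num_tokens - spec.num_computed_tokens
print(plain_tokens, spec_tokens, gap_after)
```

A plain decode schedules 1 token and the request with three drafts schedules 4; after two drafts are accepted the request is behind by one again, an ordinary decode as far as the scheduler can tell.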
Rapid-fire
- Two queues? waiting (deque/priority) and running (list).
- The per-step token cap? max_num_batched_tokens → token_budget.
- The concurrent-sequence cap? max_num_seqs → len(running) limit.
- Who's scheduled first each step? Running, then waiting.
- What does update_from_output do? Append sampled tokens, advance num_computed_tokens, reap finished requests (free KV).
- A request emits a token iff? num_computed_tokens + num_scheduled == num_tokens (prefill fully caught up).
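The token-emission condition from the last item can be traced through a chunked prefill. This is a small illustration with made-up numbers (a 600-token prompt under a 256-token budget); `emits_token` is a hypothetical helper name.

```python
def emits_token(num_computed_tokens, num_scheduled, num_tokens):
    """True only when this step closes the gap completely."""
    return num_computed_tokens + num_scheduled == num_tokens

prompt_len, budget = 600, 256
computed, trace = 0, []
while computed < prompt_len:
    num_scheduled = min(prompt_len - computed, budget)
    trace.append((num_scheduled, emits_token(computed, num_scheduled, prompt_len)))
    computed += num_scheduled
print(trace)
```

The first two chunks of 256 tokens emit nothing; only the final 88-token chunk, which catches the request up, produces the first output token.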