Phase 03 — Interview Questions: Continuous Batching & Scheduler

Throughput questions live here. Cover the answer, attempt out loud, then compare. This and Phase 02 are the two topics to own cold.

Q1. What is continuous batching and why is it the biggest throughput win in LLM serving?

Model answer

Static batching runs a fixed batch to completion, so every batch lasts as long as its longest request and finished requests leave their slots idle until the batch ends. Continuous batching re-decides the batch every step (every generated token): the instant a request finishes, its slot is freed and a waiting request joins mid-flight. With mixed-length traffic, which is what real traffic looks like, this keeps the GPU saturated instead of idling on finished slots. It's purely a scheduling change (same kernels, same model), which is why it's such high leverage.
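
To make the difference concrete, here is a toy step counter, not vLLM code. It assumes every request only decodes (one token per step) and that batch_size is the number of slots; both function names are made up for this sketch.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Re-decide the batch every step: a freed slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Fill free slots from the waiting queue before the step runs.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request produces one token,
        # and requests with nothing left drop out of the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

# Mixed-length traffic: output lengths in tokens (illustrative numbers).
lengths = [2, 100, 3, 100, 4, 5, 100, 6]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

The continuous version needs fewer steps for the same work because short requests no longer hold a slot hostage while a long neighbour finishes.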

Q2. Explain the scheduler's core mental model.

Model answer

There's no "prefill phase" or "decode phase." Each request is just num_computed_tokens racing to catch up to num_tokens. Every step the scheduler hands out tokens so requests close that gap, under a global token budget. "Prefill" = far behind; "decode" = behind by one. This single rule covers chunked prefill (hand out part of the gap), prefix caching (start with the gap pre-closed), and speculative decoding (the gap includes draft tokens via num_tokens_with_spec) — all with no special cases. It's the comment at the top of Scheduler.schedule (scheduler.py:330).
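
A minimal sketch of that one rule, assuming a stripped-down request with only the two counters. Req and schedule_step are hypothetical names for illustration; the real scheduler also handles KV allocation, preemption, and the waiting queue.

```python
from dataclasses import dataclass

@dataclass
class Req:
    req_id: str
    num_tokens: int                # prompt + tokens generated so far
    num_computed_tokens: int = 0   # tokens whose KV is already computed

def schedule_step(running, token_budget):
    """Hand out tokens so each request closes its gap, under one global budget."""
    scheduled = {}
    for req in running:
        if token_budget == 0:
            break
        gap = req.num_tokens - req.num_computed_tokens
        num_new_tokens = min(gap, token_budget)
        if num_new_tokens > 0:
            scheduled[req.req_id] = num_new_tokens
            token_budget -= num_new_tokens
    return scheduled

running = [
    Req("decode", num_tokens=50, num_computed_tokens=49),  # behind by one
    Req("prefill", num_tokens=1000),                       # far behind
]
step = schedule_step(running, token_budget=256)
print(step)  # {'decode': 1, 'prefill': 255}
```

The "decode" request gets its single token and the "prefill" request gets whatever budget remains, with no branch that distinguishes the two.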

Q3. What is chunked prefill and what problem does it solve?

Model answer

A long prompt's prefill, done in one step, would monopolize the step and stall every in-flight decode → inter-token-latency spikes for all current users. Chunked prefill splits the prefill across multiple steps under the per-step token budget (max_num_batched_tokens), so each step mixes a slice of the big prefill with ongoing decodes. It trades a bit of prefill throughput (more steps) for much better decode latency under load. Knob: long_prefill_token_threshold + the budget (scheduler.py:390).
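
A rough sketch of how a long prompt gets sliced. It assumes a constant number of decodes share each step and that the threshold, when set, caps the chunk before the budget does. prefill_chunks is a hypothetical helper; only max_num_batched_tokens and long_prefill_token_threshold come from the text above.

```python
def prefill_chunks(prompt_len, max_num_batched_tokens, num_decodes,
                   long_prefill_token_threshold=0):
    """Return the prefill chunk size scheduled in each step."""
    budget_left = max_num_batched_tokens - num_decodes  # decodes take 1 token each
    if budget_left <= 0:
        raise ValueError("no token budget left for prefill")
    chunks = []
    computed = 0
    while computed < prompt_len:
        chunk = prompt_len - computed
        # Optional cap on how much of one long prefill a single step may take.
        if 0 < long_prefill_token_threshold < chunk:
            chunk = long_prefill_token_threshold
        chunk = min(chunk, budget_left)
        chunks.append(chunk)
        computed += chunk
    return chunks

budget_only = prefill_chunks(5000, max_num_batched_tokens=2048, num_decodes=48)
with_threshold = prefill_chunks(5000, max_num_batched_tokens=2048, num_decodes=48,
                                long_prefill_token_threshold=512)
print(budget_only)          # [2000, 2000, 1000]
print(len(with_threshold))  # 10
```

A smaller threshold means more steps for the big prompt, but each step leaves more room for other requests.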

Q4. How does prefix caching interact with the scheduler?

Model answer

When admitting a waiting request, the scheduler calls get_computed_blocks (scheduler.py:591), which asks the KV manager how many leading tokens are already cached (shared physical blocks from an earlier request with the same prefix). Those tokens count as already computed, so the request starts with num_computed_tokens > 0 and only prefills the unique remainder. For a shared system prompt across many users this is a massive throughput/memory win and the structural advantage behind multi-tenant serving. It rides on Phase 02's block sharing (touch + ref_cnt).
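
The lookup can be pictured as a chain of block hashes. This is a sketch under assumptions, not the KV manager's code: block size 4, SHA-256 over the parent hash plus the block's tokens, and a plain set standing in for the cache. block_hashes and num_cached_tokens are illustrative names.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size, for illustration only

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes, parent = [], b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes

def num_cached_tokens(token_ids, cache, block_size=BLOCK_SIZE):
    """Count leading tokens covered by consecutive cached blocks."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break  # the first miss ends the shared prefix
        hits += 1
    return hits * block_size

system_prompt = list(range(8))                        # two full blocks
first_request = system_prompt + [100, 101, 102, 103]
cache = set(block_hashes(first_request))              # blocks left by an earlier request

new_request = system_prompt + [200, 201, 202, 203, 204]
cached = num_cached_tokens(new_request, cache)
print(cached)  # 8
```

The new request would start with num_computed_tokens = 8 and prefill only its unique tail.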

Q5. Walk me through what happens when a running request needs memory and there's none.

Model answer

allocate_slots returns None (Phase 02). The scheduler enters its preemption loop (scheduler.py:443): it picks a victim — under FCFS self.running.pop() (most recently admitted), under PRIORITY the worst (priority, arrival_time) — calls _preempt_request to free that request's KV blocks and send it back to waiting (to be recomputed later), then retries the allocation. If the only request left to preempt is the one we're trying to schedule, we give up on it this step. This None → preempt → retry handshake is what lets vLLM admit aggressively without OOM-crashing.
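
The handshake, sketched with a toy block pool. BlockPool and schedule_with_preemption are hypothetical stand-ins; only the FCFS victim rule (running.pop()) and the None return come from the answer above. In this sketch, a request that ends up as its own victim is simply returned to waiting, which is an assumption about the "give up" case.

```python
from collections import deque

class BlockPool:
    """Toy KV pool: tracks only how many blocks each request owns."""
    def __init__(self, num_free_blocks):
        self.num_free_blocks = num_free_blocks
        self.owned = {}

    def allocate_slots(self, req_id, num_blocks):
        if num_blocks > self.num_free_blocks:
            return None  # signal "no memory" instead of raising
        self.num_free_blocks -= num_blocks
        self.owned[req_id] = self.owned.get(req_id, 0) + num_blocks
        return num_blocks

    def free(self, req_id):
        self.num_free_blocks += self.owned.pop(req_id, 0)

def schedule_with_preemption(req_id, num_blocks, running, waiting, pool):
    """Retry allocation, preempting from the tail of running until it fits."""
    preempted = []
    while pool.allocate_slots(req_id, num_blocks) is None:
        victim = running.pop()       # FCFS: most recently admitted
        pool.free(victim)            # its KV will be recomputed later
        waiting.appendleft(victim)   # front of the queue, so it resumes first
        preempted.append(victim)
        if victim == req_id:
            return False, preempted  # nothing else to evict: give up this step
    return True, preempted

pool = BlockPool(num_free_blocks=10)
for rid, n in [("A", 4), ("B", 4), ("C", 2)]:
    pool.allocate_slots(rid, n)
running, waiting = ["A", "B", "C"], deque()
ok, preempted = schedule_with_preemption("A", 3, running, waiting, pool)
print(ok, preempted, list(waiting))  # True ['C', 'B'] ['B', 'C']
```

Pushing victims to the front of waiting keeps their original arrival order when several are evicted in one step.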

Q6. Preemption: recompute vs swap. Tradeoff?

Model answer

On preemption you can either recompute the KV later (replay prompt+generated tokens through prefill) or swap the KV blocks out to CPU memory and copy them back on resume. Recompute spends GPU compute (cheap-ish thanks to efficient prefill, no extra memory traffic off-GPU); swap spends PCIe bandwidth and CPU memory but avoids recomputation. Recompute usually wins for short sequences; swap can win for very long KV where recompute would be expensive. Either way, output is identical — preemption costs time, not correctness.

Q7. Why admit no new requests in a step where you preempted?

Model answer

A preemption means you're already out of KV memory. Admitting more work in the same step would immediately force more preemptions — thrashing. So the scheduler gates the waiting phase on "no preemptions this step" (scheduler.py:545; mini_vllm: not out.preempted_req_ids). It lets the system drain pressure before taking on more.
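
The gate itself is one condition. A minimal sketch, assuming admission is limited only by max_num_seqs (the real waiting phase also checks the token budget and KV allocation); admit_waiting is a made-up name.

```python
from collections import deque

def admit_waiting(waiting, running, preempted_req_ids, max_num_seqs):
    """Admit waiting requests only if nothing was preempted this step."""
    if preempted_req_ids:
        return []  # already out of KV memory: let pressure drain first
    admitted = []
    while waiting and len(running) < max_num_seqs:
        req_id = waiting.popleft()
        running.append(req_id)
        admitted.append(req_id)
    return admitted

# Step with a preemption: nothing is admitted, the queue is untouched.
blocked = admit_waiting(deque(["D", "E"]), ["A"], ["C"], max_num_seqs=4)
# Calm step: admit until the concurrent-sequence cap is hit.
calm_running = ["A"]
admitted = admit_waiting(deque(["D", "E", "F", "G"]), calm_running, [], max_num_seqs=4)
print(blocked, admitted)  # [] ['D', 'E', 'F']
```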

Q8. (Deep) How does speculative decoding ride this same scheduler with no special case?

Model answer

A request's num_tokens_with_spec includes proposed draft tokens, so the same num_new_tokens = num_tokens_with_spec - num_computed_tokens clamp naturally schedules the draft tokens to be verified, and num_lookahead_tokens reserves KV slots for them in allocate_slots. Acceptance/rejection is handled in update_from_output. The scheduler doesn't know or care that it's spec decode; it's just "tokens to compute," exactly as the top-of-function comment promised. (Full treatment: Phase 08.)
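
The arithmetic, as a sketch. It assumes num_tokens_with_spec is simply num_tokens plus the number of pending draft tokens; the function below is illustrative and mirrors only the clamp described above.

```python
def num_new_tokens(num_tokens, num_computed_tokens, num_draft_tokens, token_budget):
    """Tokens to schedule this step; draft tokens just widen the gap."""
    num_tokens_with_spec = num_tokens + num_draft_tokens
    return min(num_tokens_with_spec - num_computed_tokens, token_budget)

# Plain decode: behind by one.
plain = num_new_tokens(num_tokens=50, num_computed_tokens=49,
                       num_draft_tokens=0, token_budget=256)
# Same request with 3 proposed draft tokens: 1 real + 3 drafts to verify.
spec = num_new_tokens(num_tokens=50, num_computed_tokens=49,
                      num_draft_tokens=3, token_budget=256)
print(plain, spec)  # 1 4
```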

Rapid-fire

Two queues? waiting (a deque under FCFS, a priority queue under PRIORITY) and running (a list).
The per-step token cap? max_num_batched_tokens → token_budget.
The concurrent-sequence cap? max_num_seqs → len(running) limit.
Who's scheduled first each step? Running, then waiting.
What does update_from_output do? Append sampled tokens, advance num_computed_tokens, reap finished requests (free KV).
A request emits a token iff? num_computed_tokens + num_scheduled == num_tokens (prefill fully caught up).
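
The last rule, checked against a chunked prefill. A sketch with made-up numbers: a 5000-token prompt scheduled in three chunks samples a token only on the step that closes the gap. emits_token is a hypothetical helper.

```python
def emits_token(num_computed_tokens, num_scheduled, num_tokens):
    """A token is sampled only on the step where the request fully catches up."""
    return num_computed_tokens + num_scheduled == num_tokens

num_tokens = 5000
num_computed_tokens = 0
emitted = []
for num_scheduled in [2000, 2000, 1000]:  # chunk sizes across three steps
    emitted.append(emits_token(num_computed_tokens, num_scheduled, num_tokens))
    num_computed_tokens += num_scheduled
print(emitted)  # [False, False, True]
```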

vLLM Mastery — From Zero to Maintainer