Phase 03 — Exercises: Continuous Batching & Scheduler
Escalating from "explain it" to "design it." Staff-level = the last ones cold, citing the exact
upstream/ line.
Contents
Warm-up (explain)
- Restate, in your own words, the "no prefill/decode phase" idea. What two numbers on
Requestdoes the whole scheduler manipulate? - Why schedule RUNNING requests before admitting WAITING ones?
- What three conditions must all hold to admit a waiting request? (guide §6 / deep-dive §2.)
Core (trace the code)
- Walk the 4-line
num_new_tokensclamp (scheduler.py:385–398). Name each of the caps and give a scenario where each one is the binding constraint. - Trace the
while Truepreemption loop for: 3 running requests, 0 free blocks, FCFS policy. Who gets preempted, and what does_preempt_requestdo to them? - In
mini_vllm, why does admission stop entirely (break) on the firstallocate_slots == None, while the running phase retries after preempting? (Hint: different goals — make progress vs. don't over-admit.)
Build (extend your code)
- Implement the PRIORITY policy in
mini_vllm/scheduler.py: apriorityonRequestand a victim =max(running, key=lambda r:(r.priority, r.arrival_time)). Write a test where a high-priority late arrival preempts a low-priority running request. - Add a stats counter: total preemptions, average batch size, mean KV usage per step. Verify on lab-04's cramped run that preemptions > 0.
- Implement swapping preemption: instead of
num_computed_tokens = 0, move the request's blocks to a CPU list and restore on re-admit. Show output is still identical; discuss the cost difference vs recompute.
Design (staff-level)
- A workload is 90% short chats (200-token prompts) and 10% long-doc summaries (16k prompts).
Pick
max_num_batched_tokensandlong_prefill_token_thresholdand justify with the latency impact on the short chats while the long prefills run. - You see frequent preemptions in production. List the three knobs you'd reach for (and the code/Phase each maps to) and the risk of each.
- Continuous batching keeps the GPU full, but at very high concurrency latency degrades.
Use Little's Law to explain the tradeoff and where you'd cap
max_num_seqs. - Design admission control to prevent a request that can never fit (longer than the whole
KV cache) from deadlocking the engine. What does real vLLM do? (Peek:
FINISHED_IGNORED,check_enough_kv_cache_memoryatkv_cache_utils.py:794.)
Self-grading
5, 10–13 are interview-grade. Could you whiteboard each in 5 minutes and name the file? If not, re-read the matching deep-dive section, then drill INTERVIEW.md.