Phase 03 — Exercises: Continuous Batching & Scheduler

Escalating from "explain it" to "design it." Staff-level = the last ones cold, citing the exact upstream/ line.

Warm-up (explain)
Core (trace the code)
Build (extend your code)
Design (staff-level)
Self-grading

Warm-up (explain)

Restate, in your own words, the "no prefill/decode phase" idea. What two numbers on Request does the whole scheduler manipulate?
Why schedule RUNNING requests before admitting WAITING ones?
What three conditions must all hold to admit a waiting request? (guide §6 / deep-dive §2.)

Core (trace the code)

Walk the 4-line num_new_tokens clamp (scheduler.py:385–398). Name each of the caps and give a scenario where each one is the binding constraint.
Trace the while True preemption loop for: 3 running requests, 0 free blocks, FCFS policy. Who gets preempted, and what does _preempt_request do to them?
In mini_vllm, why does admission stop entirely (break) on the first allocate_slots == None, while the running phase retries after preempting? (Hint: different goals — make progress vs. don't over-admit.)

Build (extend your code)

Implement the PRIORITY policy in mini_vllm/scheduler.py: a priority on Request and a victim = max(running, key=lambda r:(r.priority, r.arrival_time)). Write a test where a high-priority late arrival preempts a low-priority running request.
Add a stats counter: total preemptions, average batch size, mean KV usage per step. Verify on lab-04's cramped run that preemptions > 0.
Implement swapping preemption: instead of num_computed_tokens = 0, move the request's blocks to a CPU list and restore on re-admit. Show output is still identical; discuss the cost difference vs recompute.

Design (staff-level)

A workload is 90% short chats (200-token prompts) and 10% long-doc summaries (16k prompts). Pick max_num_batched_tokens and long_prefill_token_threshold and justify with the latency impact on the short chats while the long prefills run.
You see frequent preemptions in production. List the three knobs you'd reach for (and the code/Phase each maps to) and the risk of each.
Continuous batching keeps the GPU full, but at very high concurrency latency degrades. Use Little's Law to explain the tradeoff and where you'd cap max_num_seqs.
Design admission control to prevent a request that can never fit (longer than the whole KV cache) from deadlocking the engine. What does real vLLM do? (Peek: FINISHED_IGNORED, check_enough_kv_cache_memory at kv_cache_utils.py:794.)

Self-grading

5, 10–13 are interview-grade. Could you whiteboard each in 5 minutes and name the file? If not, re-read the matching deep-dive section, then drill INTERVIEW.md.

vLLM Mastery — From Zero to Maintainer

Phase 03 — Exercises: Continuous Batching & Scheduler

Contents

Warm-up (explain)

Core (trace the code)

Build (extend your code)

Design (staff-level)

Self-grading

Keyboard shortcuts

vLLM Mastery — From Zero to Maintainer

Phase 03 — Exercises: Continuous Batching & Scheduler

Contents

Warm-up (explain)

Core (trace the code)

Build (extend your code)

Design (staff-level)

Self-grading