Lab 03-04 — Preemption: Survive Memory Pressure [CPU-OK]
Every admission decision the scheduler makes is a bet: "this request will fit." Decode growth makes the bet probabilistic — every running request gets one token longer per step, and nobody knows when they'll stop. Preemption is what happens when the bets go bad: the scheduler evicts a running request mid-generation, frees its KV, and re-runs it later — and the user at the other end must never be able to tell. In this lab you'll force that emergency on purpose, in a pool sized so two requests cannot both finish, and prove the two halves of the contract: a preemption really occurs, and the preempted request's final output is token-for-token identical to what it would have produced with infinite memory.
Memory pressure costs time, never correctness. That's the sentence this lab turns from a slogan into a test.
Contents
- Why this lab exists
- Background: why overcommit at all
- The mechanism, step by step
- Files
- Run
- The setup that forces preemption
- Why the output is identical
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Preemption is the scheduler's least-exercised, most-critical path — the firmware of the fire extinguisher. It runs rarely (well-tuned deployments preempt little), which means bugs in it survive for months, and when it finally runs, it runs during the worst possible conditions: full memory, maximum load, every user watching. An engineer who has caused preemptions in a controlled pool, watched the counters reset, and verified the replayed output, debugs a production preemption storm from knowledge; everyone else debugs it from folklore.
There's also a design lesson here worth the price of the lab on its own: vLLM turns a potential correctness catastrophe (OOM mid-generation) into a pure performance event (recompute later). That transformation — push failures down the severity ladder, from "wrong answer" to "crash" to "slow" — is the signature move of robust systems design, and preemption is its cleanest specimen in this codebase.
Background: why overcommit at all
The timid alternative exists: admit a request only if prompt + max_tokens worth of blocks
can be reserved up front. No preemption needed, ever. But you built lab Phase-2-02, so
you can name the cost: that's max-reservation again through the back door — requests
typically generate a fraction of max_tokens, so reserved-but-unused blocks strangle
concurrency exactly the way contiguous allocation did. vLLM chooses to admit
optimistically (reserve nothing beyond current need, let requests grow block by block)
and handle the rare collision with preemption. More throughput every step, plus an
occasional recompute tax, beats less throughput always. But optimism requires a safety
valve — and the valve must preserve correctness, or the whole bargain is rotten. Hence
this lab's two-sided test.
The mechanism, step by step
From mini_vllm/scheduler.py (the RUNNING phase's while True), mirroring upstream:
- A running request needs a block for its next token;
allocate_slotsreturnsNone— the pool is empty. This is the bad bet coming due. - The scheduler picks a victim: the last request in
running— the most recently admitted (under FCFS, the lowest-priority / least-progressed; upstream with priority scheduling picks lowest-priority-then-latest). Note the same principle as lab-01: which end of which list you take from IS the policy. _preempt(victim): free all the victim's blocks (back to the free queue — Phase 2 mechanics), resetnum_computed_tokens = 0, keepoutput_token_ids(this is the crown jewel — see below), status →PREEMPTED, and push it on the front of the waiting queue (it has waited longest; it re-enters first).- Retry the allocation. Repeat — possibly preempting several — until it fits. Degenerate
case: the victim is yourself; then you give up this step (you're now first in
waiting, and you'll be re-admitted when memory frees). - No admissions on a step that preempted (
not out.preempted_req_idsguards the WAITING phase) — re-admitting while evicting would thrash.
Files
starter.py— implementrun(prompts, num_blocks, block_size, max_tokens): drivemini_vllm.LLMEnginewith a given pool size and return each prompt's output token ids. Deliberately thin — the test design is the lab. Your work.solution.py— reference.test_lab.py— (1) cramped-pool outputs == roomy-pool outputs; (2) a direct scheduler test that a preemption actually occurs under pressure.
Run
LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-04-preemption -q
pytest phase-03-continuous-batching-scheduler/labs/lab-04-preemption -q # reference
The setup that forces preemption
Arithmetic you can check on your fingers — pool of 5 blocks × 4 slots = 20 slots, minus the null block → 16 usable slots. Two requests, each 8 prompt tokens + up to 20 output: each needs 3 blocks just to get past its first decodes (tokens 9..12 spill into block 3). Both admit fine (2 blocks each = 4 of 4 blocks — the optimistic bet). Both decode a few tokens... then one needs its third block. Free blocks: zero. The scheduler preempts the most-recent admission, lets the survivor finish (its blocks free at completion — the reaping path from Phase 1), then re-admits the victim, which re-prefills from scratch and finishes too. Total cost: one extra prefill. Total damage to output: zero — that's the assertion.
The direct scheduler test stages the same squeeze without the engine: schedule once (both
admitted), manually advance both requests past their prompts, schedule again — and assert
out.preempted_req_ids is non-empty. One test proves the valve opens; the other proves
nothing leaks when it does.
Why the output is identical
The victim's output_token_ids survive preemption; only num_computed_tokens resets. On
re-admission, the request's token list is prompt + outputs_so_far, and the engine — with
zero special-case code — simply sees a request whose counter is far behind and prefills
the whole thing, generated tokens included. When the counter catches up, the next
sample happens at the same (last_token, position) state as if nothing had happened, and
the deterministic model continues identically. Induction does the rest.
Stop and admire the design economy: recovery is just prefill. The two-counters model makes "resume after eviction" indistinguishable from "admit a long prompt" — same code path as lab-02's chunking, same path as lab-06's cache hits. Three features, zero interaction code. When you design state machines, this is what to copy: make recovery a state the normal path already handles, not a parallel universe of special cases. (Real vLLM preserves correctness the same way; with prefix caching on, the recompute may even hit surviving cached blocks of its own prompt and skip most of the work.)
What the tests prove
| Test | The half of the contract it pins |
|---|---|
test_cramped_pool_matches_roomy_pool | Correctness: 5-block pool and 256-block pool produce identical token ids, and both requests reach their full max_tokens. The user cannot detect preemption from outputs. |
test_preemption_actually_happens_under_pressure | Liveness of the test itself: a preemption really fires in this scenario. Without this, the first test could pass vacuously (pool accidentally roomy, nothing preempted, nothing proven). Pair every "X survives Y" test with a "Y actually happened" probe — untriggered safety tests are the unit-test equivalent of an unplugged smoke detector. |
Hitchhiker's notes
- Recompute vs swap. vLLM can alternatively swap a victim's KV to CPU RAM and copy it back later, instead of recomputing. The trade: recompute spends GPU FLOPs (cheap-ish, prefill is compute-efficient); swap spends PCIe bandwidth (~tens of GB/s against KV that can be GBs) and host RAM. V1 defaults to recompute — short-to-medium contexts re-prefill faster than they copy. Swap wins for very long contexts; that regime is where the disaggregated/offload designs of Phase 15 live.
- Why the most-recently-admitted victim? Least progress lost (it has computed the least), and under FCFS it's the lowest-priority commitment. Preempting the oldest would maximize wasted work and starve the request closest to finishing — note that the survivor finishing is what frees memory. Victim selection isn't fairness aesthetics; it's part of the forward-progress argument.
- The deadlock question (interviewers love it): what if no request can finish
because none fits alone? Then preemption ping-pongs forever — A evicts B, B evicts A.
Prevention is an admission-time invariant: a single max-length request must fit in
the pool. That's exactly the startup check you met in Phase 1 lab-02 (the engine
refusing
max_model_lentoo big for its blocks) — the safety valve works only because another component made a promise. Cross-component invariants like this are what design docs are for. - Operationally, preemption is a smell, not a feature. Each one re-prefills a whole
request — visible as a TTFT/ITL spike for the victim and burned throughput for everyone.
vLLM logs a warning with a preemption counter (
vllm:num_preemptions_totalin metrics); a rate of preemptions means your pool is undersized for your workload: lowermax_num_seqs, shortenmax_model_len, raisegpu_memory_utilization, or buy HBM. The valve existing doesn't make leaning on it free.
Going further
- Instrument
num_preemptionsper request (the field already exists) across a sweep of pool sizes from 5 to 50 blocks for a fixed 4-request workload. Plot total steps vs pool size — you'll get a hockey stick whose knee is "enough memory," the capacity-planning picture from the memory side. - Change the victim policy to oldest-first and rerun the suite. The correctness test still passes (replay is policy-independent — make sure you can say why), but count total steps: you've measured the cost of a bad policy with the safety net intact.
- Add a
swapmode tomini_vllm(stash the victim's per-token "KV" — here just its counter state — instead of resetting) and make the correctness test pass both modes. You'll discover the bookkeeping subtleties (what if the swapped request's blocks were shared via prefix cache?) that make real swap implementations hairy.
References
mini_vllm/scheduler.py::_preemptand the RUNNING phase's allocate-or-preempt loop — the dozen lines this lab is about.upstream/vllm/v1/core/sched/scheduler.py— searchpreempt: same dance, plus priority-aware victim selection and the preemption-mode plumbing.- Kwon et al., PagedAttention (SOSP 2023), §4.5 — preemption via recompute vs swap, with measurements: https://arxiv.org/abs/2309.06180
- vLLM docs, Optimization and Tuning — the official "reduce preemptions" guidance and the warning log you'll see in production: https://docs.vllm.ai/en/latest/configuration/optimization.html
- Phase 2 lab-02 — the over-reservation waste that justifies optimistic admission in the first place.