Lab 03-04 — Preemption: Survive Memory Pressure `[CPU-OK]`

Every admission decision the scheduler makes is a bet: "this request will fit." Decode growth makes the bet probabilistic — every running request gets one token longer per step, and nobody knows when they'll stop. Preemption is what happens when the bets go bad: the scheduler evicts a running request mid-generation, frees its KV, and re-runs it later — and the user at the other end must never be able to tell. In this lab you'll force that emergency on purpose, in a pool sized so two requests cannot both finish, and prove the two halves of the contract: a preemption really occurs, and the preempted request's final output is token-for-token identical to what it would have produced with infinite memory.

Memory pressure costs time, never correctness. That's the sentence this lab turns from a slogan into a test.

Why this lab exists
Background: why overcommit at all
The mechanism, step by step
Files
Run
The setup that forces preemption
Why the output is identical
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

Preemption is the scheduler's least-exercised, most-critical path — the firmware of the fire extinguisher. It runs rarely (well-tuned deployments preempt little), which means bugs in it survive for months, and when it finally runs, it runs during the worst possible conditions: full memory, maximum load, every user watching. An engineer who has caused preemptions in a controlled pool, watched the counters reset, and verified the replayed output, debugs a production preemption storm from knowledge; everyone else debugs it from folklore.

There's also a design lesson here worth the price of the lab on its own: vLLM turns a potential correctness catastrophe (OOM mid-generation) into a pure performance event (recompute later). That transformation — push failures down the severity ladder, from "wrong answer" to "crash" to "slow" — is the signature move of robust systems design, and preemption is its cleanest specimen in this codebase.

Background: why overcommit at all

The timid alternative exists: admit a request only if prompt + max_tokens worth of blocks can be reserved up front. No preemption needed, ever. But you built lab Phase-2-02, so you can name the cost: that's max-reservation again through the back door — requests typically generate a fraction of max_tokens, so reserved-but-unused blocks strangle concurrency exactly the way contiguous allocation did. vLLM chooses to admit optimistically (reserve nothing beyond current need, let requests grow block by block) and handle the rare collision with preemption. More throughput every step, plus an occasional recompute tax, beats less throughput always. But optimism requires a safety valve — and the valve must preserve correctness, or the whole bargain is rotten. Hence this lab's two-sided test.

The mechanism, step by step

From mini_vllm/scheduler.py (the RUNNING phase's while True), mirroring upstream:

A running request needs a block for its next token; allocate_slots returns None — the pool is empty. This is the bad bet coming due.
The scheduler picks a victim: the last request in running — the most recently admitted (under FCFS, the lowest-priority / least-progressed; upstream with priority scheduling picks lowest-priority-then-latest). Note the same principle as lab-01: which end of which list you take from IS the policy.
_preempt(victim): free all the victim's blocks (back to the free queue — Phase 2 mechanics), reset num_computed_tokens = 0, keep output_token_ids (this is the crown jewel — see below), status → PREEMPTED, and push it on the front of the waiting queue (it has waited longest; it re-enters first).
Retry the allocation. Repeat — possibly preempting several — until it fits. Degenerate case: the victim is yourself; then you give up this step (you're now first in waiting, and you'll be re-admitted when memory frees).
No admissions on a step that preempted (not out.preempted_req_ids guards the WAITING phase) — re-admitting while evicting would thrash.

Files

starter.py — implement run(prompts, num_blocks, block_size, max_tokens): drive mini_vllm.LLMEngine with a given pool size and return each prompt's output token ids. Deliberately thin — the test design is the lab. Your work.
solution.py — reference.
test_lab.py — (1) cramped-pool outputs == roomy-pool outputs; (2) a direct scheduler test that a preemption actually occurs under pressure.

Run

LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-04-preemption -q
pytest phase-03-continuous-batching-scheduler/labs/lab-04-preemption -q   # reference

The setup that forces preemption

Arithmetic you can check on your fingers — pool of 5 blocks × 4 slots = 20 slots, minus the null block → 16 usable slots. Two requests, each 8 prompt tokens + up to 20 output: each needs 3 blocks just to get past its first decodes (tokens 9..12 spill into block 3). Both admit fine (2 blocks each = 4 of 4 blocks — the optimistic bet). Both decode a few tokens... then one needs its third block. Free blocks: zero. The scheduler preempts the most-recent admission, lets the survivor finish (its blocks free at completion — the reaping path from Phase 1), then re-admits the victim, which re-prefills from scratch and finishes too. Total cost: one extra prefill. Total damage to output: zero — that's the assertion.

The direct scheduler test stages the same squeeze without the engine: schedule once (both admitted), manually advance both requests past their prompts, schedule again — and assert out.preempted_req_ids is non-empty. One test proves the valve opens; the other proves nothing leaks when it does.

Why the output is identical

The victim's output_token_ids survive preemption; only num_computed_tokens resets. On re-admission, the request's token list is prompt + outputs_so_far, and the engine — with zero special-case code — simply sees a request whose counter is far behind and prefills the whole thing, generated tokens included. When the counter catches up, the next sample happens at the same (last_token, position) state as if nothing had happened, and the deterministic model continues identically. Induction does the rest.

Stop and admire the design economy: recovery is just prefill. The two-counters model makes "resume after eviction" indistinguishable from "admit a long prompt" — same code path as lab-02's chunking, same path as lab-06's cache hits. Three features, zero interaction code. When you design state machines, this is what to copy: make recovery a state the normal path already handles, not a parallel universe of special cases. (Real vLLM preserves correctness the same way; with prefix caching on, the recompute may even hit surviving cached blocks of its own prompt and skip most of the work.)

What the tests prove

Test	The half of the contract it pins
`test_cramped_pool_matches_roomy_pool`	Correctness: 5-block pool and 256-block pool produce identical token ids, and both requests reach their full `max_tokens`. The user cannot detect preemption from outputs.
`test_preemption_actually_happens_under_pressure`	Liveness of the test itself: a preemption really fires in this scenario. Without this, the first test could pass vacuously (pool accidentally roomy, nothing preempted, nothing proven). Pair every "X survives Y" test with a "Y actually happened" probe — untriggered safety tests are the unit-test equivalent of an unplugged smoke detector.

Hitchhiker's notes

Recompute vs swap. vLLM can alternatively swap a victim's KV to CPU RAM and copy it back later, instead of recomputing. The trade: recompute spends GPU FLOPs (cheap-ish, prefill is compute-efficient); swap spends PCIe bandwidth (~tens of GB/s against KV that can be GBs) and host RAM. V1 defaults to recompute — short-to-medium contexts re-prefill faster than they copy. Swap wins for very long contexts; that regime is where the disaggregated/offload designs of Phase 15 live.
Why the most-recently-admitted victim? Least progress lost (it has computed the least), and under FCFS it's the lowest-priority commitment. Preempting the oldest would maximize wasted work and starve the request closest to finishing — note that the survivor finishing is what frees memory. Victim selection isn't fairness aesthetics; it's part of the forward-progress argument.
The deadlock question (interviewers love it): what if no request can finish because none fits alone? Then preemption ping-pongs forever — A evicts B, B evicts A. Prevention is an admission-time invariant: a single max-length request must fit in the pool. That's exactly the startup check you met in Phase 1 lab-02 (the engine refusing max_model_len too big for its blocks) — the safety valve works only because another component made a promise. Cross-component invariants like this are what design docs are for.
Operationally, preemption is a smell, not a feature. Each one re-prefills a whole request — visible as a TTFT/ITL spike for the victim and burned throughput for everyone. vLLM logs a warning with a preemption counter (vllm:num_preemptions_total in metrics); a rate of preemptions means your pool is undersized for your workload: lower max_num_seqs, shorten max_model_len, raise gpu_memory_utilization, or buy HBM. The valve existing doesn't make leaning on it free.

Going further

Instrument num_preemptions per request (the field already exists) across a sweep of pool sizes from 5 to 50 blocks for a fixed 4-request workload. Plot total steps vs pool size — you'll get a hockey stick whose knee is "enough memory," the capacity-planning picture from the memory side.
Change the victim policy to oldest-first and rerun the suite. The correctness test still passes (replay is policy-independent — make sure you can say why), but count total steps: you've measured the cost of a bad policy with the safety net intact.
Add a swap mode to mini_vllm (stash the victim's per-token "KV" — here just its counter state — instead of resetting) and make the correctness test pass both modes. You'll discover the bookkeeping subtleties (what if the swapped request's blocks were shared via prefix cache?) that make real swap implementations hairy.

References

mini_vllm/scheduler.py::_preempt and the RUNNING phase's allocate-or-preempt loop — the dozen lines this lab is about.
upstream/vllm/v1/core/sched/scheduler.py — search preempt: same dance, plus priority-aware victim selection and the preemption-mode plumbing.
Kwon et al., PagedAttention (SOSP 2023), §4.5 — preemption via recompute vs swap, with measurements: https://arxiv.org/abs/2309.06180
vLLM docs, Optimization and Tuning — the official "reduce preemptions" guidance and the warning log you'll see in production: https://docs.vllm.ai/en/latest/configuration/optimization.html
Phase 2 lab-02 — the over-reservation waste that justifies optimistic admission in the first place.

vLLM Mastery — From Zero to Maintainer