Lab 15-01 — KV Handoff: Move a Live Request Between Engines [CPU-OK]
Everything in this course so far assumed a request lives and dies on one engine.
Disaggregated serving breaks that assumption on purpose: engine P (tuned for
compute-hungry prefill) processes the prompt, then the request — its token state and
its computed-KV claim — migrates to engine D (tuned for bandwidth-hungry
decode), which continues as if nothing happened. This lab implements the migration's
bookkeeping on two mini_vllm engines: export_request (snapshot + release the
source), import_request (resurrect + claim KV blocks at the destination), and the
proof that justifies the whole architecture — the migrated request's final output is
token-for-token identical to never having moved. Plus the two operational truths
migrations live with: the source must come back clean (every block freed), and the
destination must pay the KV bill up front — loudly failing if it can't.
Contents
- Why this lab exists
- Background: what actually moves
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
The deep observation behind this lab — and behind Phase 3 lab-04's preemption before
it — is that a request's entire transferable identity is small and explicit: prompt
ids, output ids, num_computed_tokens, sampling params, and (the only heavy part)
the KV those counters claim. Preemption exploited that by discarding the KV and
recomputing; handoff exploits it by shipping the KV and skipping the recompute. Same state machine,
two recovery strategies — and your import_request is structurally
Scheduler-admission code (allocate, set counters, mark RUNNING), because migration
is admission with prepaid compute. Once you see migration this way, the production
machinery (KV connectors, NIXL, multi-engine routing — Phase 15's deep-dive) reads as
transport details around bookkeeping you've already written twice.
The identical-output proof matters operationally, not just aesthetically: P/D deployments route some requests through the split path and others not (short prompts often stay colocated). If migration changed outputs, the same request would answer differently depending on an infrastructure routing decision, an unacceptable and undebuggable property. The test suite pins the invariant for both the post-prefill and the mid-decode handoff, so a regression fails loudly.
Background: what actually moves
The honest accounting of a migration, in order:
1. Export: snapshot the token state (cheap: a few hundred ints) and the num_computed_tokens claim; remove the request from the source's schedule; free its blocks. The source owes it nothing: test_source_engine_is_clean_after_export pins usage back to 0.0, because a leak here, times thousands of migrations, is an OOM with a delay.
2. Transfer: in real systems, the KV tensors themselves cross the wire. lab-03 prices this (256 MiB for a 2048-token prompt on an 8B; the freight is the whole economics). In mini_vllm, the toy model never reads KV values, so the transfer carries metadata only, which is precisely why the lab can isolate the bookkeeping correctness from the transport.
3. Import: allocate destination blocks for the computed tokens (Phase 2's ceil-div bill, paid in D's pool; test_destination_pays_the_kv_bill counts it exactly), set the counter, mark RUNNING, join the schedule. The connector would now fill those blocks with the shipped tensors; decoding resumes either way.
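The 256 MiB figure in the Transfer step can be reproduced with back-of-envelope arithmetic. The sketch below assumes a Llama-3-8B-like shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128, 2-byte fp16/bf16 values). Those numbers are an assumption for illustration, not something this lab states; lab-03 remains the place where the freight is actually priced.

```python
def kv_bytes(num_tokens, num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache claimed by num_tokens tokens (K and V, all layers).

    Default shape is an ASSUMED Llama-3-8B-like config, not taken from the lab.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 = K and V
    return num_tokens * per_token


freight_mib = kv_bytes(2048) / 2**20
print(f"{freight_mib:.0f} MiB")  # 256 MiB under the assumed shape
```

Under that shape each token claims 128 KiB, so the freight grows linearly with prompt length.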
Note what makes step 3 legal without recomputation, in contrast to preemption's
reset-to-zero: the claim "these num_computed_tokens tokens have valid KV" is now
backed by the transfer rather than by local compute. The counter doesn't care who
paid — which is the two-counters model (Phase 1) earning its keep one more time.
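The export and import steps, and the counter's indifference to who paid, can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the lab's reference solution: ToyEngine, Snapshot, admit, and export are hypothetical names, the pool is a plain list of block ids, and no KV values exist at all.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Everything that moves in this toy: token state plus the KV claim."""
    prompt_ids: list
    output_ids: list
    num_computed_tokens: int


class ToyEngine:
    """Hypothetical stand-in for an engine; NOT the mini_vllm API."""

    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.running = {}  # request id -> (Snapshot, owned block ids)

    def usage(self):
        return 1.0 - len(self.free_blocks) / self.num_blocks

    def admit(self, req_id, snap):
        """Import: pay the ceil-div block bill up front, then join the schedule."""
        needed = -(-snap.num_computed_tokens // self.block_size)  # ceil-div
        if needed > len(self.free_blocks):
            # Fail loudly at import time, before any state changes.
            raise MemoryError(f"need {needed} blocks, {len(self.free_blocks)} free")
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.running[req_id] = (snap, blocks)

    def export(self, req_id):
        """Export: leave the schedule and free every block the request held."""
        snap, blocks = self.running.pop(req_id)
        self.free_blocks.extend(blocks)
        return snap


p = ToyEngine(num_blocks=8, block_size=4)
d = ToyEngine(num_blocks=8, block_size=4)
p.admit("r0", Snapshot([1, 2, 3, 4, 5], [9], num_computed_tokens=5))
snap = p.export("r0")
d.admit("r0", snap)
assert p.usage() == 0.0                 # source comes back clean
assert len(d.running["r0"][1]) == 2     # ceil(5 / 4) blocks claimed at D
```

Note that admit never asks where the computed tokens came from; the same call serves a locally prefilled request and an imported one.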
Files
- starter.py: export_request, import_request, run_to_completion. Your work.
- solution.py: reference.
- test_lab.py: identical continuation (post-prefill and mid-decode), source cleanliness, the destination's block bill, and the loud-OOM import.
Run
LAB_IMPL=starter pytest phase-15-disaggregated-serving/labs/lab-01-kv-handoff -q
pytest phase-15-disaggregated-serving/labs/lab-01-kv-handoff -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_handoff_after_prefill_continues_identically | The canonical P/D split (one step = prefill + first token, then migrate): final output ≡ single-engine. Routing decisions must be output-invisible |
| test_handoff_mid_decode_also_works | Migration is general, not prefill-special: any consistent (tokens, counter) snapshot moves. The mechanism that also underlies decode-to-decode rebalancing |
| test_source_engine_is_clean_after_export | Usage back to 0.0: the anti-leak invariant. Migration without cleanup is a slow-motion OOM |
| test_destination_pays_the_kv_bill | ceil(computed/block_size) blocks claimed at D. Capacity planning for decode fleets must budget imported KV, not just locally-grown KV |
| test_destination_oom_is_loud | A destination that can't hold the transfer fails at import, not mid-decode. This is the admission check a router relies on when picking D instances |
Hitchhiker's notes
- Map to upstream: the KV connector interface (upstream/vllm/distributed/kv_transfer/kv_connector/v1/) is your export/import with tensors attached: get_num_new_matched_tokens (what can the destination receive?), the worker-side send/recv of block contents, and scheduler hooks that overlap transfer with compute. Connectors ship for shared storage (LMCache), point-to-point (NIXL/P2P), and more; transport varies, your bookkeeping shape doesn't.
- The real subtlety production adds is asynchrony: D starts allocating and even scheduling while KV is still in flight, so attention must not read blocks the transfer hasn't filled. That is a readiness-tracking problem your synchronous lab dodges on purpose. When you read connector code, most of its complexity is exactly this fence; the synchronous core underneath is this lab.
- Block identity does not survive migration — P's block 47 becomes whatever D's pool hands out; only the logical token order matters, and the block table rebuild is free because tables are per-engine metadata (Phase 2). Anyone who tries to ship block ids instead of block contents has misunderstood the indirection — a surprisingly common design-review catch.
- Prefix caching composes: if D already holds cached blocks for the prompt's prefix (another request warmed it), the transfer can skip those; connectors literally consult get_computed_blocks to shrink the freight. Phase 2 lab-05's machinery, now saving network bytes instead of FLOPs.
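The block-identity note can be made concrete with a toy in which strings stand in for per-token KV tensors. This is a sketch under assumptions: gather and scatter are hypothetical helpers, each pool is a dict from block id to slots, and the block ids are arbitrary. The payload carries contents in logical token order and never mentions the source's block ids.

```python
BLOCK_SIZE = 4


def gather(pool, block_table, num_tokens):
    """Read per-token entries in logical order through the block table."""
    return [pool[block_table[i // BLOCK_SIZE]][i % BLOCK_SIZE] for i in range(num_tokens)]


def scatter(pool, block_table, entries):
    """Write per-token entries into the blocks named by the block table."""
    for i, entry in enumerate(entries):
        pool[block_table[i // BLOCK_SIZE]][i % BLOCK_SIZE] = entry


num_tokens = 10

# Source engine P: the request happens to own blocks 47, 3, 12.
p_pool = {b: [None] * BLOCK_SIZE for b in (47, 3, 12)}
p_table = [47, 3, 12]
scatter(p_pool, p_table, [f"kv{i}" for i in range(num_tokens)])

# The wire carries contents in logical order, not P's block ids.
payload = gather(p_pool, p_table, num_tokens)

# Destination D rebuilds the table from whatever its own pool hands out.
d_pool = {b: [None] * BLOCK_SIZE for b in (0, 1, 2)}
d_table = [2, 0, 1]
scatter(d_pool, d_table, payload)

assert d_table != p_table
assert gather(d_pool, d_table, num_tokens) == payload
```

Shipping p_table instead of payload would hand D three ids that mean nothing in its pool, which is the design-review catch described above.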
Going further
- Make the import prefix-cache-aware: enable caching in D, pre-warm it with the same prompt, and extend import_request to claim cached blocks first (via get_computed_blocks) and allocate only the remainder; measure the freight saved. You've implemented the connector's matched-tokens optimization.
- Build a tiny router: N decode engines, route each import to the one with the most free blocks; assert no import ever OOMs under a workload where round-robin would. Phase 11 lab-04's admission thinking, fleet edition.
- Simulate the failure path: export, "lose" the payload, and re-run the request from scratch on D — preemption-style recompute as the fallback when transfer fails. Note that correctness needs nothing new: the request's identity is still just tokens. (This is why P/D systems can degrade gracefully to colocated.)
References
- upstream/vllm/distributed/kv_transfer/kv_connector/v1/: the connector interface and implementations (NIXL, shared-storage, multi-connector).
- vLLM docs, Disaggregated Prefilling (the deployment shape this lab's bookkeeping serves): https://docs.vllm.ai/en/latest/features/disagg_prefill/
- Zhong et al., DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving (OSDI 2024), the why (lab-03 prices it): https://arxiv.org/abs/2401.09670
- Phase 3 lab-04 — the discard-and-recompute sibling of this lab's ship-and-continue; Phase 1 — the two counters that make both legal.