Lab 15-01 — KV Handoff: Move a Live Request Between Engines [CPU-OK]

Everything in this course so far assumed a request lives and dies on one engine. Disaggregated serving breaks that assumption on purpose: engine P (tuned for compute-hungry prefill) processes the prompt, then the request — its token state and its computed-KV claim — migrates to engine D (tuned for bandwidth-hungry decode), which continues as if nothing happened. This lab implements the migration's bookkeeping on two mini_vllm engines: export_request (snapshot + release the source), import_request (resurrect + claim KV blocks at the destination), and the proof that justifies the whole architecture — the migrated request's final output is token-for-token identical to never having moved. Plus the two operational truths migrations live with: the source must come back clean (every block freed), and the destination must pay the KV bill up front — loudly failing if it can't.

Why this lab exists

The deep observation behind this lab, and behind Phase 3 lab-04's preemption before it, is that a request's entire transferable identity is small and explicit: prompt ids, output ids, num_computed_tokens, sampling params, and (the only heavy part) the KV those counters claim. Preemption exploited that by discarding the KV and recomputing it; handoff exploits it by shipping the KV and skipping the recompute. Same state machine, two recovery strategies. Your import_request is structurally Scheduler-admission code (allocate, set counters, mark RUNNING), because migration is admission with prepaid compute. Once you see migration this way, the production machinery (KV connectors, NIXL, multi-engine routing: Phase 15's deep dive) reads as transport details around bookkeeping you've already written twice.
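
To make "small and explicit" concrete, here is a minimal sketch of what a handoff payload could look like. The class name RequestSnapshot and its field layout are hypothetical and are not mini_vllm's actual types; only the meaning of the fields comes from the paragraph above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RequestSnapshot:
    """Hypothetical handoff payload; not mini_vllm's real type."""

    request_id: str
    prompt_token_ids: tuple[int, ...]
    output_token_ids: tuple[int, ...]
    num_computed_tokens: int  # how many tokens the snapshot claims valid KV for
    sampling_params: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        total = len(self.prompt_token_ids) + len(self.output_token_ids)
        # The KV claim can never cover more tokens than actually exist.
        if not 0 <= self.num_computed_tokens <= total:
            raise ValueError("num_computed_tokens outside [0, total tokens]")


# After one step on P: 4 prompt tokens have KV, and the first output token
# has been sampled but its own KV has not been computed yet.
snap = RequestSnapshot("req-0", (11, 12, 13, 14), (21,), num_computed_tokens=4)
```

Everything except the KV itself fits in this one object, which is why the lab can treat the transfer as metadata.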

The identical-output proof matters operationally, not just aesthetically: P/D deployments route some requests through the split path and others not (short prompts often stay colocated). If migration changed outputs, the same request would answer differently depending on an infrastructure routing decision, which is an unacceptable, undebuggable property. The test suite guards against it: any divergence between the migrated output and the single-engine output fails a test.

Background: what actually moves

The honest accounting of a migration, in order:

  1. Export: snapshot the token state (cheap — a few hundred ints) and the num_computed_tokens claim; remove the request from the source's schedule; free its blocks (the source owes it nothing — test_source_engine_is_clean_after_export pins usage back to 0.0, because a leak here, times thousands of migrations, is an OOM with a delay).
  2. Transfer: in real systems, the KV tensors themselves cross the wire; lab-03 prices this (256 MiB for a 2048-token prompt on an 8B model; the freight is the whole economics). In mini_vllm, the toy model never reads KV values, so the transfer carries metadata only, which is precisely why the lab can isolate bookkeeping correctness from the transport.
  3. Import: allocate destination blocks for the computed tokens (Phase 2's ceil-div bill, paid in D's pool — test_destination_pays_the_kv_bill counts it exactly), set the counter, mark RUNNING, join the schedule. The connector would now fill those blocks with the shipped tensors; decoding resumes either way.
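
The two numbers in the steps above (the transfer freight and the destination's block bill) can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumed 8B-class configuration with grouped-query attention, and the block size of 16 is likewise an assumption. Neither is taken from lab-03 or from mini_vllm.

```python
import math


def kv_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
             head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache for num_tokens tokens (one K and one V per layer)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens


def blocks_needed(num_computed_tokens: int, block_size: int) -> int:
    """Ceil-div block bill the destination pays at import."""
    return math.ceil(num_computed_tokens / block_size)


# Assumed 8B-class shape: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
freight = kv_bytes(2048, num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
print(freight / 2**20)                      # 256.0 (MiB)
print(blocks_needed(2048, block_size=16))   # 128
print(blocks_needed(17, block_size=16))     # 2: a partial block still costs a whole one
```

Under these assumptions each token costs 128 KiB of KV, so the 2048-token prompt ships 256 MiB while the snapshot metadata stays in the kilobyte range.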

Note what makes step 3 legal without recomputation, in contrast to preemption's reset-to-zero: the claim "these num_computed_tokens tokens have valid KV" is now backed by the transfer rather than by local compute. The counter doesn't care who paid — which is the two-counters model (Phase 1) earning its keep one more time.
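
The export, import, and identical-continuation story can be sketched end to end on a toy. ToyEngine below is a hypothetical stand-in written for this illustration, not mini_vllm's engine or the lab's reference solution: its "sampler" is a deterministic function of the token list, and its KV blocks hold no data.

```python
import math


class ToyEngine:
    """Hypothetical engine: a block pool plus one-token-per-step decoding."""

    def __init__(self, num_blocks: int, block_size: int = 4) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.requests = {}  # request_id -> state dict

    def _grow(self, state: dict, num_tokens: int) -> None:
        need = math.ceil(num_tokens / self.block_size) - len(state["blocks"])
        if need > len(self.free_blocks):  # check before taking anything
            raise MemoryError("not enough free KV blocks")
        for _ in range(need):
            state["blocks"].append(self.free_blocks.pop())

    def add_request(self, rid: str, prompt: list) -> None:
        self.requests[rid] = {"tokens": list(prompt), "computed": 0, "blocks": []}

    def step(self, rid: str) -> None:
        state = self.requests[rid]
        tokens = state["tokens"]
        self._grow(state, len(tokens))
        state["computed"] = len(tokens)     # KV now covers every existing token
        tokens.append(sum(tokens) % 101)    # deterministic toy "sampler"

    def export_request(self, rid: str) -> dict:
        state = self.requests.pop(rid)                # leave the schedule
        self.free_blocks.extend(state.pop("blocks"))  # source frees every block
        return state                                  # metadata only

    def import_request(self, rid: str, snapshot: dict) -> None:
        state = dict(snapshot, tokens=list(snapshot["tokens"]), blocks=[])
        self._grow(state, state["computed"])  # pay the KV bill up front, or raise
        self.requests[rid] = state            # join the schedule only on success


def run(engine: ToyEngine, rid: str, steps: int) -> list:
    for _ in range(steps):
        engine.step(rid)
    return list(engine.requests[rid]["tokens"])


prompt = [3, 1, 4, 1, 5]

solo = ToyEngine(num_blocks=8)
solo.add_request("r", prompt)
baseline = run(solo, "r", 6)

p, d = ToyEngine(num_blocks=8), ToyEngine(num_blocks=8)
p.add_request("r", prompt)
p.step("r")                                # prefill + first token on P
snapshot = p.export_request("r")
d.import_request("r", snapshot)
blocks_after_import = len(d.requests["r"]["blocks"])
migrated = run(d, "r", 5)                  # remaining five steps on D

assert migrated == baseline                # output-invisible migration
assert len(p.free_blocks) == 8             # source came back clean
assert blocks_after_import == 2            # ceil(5 / 4) blocks claimed at D
```

The import never touches computed beyond copying it, which is the point of the note above: the counter is trusted because the transfer backs it, not because D ran the prefill.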

Files

  • starter.py — export_request, import_request, run_to_completion. Your work.
  • solution.py — reference.
  • test_lab.py — identical continuation (post-prefill and mid-decode), source cleanliness, the destination's block bill, and the loud-OOM import.

Run

LAB_IMPL=starter pytest phase-15-disaggregated-serving/labs/lab-01-kv-handoff -q
pytest phase-15-disaggregated-serving/labs/lab-01-kv-handoff -q   # reference

What the tests prove

| Test | What it pins |
|---|---|
| test_handoff_after_prefill_continues_identically | The canonical P/D split (one step = prefill + first token, then migrate): final output ≡ single-engine. Routing decisions must be output-invisible |
| test_handoff_mid_decode_also_works | Migration is general, not prefill-special: any consistent (tokens, counter) snapshot moves. The mechanism that also underlies decode-to-decode rebalancing |
| test_source_engine_is_clean_after_export | Usage back to 0.0: the anti-leak invariant. Migration without cleanup is a slow-motion OOM |
| test_destination_pays_the_kv_bill | ceil(computed/block_size) blocks claimed at D; capacity planning for decode fleets must budget imported KV, not just locally-grown KV |
| test_destination_oom_is_loud | A destination that can't hold the transfer fails at import, not mid-decode: the admission check a router relies on when picking D instances |

Hitchhiker's notes

  • Map to upstream: the KV connector interface (upstream/vllm/distributed/kv_transfer/kv_connector/v1/) is your export/import with tensors attached — get_num_new_matched_tokens (what can the destination receive?), the worker-side send/recv of block contents, and scheduler hooks that overlap transfer with compute. Connectors ship for shared storage (LMCache), point-to-point (NIXL/P2P), and more — transport varies, your bookkeeping shape doesn't.
  • The real subtlety production adds is asynchrony: D starts allocating and even scheduling while KV is still in flight, so attention must not read blocks the transfer hasn't filled yet. That is a readiness-tracking problem your synchronous lab dodges on purpose. When you read connector code, most of its complexity is exactly this fence; the synchronous core underneath is this lab.
  • Block identity does not survive migration — P's block 47 becomes whatever D's pool hands out; only the logical token order matters, and the block table rebuild is free because tables are per-engine metadata (Phase 2). Anyone who tries to ship block ids instead of block contents has misunderstood the indirection — a surprisingly common design-review catch.
  • Prefix caching composes: if D already holds cached blocks for the prompt's prefix (another request warmed it), the transfer can skip those — connectors literally consult get_computed_blocks to shrink the freight. Phase 2 lab-05's machinery, now saving network bytes instead of FLOPs.
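
The readiness fence mentioned in the asynchrony note can be pictured as a small tracker that the scheduler consults before letting attention touch an imported request. This is a minimal sketch with hypothetical names (ReadinessTracker, begin_transfer, mark_filled, is_schedulable); it does not mirror the upstream connector interface.

```python
class ReadinessTracker:
    """Hypothetical fence: a request is schedulable only once all its blocks are filled."""

    def __init__(self) -> None:
        self._pending = {}  # request_id -> set of block ids still in flight

    def begin_transfer(self, request_id: str, block_ids: list) -> None:
        # Blocks are allocated at D immediately, but their contents are not there yet.
        self._pending[request_id] = set(block_ids)

    def mark_filled(self, request_id: str, block_id: int) -> None:
        # Called as each block's KV arrives; arrival order does not matter.
        self._pending[request_id].discard(block_id)

    def is_schedulable(self, request_id: str) -> bool:
        # A request with no registered transfer is treated as purely local.
        return not self._pending.get(request_id)


tracker = ReadinessTracker()
tracker.begin_transfer("r", [7, 2])
before = tracker.is_schedulable("r")   # False: both blocks still in flight
tracker.mark_filled("r", 2)
partial = tracker.is_schedulable("r")  # False: block 7 not filled yet
tracker.mark_filled("r", 7)
after = tracker.is_schedulable("r")    # True: safe for attention to read
```

The synchronous lab is the degenerate case in which every block counts as filled the moment import_request returns.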

Going further

  • Make the import prefix-cache-aware: enable caching in D, pre-warm it with the same prompt, and extend import_request to claim cached blocks first (via get_computed_blocks) and allocate only the remainder — measure the freight saved. You've implemented the connector's matched-tokens optimization.
  • Build a tiny router: N decode engines, route each import to the one with the most free blocks; assert no import ever OOMs under a workload where round-robin would. Phase 11 lab-04's admission thinking, fleet edition.
  • Simulate the failure path: export, "lose" the payload, and re-run the request from scratch on D — preemption-style recompute as the fallback when transfer fails. Note that correctness needs nothing new: the request's identity is still just tokens. (This is why P/D systems can degrade gracefully to colocated.)
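
As a starting point for the router exercise, here is a minimal sketch of the two placement policies over a list of per-engine free-block counts. The function names and the workload are hypothetical; a real version would query each decode engine's pool rather than a plain list.

```python
import math


def blocks_needed(num_computed_tokens: int, block_size: int) -> int:
    return math.ceil(num_computed_tokens / block_size)


def route_most_free(free_blocks: list, need: int):
    """Index of the engine with the most free blocks, or None if even it is too small."""
    best = max(range(len(free_blocks)), key=lambda i: free_blocks[i])
    return best if free_blocks[best] >= need else None


def route_round_robin(turn: int, free_blocks: list, need: int):
    """Index chosen purely by arrival order, or None if that engine cannot hold it."""
    idx = turn % len(free_blocks)
    return idx if free_blocks[idx] >= need else None


free = [2, 6, 3]      # free blocks on three decode engines
placements = []
for computed in (20, 8, 4):            # computed tokens of each incoming import
    need = blocks_needed(computed, block_size=4)
    idx = route_most_free(free, need)
    if idx is None:
        raise MemoryError("no decode engine can hold this import")
    free[idx] -= need                  # the destination pays the KV bill
    placements.append(idx)

print(placements)                             # [1, 2, 0]
print(route_round_robin(0, [2, 6, 3], 5))     # None: engine 0 has only 2 free blocks
```

On this workload the most-free policy places all three imports, while round-robin would send the first one (5 blocks) to an engine with 2 free blocks and fail at import.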

References

  • upstream/vllm/distributed/kv_transfer/kv_connector/v1/ — the connector interface and implementations (NIXL, shared-storage, multi-connector).
  • vLLM docs, Disaggregated Prefilling — the deployment shape this lab's bookkeeping serves: https://docs.vllm.ai/en/latest/features/disagg_prefill/
  • Zhong et al., DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving (OSDI 2024) — the why (lab-03 prices it): https://arxiv.org/abs/2401.09670
  • Phase 3 lab-04 — the discard-and-recompute sibling of this lab's ship-and-continue; Phase 1 — the two counters that make both legal.