Phase 15 Labs — Disaggregated Serving
Three labs on splitting the workload where Phase 10 split the model: prefill on machines built for compute, decode on machines built for bandwidth, a request's KV shipped between them. The arc: build the migration bookkeeping and prove it output-invisible (lab-01), price the trade — transfer toll vs interference win, verdicts flipping with the wire (lab-03), then assemble a real producer/consumer pair and watch p99 ITL collapse 3× while TTFT pays its 10% (lab-02).
Recommended order: 01 → 03 → 02. CPU labs follow the standard contract:
starter.py (your work), solution.py (reference), test_lab.py (the spec). By default
the tests run the solution; LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-15-disaggregated-serving/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-15-disaggregated-serving/labs/lab-01-kv-handoff -q
Contents
- lab-01-kv-handoff [CPU-OK]
- lab-02-pd-pair [GPU-OPT]
- lab-03-disagg-economics [CPU-OK]
- What you can do after this phase
Labs
lab-01-kv-handoff [CPU-OK]
Move a live request between two engines: export (snapshot the request, then free the source so usage returns to 0.0, the anti-leak invariant), import (claim destination blocks, and raise a loud OOM if they don't exist), and the proof that justifies the architecture: the migrated request's output is token-for-token identical to that of a request that never moved. Migration turns out to be admission with prepaid compute, and the ship-instead-of-discard sibling of preemption. Skills: a request's transferable identity; the two recovery strategies; block identity doesn't survive (contents do); why routing must be output-invisible.
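The export/import invariants can be sketched with a toy block pool. This is a minimal illustration, not the lab's reference solution: `ToyEngine`, `export_kv`, `import_kv`, and `read` are hypothetical names, and block contents are stand-in strings rather than real KV tensors.

```python
class ToyEngine:
    """Hypothetical stand-in for an engine: a block pool plus per-request block tables."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.free = list(range(num_blocks))  # ids of unclaimed blocks
        self.tables = {}                     # request id -> block ids, in order
        self.store = {}                      # block id -> block contents

    def usage(self):
        """Fraction of the pool currently claimed."""
        return 1.0 - len(self.free) / self.num_blocks

    def admit(self, rid, contents):
        """Claim one block per item; fail loudly if the pool is too small."""
        if len(contents) > len(self.free):
            raise MemoryError(f"{rid}: need {len(contents)} blocks, {len(self.free)} free")
        ids = [self.free.pop() for _ in contents]
        for block_id, data in zip(ids, contents):
            self.store[block_id] = data
        self.tables[rid] = ids
        return ids

    def export_kv(self, rid):
        """Snapshot the request's contents, then free every source block."""
        ids = self.tables.pop(rid)
        snapshot = [self.store.pop(block_id) for block_id in ids]
        self.free.extend(ids)
        return snapshot

    def import_kv(self, rid, snapshot):
        """Import is admission with the contents already computed."""
        return self.admit(rid, snapshot)

    def read(self, rid):
        """Return the request's contents in logical order."""
        return [self.store[block_id] for block_id in self.tables[rid]]
```

In this sketch, exporting from a 4-block source and importing into an 8-block destination returns the source to usage 0.0 and preserves the contents in order, while the destination's block ids differ from the source's: contents survive, block identity does not.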
lab-02-pd-pair [GPU-OPT]
The real system: producer + consumer instances joined by a KV connector, a proxy
running the max_tokens=1 handoff, and the two predicted signatures measured —
TTFT +10% (the toll), ITL p99 38 → 12 ms (the interference, gone; p50 untouched,
because interference was always a tail phenomenon). Annotated capture included.
Skills: kv_role/connector configuration; both-sides-must-agree hazards; failure
drills and graceful degradation; tails are what you're buying.
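Why a tail fix leaves the median alone can be seen with a nearest-rank percentile. This is a small sketch on a made-up trace, not the lab's annotated capture: the `percentile` helper and both sample lists are illustrative assumptions, with 38 ms and 12 ms borrowed from the figures above only to keep the numbers familiar.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic inter-token latencies in ms (assumed, not measured):
# a colocated engine where 3 of 100 decode steps stall behind a prefill,
# and a disaggregated decode engine with no such stalls.
colocated = [12.0] * 97 + [38.0] * 3
disaggregated = [12.0] * 100

summary = {
    "colocated": (percentile(colocated, 50), percentile(colocated, 99)),
    "disaggregated": (percentile(disaggregated, 50), percentile(disaggregated, 99)),
}
```

On this synthetic trace the p50 is identical in both cases and only the p99 moves, which is the shape of result to look for in a real capture.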
lab-03-disagg-economics [CPU-OK]
The trade in five functions: 256 MiB of KV freight per 2048-token prompt on an 8B model, costing ~11 ms on InfiniBand-class fabric (invisible) vs ~215 ms on 10 GbE (doubles TTFT), weighed against the interference win from Phase 3 lab-05's spike. The decision function returns one of three verdicts (yes, no-slow-link, no-no-disease), each pinned by a test. Skills: the penalty ratio as the qualifying number; bits-vs-bytes; per-token-tax → per-request-toll as a pattern; KV compression as a topology enabler.
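The freight and toll figures above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not the lab's five functions: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is an assumed 8B-class configuration that happens to yield 256 MiB, 200 Gbit/s is an assumed figure for "InfiniBand-class", and `decide` with its thresholds and `tail_ratio` is a hypothetical stand-in for the lab's decision function and penalty ratio.

```python
MIB = 1024 ** 2


def kv_bytes(num_tokens, num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache size for one request; the leading 2 counts one K and one V per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens


def transfer_ms(num_bytes, link_gbit_per_s):
    """Wire time in ms. Links are quoted in bits per second, payloads in bytes."""
    bits = num_bytes * 8
    return bits / (link_gbit_per_s * 1e9) * 1e3


def decide(toll_ms, baseline_ttft_ms, itl_p99_ms, itl_p50_ms,
           max_toll_ratio=0.25, min_tail_ratio=2.0):
    """Hypothetical verdict logic; both thresholds are illustrative assumptions."""
    tail_ratio = itl_p99_ms / itl_p50_ms
    if tail_ratio < min_tail_ratio:
        return "no-no-disease"   # no interference to cure
    if toll_ms / baseline_ttft_ms > max_toll_ratio:
        return "no-slow-link"    # the transfer toll swamps TTFT
    return "yes"


freight = kv_bytes(2048)                # 268_435_456 bytes = 256 MiB
fast_ms = transfer_ms(freight, 200)     # about 10.7 ms at an assumed 200 Gbit/s
slow_ms = transfer_ms(freight, 10)      # about 214.7 ms on 10 GbE
```

Forgetting the factor of 8 in `transfer_ms` understates the 10 GbE toll by 8×, which is the bits-vs-bytes trap the lab names.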
What you can do after this phase
Decide, from your cluster's fabric and your traffic's prompt/decode shape, whether disaggregation pays — and say which metric it buys (p99 ITL) and which it taxes (TTFT) with numbers; implement and review KV-transfer bookkeeping with the invariants drilled here (source clean, destination billed, OOM loud, output invisible); and stand up, configure, and failure-drill a real P/D pair. Combined with Phase 10, you now hold both axes of scale-out: split the model, split the workload — Phase 18 teaches you to measure which one your bottleneck wants.