Phase 15 Labs — Disaggregated Serving

Three labs on splitting the workload where Phase 10 split the model: prefill runs on machines built for compute, decode on machines built for bandwidth, and each request's KV cache is shipped between them. The arc: build the migration bookkeeping and prove it output-invisible (lab-01); price the trade, transfer toll against interference win, with verdicts that flip depending on the wire (lab-03); then assemble a real producer/consumer pair and watch p99 ITL collapse about 3× while TTFT pays its 10% (lab-02).

Recommended order: 01 → 03 → 02. CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). By default the tests run against the solution; set LAB_IMPL=starter to grade your own implementation.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-15-disaggregated-serving/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-15-disaggregated-serving/labs/lab-01-kv-handoff -q
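
The LAB_IMPL switch can be pictured with a small sketch. This is not the repository's actual test harness: `pick_impl` is a hypothetical helper, and the only facts taken from this page are the two module names and the default of running the solution.

```python
import os


def pick_impl(env=None):
    """Return the module name to test: 'solution' unless LAB_IMPL overrides it.

    Hypothetical helper; the real test_lab.py may select modules differently.
    """
    env = os.environ if env is None else env
    name = env.get("LAB_IMPL", "solution")
    if name not in ("starter", "solution"):
        raise ValueError(f"LAB_IMPL must be 'starter' or 'solution', got {name!r}")
    return name


# A test file could then load the chosen module with
# importlib.import_module(pick_impl()).
```

Rejecting unknown values up front keeps a typo in LAB_IMPL from silently grading the reference solution.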

Labs

lab-01-kv-handoff [CPU-OK]

Move a live request between two engines: export (snapshot the request and free the source, so usage returns to 0.0, the anti-leak invariant), import (claim destination blocks, failing loudly with OOM if the destination lacks them), and the proof that justifies the architecture: the migrated request's output is token-for-token identical to that of a request that never moved. Migration turns out to be admission with prepaid compute, the ship-instead-of-discard sibling of preemption. Skills: a request's transferable identity; the two recovery strategies; block identity doesn't survive (contents do); why routing must be output-invisible.
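
The export/import invariants above can be sketched with a toy block allocator. This is a minimal illustration, not the lab's reference solution: `Engine`, `OutOfBlocks`, and every method name are hypothetical, and a list of integers stands in for real KV tensors. Reading back identical contents is only a proxy for the token-for-token output check the lab performs.

```python
class OutOfBlocks(RuntimeError):
    """Raised when an engine cannot supply the blocks a request needs."""


class Engine:
    """Toy paged KV store (hypothetical; block_size=16 is an arbitrary choice)."""

    def __init__(self, num_blocks, block_size=16):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.free = list(range(num_blocks))  # unclaimed block ids
        self.tables = {}                     # request id -> ordered block ids
        self.store = {}                      # block id -> slice of KV entries

    def usage(self):
        return 1.0 - len(self.free) / self.num_blocks

    def admit(self, rid, kv):
        need = -(-len(kv) // self.block_size)  # ceiling division
        if need > len(self.free):
            # Check before claiming anything, so a failed import leaks nothing.
            raise OutOfBlocks(f"{rid}: need {need} blocks, {len(self.free)} free")
        blocks = [self.free.pop() for _ in range(need)]
        for i, b in enumerate(blocks):
            self.store[b] = kv[i * self.block_size:(i + 1) * self.block_size]
        self.tables[rid] = blocks

    def read(self, rid):
        return [x for b in self.tables[rid] for x in self.store[b]]

    def export(self, rid):
        """Snapshot the request, then free every source block."""
        kv = self.read(rid)
        for b in self.tables.pop(rid):
            del self.store[b]
            self.free.append(b)
        return {"rid": rid, "kv": kv}

    def import_(self, snap):
        """Claim destination blocks for the snapshot; raises OutOfBlocks if short."""
        self.admit(snap["rid"], snap["kv"])


src, dst = Engine(num_blocks=4), Engine(num_blocks=8)
kv = list(range(40))                  # stand-in for 40 tokens of KV
src.admit("r1", kv)
src_blocks = list(src.tables["r1"])   # block ids on the source
snap = src.export("r1")
dst.import_(snap)
```

After the handoff, `src.usage()` is back to 0.0, `dst.read("r1")` equals the original `kv`, and `dst.tables["r1"]` holds different block ids than `src_blocks`: the contents survive, the block identity does not.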

lab-02-pd-pair [GPU-OPT]

The real system: producer + consumer instances joined by a KV connector, a proxy running the max_tokens=1 handoff, and the two predicted signatures measured — TTFT +10% (the toll), ITL p99 38 → 12 ms (the interference, gone; p50 untouched, because interference was always a tail phenomenon). Annotated capture included. Skills: kv_role/connector configuration; both-sides-must-agree hazards; failure drills and graceful degradation; tails are what you're buying.
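
One way to read "the max_tokens=1 handoff" is the two-step flow sketched below. It assumes the proxy first sends a copy of the request capped at one token to the producer (so the producer runs prefill and its connector ships the KV), then forwards the original request to the consumer. `handoff`, `send_prefill`, and `send_decode` are hypothetical names for illustration, not part of any library, and the lab's proxy may differ in detail.

```python
import copy


def handoff(request, send_prefill, send_decode):
    """Route one request through a producer/consumer pair.

    send_prefill and send_decode are hypothetical callables that deliver a
    request dict to the producer and consumer instance respectively.
    """
    prefill_req = copy.deepcopy(request)
    prefill_req["max_tokens"] = 1   # prefill only: one token, then stop
    send_prefill(prefill_req)       # response discarded; the KV transfer is the point
    return send_decode(request)     # consumer decodes with the caller's original limits
```

Copying the request before overriding `max_tokens` keeps the caller's original limit intact for the decode side.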

lab-03-disagg-economics [CPU-OK]

The trade in five functions: 256 MiB of KV freight per 2048-token prompt on an 8B model, which costs ~11 ms on InfiniBand-class fabric (invisible) vs ~215 ms on 10 GbE (doubles TTFT), weighed against the interference win from Phase 3 lab-05's spike. The decision function returns one of three verdicts (yes, no-slow-link, no-no-disease), each pinned by a test. Skills: the penalty ratio as the qualifying number; bits-vs-bytes; per-token-tax → per-request-toll as a pattern; KV compression as a topology enabler.
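
The freight and transfer figures can be reproduced with a short calculation. The model shape is an assumption, since this page only says "8B": 32 layers, 8 KV heads, head dimension 128, and 2-byte (fp16/bf16) entries, which matches a Llama-3-8B-style configuration. The 200 Gbit/s rate is likewise an assumed stand-in for "InfiniBand-class". The function names are illustrative and are not the lab's five functions.

```python
def kv_bytes(tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache for a prompt; the leading 2 covers both K and V.

    Default shape is an assumption (Llama-3-8B-style), not taken from the lab.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens


def transfer_ms(n_bytes, link_gbit_per_s):
    """Ideal wire time in ms. Link rates are quoted in bits, payloads in bytes."""
    bytes_per_s = link_gbit_per_s * 1e9 / 8
    return n_bytes / bytes_per_s * 1e3


freight = kv_bytes(2048)
print(freight / 2**20)                      # 256.0 (MiB)
print(round(transfer_ms(freight, 200), 1))  # 10.7 (ms, assumed 200 Gbit/s fabric)
print(round(transfer_ms(freight, 10), 1))   # 214.7 (ms, 10 GbE)
```

The division by 8 is the bits-vs-bytes step: skipping it makes every link look eight times faster than it is. These are ideal line-rate times with no protocol overhead.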

What you can do after this phase

Decide, from your cluster's fabric and your traffic's prompt/decode shape, whether disaggregation pays — and say which metric it buys (p99 ITL) and which it taxes (TTFT) with numbers; implement and review KV-transfer bookkeeping with the invariants drilled here (source clean, destination billed, OOM loud, output invisible); and stand up, configure, and failure-drill a real P/D pair. Combined with Phase 10, you now hold both axes of scale-out: split the model, split the workload — Phase 18 teaches you to measure which one your bottleneck wants.