Lab 15-03 — The Disaggregation Trade: Transfer Bills vs Interference Wins [CPU-OK]
Why run prefill and decode on different machines when colocated chunked prefill
(Phase 3) already works? Because chunking only caps the interference — every decode
step that shares a batch with a prefill chunk still pays for it (the [33, 33, …]
profile from Phase 3 lab-05), and at scale that cap is your ITL p99.
Disaggregation buys perfectly clean decode steps — and pays by shipping the prompt's
KV across a wire, straight into TTFT. This lab prices both sides in five functions
and lands the punchline numbers: a 2048-token prompt on an 8B is 256 MiB of
freight — ~11 ms over an InfiniBand-class link (invisible inside a ~205 ms prefill)
versus ~215 ms over 10 GbE (doubling TTFT). Same architecture, opposite verdicts,
decided entirely by the wire.
Contents
- Why this lab exists
- Background: the two ledgers
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Disaggregation is the most hyped serving architecture of the moment, which is exactly when an engineer needs the arithmetic most — to know when it's transformative (latency-SLO products with long prompts, fleets big enough to pool P and D capacity separately) and when it's cargo cult (short prompts, slow links, or workloads whose interference a tuned chunk threshold already handles). The five functions you'll write are the meeting-room version of the DistServe paper's argument, and the decision function's three test cases are the three deployments you'll actually encounter: heavy interference + fast link (split), heavy interference + slow link (the cure costs more than the disease), and negligible interference (why bother).
The deeper pattern — the course's economics-lab family (Phase 0 lab-02, Phase 8 lab-04, Phase 11 lab-03, Phase 10 lab-03) — closes here with its cleanest specimen: one latency line item moved from a per-token tax (interference on every decode step) to a per-request toll (transfer once into TTFT). Whether that's a good trade depends on tokens-per-request and the toll rate; everything else is detail.
Background: the two ledgers
What disaggregation buys — decode steps that never share a batch with prefill:
worst-case ITL drops from decode_step + chunk_time (Phase 3 lab-05's spike, capped
but real) to decode_step, clean. For a 10 ms step under 25 ms chunks, that's a
3.5× p99 improvement — and each fleet can now be sized, scheduled, and even
hardware-chosen for its own regime (prefill is compute-bound, decode
bandwidth-bound — Phase 0 lab-04's split, finally given separate machines).
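The benefit side of the ledger can be checked in a few lines. This is a minimal sketch: the function name comes from the lab's file list, but its signature here is an assumption, and the 10 ms and 25 ms figures are the ones quoted above.

```python
def colocated_itl_worst(decode_step_s: float, chunk_time_s: float) -> float:
    """Worst-case inter-token latency when a decode step shares its batch
    with one prefill chunk (assumed signature, not the reference solution)."""
    return decode_step_s + chunk_time_s


decode_step_s = 0.010   # 10 ms clean decode step
chunk_time_s = 0.025    # 25 ms prefill chunk riding in the same batch

colocated = colocated_itl_worst(decode_step_s, chunk_time_s)
disaggregated = decode_step_s          # decode never shares a batch with prefill
improvement = colocated / disaggregated

print(f"colocated worst-case ITL: {colocated * 1e3:.0f} ms")
print(f"disaggregated ITL:        {disaggregated * 1e3:.0f} ms")
print(f"worst-case improvement:   {improvement:.1f}x")
```

The ratio depends only on chunk_time / decode_step, so a smaller chunk budget shrinks the win before any wire is involved.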
What it costs — the prompt's entire KV crosses a wire: prompt_tokens × kv_bytes_per_token (Phase 0 lab-02's 128 KiB/token for an 8B; 2.5× that for a 70B —
test_payload_scales_with_model_not_just_prompt). The transfer lands in TTFT, and
the right way to judge it is relative: transfer_time / prefill_time. Both scale
~linearly with prompt length, so the ratio is roughly constant per (model, link) —
~5% on a 200 Gb/s fabric (invisible), >100% on 10 GbE (the transfer outweighs the
prefill it's delivering). That ratio is the single number that qualifies or
disqualifies a cluster for P/D — compute it before the design review, not after the
deployment.
Mind the unit trap the tests enforce: links are quoted in gigabits; KV comes in bytes. The factor of 8 has embarrassed real capacity plans.
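The cost side, including the bits-versus-bytes conversion, can be sketched as follows. Function names follow the lab's file list, but the signatures are assumptions. The 200 Gb/s link, the 10 GbE link, and the ~205 ms prefill are the figures quoted in this README.

```python
KIB = 1024
MIB = 1024 ** 2


def kv_payload_bytes(prompt_tokens: int, kv_bytes_per_token: int) -> int:
    """Bytes of KV cache that must cross the wire for one request."""
    return prompt_tokens * kv_bytes_per_token


def transfer_seconds(payload_bytes: int, link_gbps: float) -> float:
    """Links are quoted in gigabits per second; payloads are in bytes.
    The factor of 8 is the unit trap."""
    return payload_bytes * 8 / (link_gbps * 1e9)


def disagg_ttft_penalty(transfer_s: float, prefill_s: float) -> float:
    """Transfer time as a fraction of the prefill it follows."""
    return transfer_s / prefill_s


payload = kv_payload_bytes(2048, 128 * KIB)   # 8B-class model, 2048-token prompt
fast = transfer_seconds(payload, 200.0)       # InfiniBand-class fabric
slow = transfer_seconds(payload, 10.0)        # 10 GbE
prefill_s = 0.205

print(f"payload: {payload // MIB} MiB")
print(f"200 Gb/s: {fast * 1e3:.1f} ms, penalty {disagg_ttft_penalty(fast, prefill_s):.0%}")
print(f"10 GbE:   {slow * 1e3:.1f} ms, penalty {disagg_ttft_penalty(slow, prefill_s):.0%}")
```

Dropping the `* 8` makes the 10 GbE link look eight times faster than it is, which is enough to flip the verdict.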
Files
- starter.py: kv_payload_bytes, transfer_seconds, colocated_itl_worst, disagg_ttft_penalty, disagg_wins. Your work.
- solution.py: reference.
- test_lab.py: the freight, both link verdicts, the interference identity, the penalty fractions, the three-way decision, and the model-size scaling.
Run
LAB_IMPL=starter pytest phase-15-disaggregated-serving/labs/lab-03-disagg-economics -q
pytest phase-15-disaggregated-serving/labs/lab-03-disagg-economics -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_payload_is_real_freight | 256 MiB per 2048-token request: per request, every request. KV transfer is a bandwidth product, not a control message |
| test_link_speed_is_the_whole_story | ~11 ms vs ~215 ms for the same payload: the fabric is the feasibility condition, with the bits-vs-bytes factor of 8 enforced |
| test_interference_math_is_phase3_lab05 | The colocated worst case is literally that lab's spike, in seconds |
| test_ttft_penalty_fractions | <6% on fast fabric, >100% on 10 GbE: the qualifying ratio |
| test_the_decision_both_ways | All three real deployments: split / don't (slow link) / don't (no disease to cure). A decision function that can say "no" in two different ways is one you can trust |
| test_payload_scales_with_model_not_just_prompt | The 70B multiplier: bigger models raise the freight and (via slower prefill) the budget; rerun the ratio per model, never reuse it |
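One way to picture a decision function that can say "no" in two different ways is sketched below. Everything beyond the name disagg_wins is an assumption: the signature, the two thresholds (min_itl_gain, max_ttft_penalty), and the 1 ms chunk time used for the "negligible interference" case are hypothetical illustration values. The reference solution may use a different rule.

```python
def disagg_wins(
    decode_step_s: float,
    chunk_time_s: float,
    transfer_s: float,
    prefill_s: float,
    min_itl_gain: float = 1.5,       # hypothetical: the ITL win must be at least this big
    max_ttft_penalty: float = 0.25,  # hypothetical: the transfer may add at most 25% to prefill
) -> bool:
    """Split only if there is a disease to cure AND the cure is affordable."""
    itl_gain = (decode_step_s + chunk_time_s) / decode_step_s
    ttft_penalty = transfer_s / prefill_s
    return itl_gain >= min_itl_gain and ttft_penalty <= max_ttft_penalty


# Heavy interference + fast link: split.
split = disagg_wins(0.010, 0.025, transfer_s=0.0107, prefill_s=0.205)
# Heavy interference + slow link: the cure costs more than the disease.
slow_link = disagg_wins(0.010, 0.025, transfer_s=0.215, prefill_s=0.205)
# Negligible interference: nothing worth paying a transfer for.
no_disease = disagg_wins(0.010, 0.001, transfer_s=0.0107, prefill_s=0.205)

print(split, slow_link, no_disease)
```

The two conditions are deliberately independent, so each "no" can be traced to the ledger that caused it.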
Hitchhiker's notes
- GQA/MLA shrink the freight too: kv_bytes_per_token is Phase 0 lab-02's formula, so every KV-compression technique (Phase 6's FP8-KV included) is also a disaggregation enabler. DeepSeek's MLA (a svelte ≈ 70 KiB/token) makes P/D dramatically cheaper to feed. Architecture choices propagate into deployment topology, which is the kind of cross-layer effect staff engineers are paid to notice.
- Overlap hides part of the toll: real connectors stream KV layer-by-layer while prefill still computes later layers, so the visible TTFT penalty can be a fraction of your transfer_seconds. The model is an upper bound with a known bias, the most useful kind (Phase 8 lab-04's phrasing, still true).
- The hidden third ledger is utilization: separate fleets can each run their regime's optimal batch shape (prefill: few huge batches; decode: many small steady ones) instead of compromising. This is DistServe's "goodput" argument, which can dominate both latency ledgers at scale. Your model prices latency; remember the throughput term exists before declaring a verdict from latency alone.
- The degenerate fallback matters: when the link is slow or the prompt short, routing the request colocated (no migration) costs nothing — P/D systems are hybrid by construction (lab-01's output-invariance is what makes per-request routing safe). The decision function runs per request class, not per cluster.
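The first note above rests on how kv_bytes_per_token is built. A minimal sketch of the standard grouped-query-attention arithmetic follows. The layer and head counts are assumptions (Llama-3-style shapes for the 8B and 70B classes), not values taken from this lab; they happen to reproduce the 128 KiB/token and 2.5× figures quoted earlier.

```python
KIB = 1024


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Each layer stores one K and one V vector per KV head for every token."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed shapes: 32 layers / 8 KV heads / head_dim 128 for an 8B-class model,
# 80 layers / 8 KV heads / head_dim 128 for a 70B-class model.
small = kv_bytes_per_token(32, 8, 128)                 # FP16/BF16 KV cache
large = kv_bytes_per_token(80, 8, 128)
small_fp8 = kv_bytes_per_token(32, 8, 128, dtype_bytes=1)

print(f"8B-class:     {small // KIB} KiB/token")
print(f"70B-class:    {large // KIB} KiB/token ({large / small:.1f}x)")
print(f"8B-class FP8: {small_fp8 // KIB} KiB/token")
```

Every factor in the product is a lever on the freight: fewer KV heads (GQA), fewer bytes per value (FP8-KV), or a compressed latent instead of full K and V (MLA, which needs a different formula than this one).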
Going further
- Add overlap: implement effective_transfer(transfer_s, prefill_s, overlap_fraction) and find the overlap that makes 25 GbE viable for 2048-token prompts. You've priced what connector engineering is worth (compare Phase 10 lab-03's same move for all-reduce).
- Sweep prompt length 128 → 32k and plot both ledgers: the interference win grows with prompt length (bigger chunks to dodge) and so does the freight, but the penalty ratio stays flat while the ITL win keeps growing. Long-context workloads are disaggregation's home turf; the plot shows why in one figure.
- Add the queueing term: P-fleet utilization → prefill queue wait → TTFT. At high load, disaggregation's pooling effect (any P serves any D) cuts queue waits — the goodput argument made visible with an M/M/1 sketch.
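As a starting point for the overlap exercise, here is one possible model, offered as a sketch rather than the intended answer. The assumptions: a fixed fraction of the transfer hides behind prefill compute, the hidden part can never exceed the prefill time itself, and "viable" means a visible penalty of at most 10% of prefill. That 10% target is a hypothetical threshold chosen for illustration.

```python
def effective_transfer(transfer_s: float, prefill_s: float,
                       overlap_fraction: float) -> float:
    """Visible transfer time after hiding a fraction of it behind prefill.
    Assumed model: the hidden portion is capped by the prefill duration."""
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be in [0, 1]")
    hidden = min(overlap_fraction * transfer_s, prefill_s)
    return transfer_s - hidden


def required_overlap(transfer_s: float, prefill_s: float,
                     target_penalty: float) -> float:
    """Smallest overlap fraction that brings visible transfer / prefill down
    to target_penalty. Valid under the model above while the hidden part
    stays below prefill_s (true whenever transfer_s <= prefill_s)."""
    needed = 1.0 - target_penalty * prefill_s / transfer_s
    return min(max(needed, 0.0), 1.0)


payload_bytes = 2048 * 128 * 1024             # 256 MiB, as in the lab
transfer_25g = payload_bytes * 8 / 25e9       # 25 GbE, no overlap
prefill_s = 0.205

overlap = required_overlap(transfer_25g, prefill_s, target_penalty=0.10)
visible = effective_transfer(transfer_25g, prefill_s, overlap)

print(f"25 GbE raw transfer: {transfer_25g * 1e3:.1f} ms")
print(f"overlap needed for a 10% penalty: {overlap:.2f}")
print(f"visible transfer at that overlap: {visible * 1e3:.1f} ms")
```

Under these assumptions roughly three quarters of the transfer has to hide behind prefill; moving the target penalty moves that number, which is the point of the exercise.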
References
- Zhong et al., DistServe (OSDI 2024) — the goodput argument and the interference/transfer trade formalized: https://arxiv.org/abs/2401.09670
- Patel et al., Splitwise (2024) — the same split from the hardware-heterogeneity angle: https://arxiv.org/abs/2311.18677
- upstream/vllm/distributed/kv_transfer/: where the freight actually ships (lab-01's bookkeeping + transport).
- Phase 3 lab-05: the interference this architecture deletes. Phase 0 labs 02/04: the per-token bytes and the regime split that make both ledgers computable.