Lab 15-03 — The Disaggregation Trade: Transfer Bills vs Interference Wins [CPU-OK]

Why run prefill and decode on different machines when colocated chunked prefill (Phase 3) already works? Because chunking only caps the interference — every decode step that shares a batch with a prefill chunk still pays for it (the [33, 33, …] profile from Phase 3 lab-05), and at scale that cap is your ITL p99. Disaggregation buys perfectly clean decode steps — and pays by shipping the prompt's KV across a wire, straight into TTFT. This lab prices both sides in five functions and lands the punchline numbers: a 2048-token prompt on an 8B is 256 MiB of freight — ~11 ms over an InfiniBand-class link (invisible inside a ~205 ms prefill) versus ~215 ms over 10 GbE (doubling TTFT). Same architecture, opposite verdicts, decided entirely by the wire.

Why this lab exists

Disaggregation is the most hyped serving architecture of the moment, which is exactly when an engineer needs the arithmetic most — to know when it's transformative (latency-SLO products with long prompts, fleets big enough to pool P and D capacity separately) and when it's cargo cult (short prompts, slow links, or workloads whose interference a tuned chunk threshold already handles). The five functions you'll write are the meeting-room version of the DistServe paper's argument, and the decision function's three test cases are the three deployments you'll actually encounter: heavy interference + fast link (split), heavy interference + slow link (the cure costs more than the disease), and negligible interference (why bother).

The deeper pattern — the course's economics-lab family (Phase 0 lab-02, Phase 8 lab-04, Phase 11 lab-03, Phase 10 lab-03) — closes here with its cleanest specimen: one latency line item moved from a per-token tax (interference on every decode step) to a per-request toll (transfer once into TTFT). Whether that's a good trade depends on tokens-per-request and the toll rate; everything else is detail.

Background: the two ledgers

What disaggregation buys: decode steps that never share a batch with prefill. Worst-case ITL drops from decode_step + chunk_time (Phase 3 lab-05's spike, capped but real) to a clean decode_step. For a 10 ms step under 25 ms chunks, that's a 3.5× p99 improvement (35 ms down to 10 ms), and each fleet can now be sized, scheduled, and even hardware-chosen for its own regime: prefill is compute-bound and decode is memory-bandwidth-bound (Phase 0 lab-04's split, finally given separate machines).
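The interference arithmetic above fits in a few lines. This is a minimal sketch, not the lab's reference solution: the signature of colocated_itl_worst is assumed, and itl_gain is a hypothetical helper added only for illustration.

```python
def colocated_itl_worst(decode_step_s: float, chunk_time_s: float) -> float:
    """Worst-case inter-token latency when a decode step shares its batch
    with a prefill chunk (assumed signature; the lab's may differ)."""
    return decode_step_s + chunk_time_s


def itl_gain(decode_step_s: float, chunk_time_s: float) -> float:
    """Hypothetical helper: colocated worst case divided by the clean step."""
    return colocated_itl_worst(decode_step_s, chunk_time_s) / decode_step_s


# The example from the text: a 10 ms decode step under 25 ms chunks.
gain = itl_gain(0.010, 0.025)
print(f"{gain:.1f}x")  # 3.5x
```

The gain depends only on the ratio of chunk time to decode step, so a smaller chunk threshold shrinks the win that disaggregation can claim.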

What it costs — the prompt's entire KV crosses a wire: prompt_tokens × kv_bytes_per_token (Phase 0 lab-02's 128 KiB/token for an 8B; 2.5× that for a 70B — test_payload_scales_with_model_not_just_prompt). The transfer lands in TTFT, and the right way to judge it is relative: transfer_time / prefill_time. Both scale ~linearly with prompt length, so the ratio is roughly constant per (model, link) — ~5% on a 200 Gb/s fabric (invisible), >100% on 10 GbE (the transfer outweighs the prefill it's delivering). That ratio is the single number that qualifies or disqualifies a cluster for P/D — compute it before the design review, not after the deployment.
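As a rough check of the freight and the qualifying ratio, the sketch below uses the per-token sizes and the approximate times quoted in this lab. The kv_payload_bytes signature is assumed, and ttft_penalty_fraction is a hypothetical name rather than the lab's own function.

```python
KIB = 1024
MIB = 1024 * KIB


def kv_payload_bytes(prompt_tokens: int, kv_bytes_per_token: int) -> int:
    """Bytes of KV cache that cross the wire for one request (assumed signature)."""
    return prompt_tokens * kv_bytes_per_token


def ttft_penalty_fraction(transfer_s: float, prefill_s: float) -> float:
    """Hypothetical helper: transfer time relative to the prefill it delivers."""
    return transfer_s / prefill_s


payload_8b = kv_payload_bytes(2048, 128 * KIB)
payload_70b = kv_payload_bytes(2048, 320 * KIB)  # 2.5x the 8B per-token figure
print(payload_8b // MIB, payload_70b // MIB)     # 256 640

# Approximate times from the text: ~11 ms and ~215 ms against a ~205 ms prefill.
print(f"{ttft_penalty_fraction(0.011, 0.205):.1%}")  # 5.4%
print(f"{ttft_penalty_fraction(0.215, 0.205):.1%}")  # 104.9%
```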

Mind the unit trap the tests enforce: links are quoted in gigabits; KV comes in bytes. The factor of 8 has embarrassed real capacity plans.
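The unit trap is easiest to see in code. This sketch assumes a transfer_seconds signature that takes the link speed in decimal gigabits per second (1 Gb/s = 1e9 bits/s) and ignores protocol overhead; the lab's actual signature may differ.

```python
def transfer_seconds(payload_bytes: int, link_gbps: float) -> float:
    """Time to ship payload_bytes over a link quoted in gigabits per second."""
    link_bytes_per_s = link_gbps * 1e9 / 8  # bits -> bytes: the factor of 8
    return payload_bytes / link_bytes_per_s


payload = 256 * 1024 * 1024  # 256 MiB, the 2048-token 8B prompt from this lab

fast = transfer_seconds(payload, 200.0)  # 200 Gb/s fabric
slow = transfer_seconds(payload, 10.0)   # 10 GbE
print(f"{fast * 1e3:.1f} ms, {slow * 1e3:.1f} ms")  # 10.7 ms, 214.7 ms

# The mistake: treating Gb/s as GB/s makes every link look 8x faster.
wrong = payload / (10.0 * 1e9)
print(f"{slow / wrong:.0f}x too optimistic")  # 8x too optimistic
```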

Files

  • starter.py: kv_payload_bytes, transfer_seconds, colocated_itl_worst, disagg_ttft_penalty, disagg_wins. Your work.
  • solution.py — reference.
  • test_lab.py — the freight, both link verdicts, the interference identity, the penalty fractions, the three-way decision, and the model-size scaling.

Run

LAB_IMPL=starter pytest phase-15-disaggregated-serving/labs/lab-03-disagg-economics -q
pytest phase-15-disaggregated-serving/labs/lab-03-disagg-economics -q   # reference

What the tests prove

| Test | What it pins |
| --- | --- |
| test_payload_is_real_freight | 256 MiB per 2048-token request: per request, every request. KV transfer is a bandwidth product, not a control message |
| test_link_speed_is_the_whole_story | ~11 ms vs ~215 ms for the same payload: the fabric is the feasibility condition, with the bits-vs-bytes factor of 8 enforced |
| test_interference_math_is_phase3_lab05 | The colocated worst case is literally that lab's spike, in seconds |
| test_ttft_penalty_fractions | <6% on fast fabric, >100% on 10 GbE: the qualifying ratio |
| test_the_decision_both_ways | All three real deployments: split / don't (slow link) / don't (no disease to cure). A decision function that can say "no" in two different ways is one you can trust |
| test_payload_scales_with_model_not_just_prompt | The 70B multiplier: bigger models raise the freight and (via slower prefill) the budget. Rerun the ratio per model, never reuse it |
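One way to picture a decision function that can say "no" in two different ways is a pair of thresholds. The sketch below is an assumption throughout: the signature, the threshold names, and the default values (a 1.5× minimum ITL gain and a 10% maximum TTFT penalty) are hypothetical and not taken from the lab's solution.

```python
def disagg_wins(
    decode_step_s: float,
    chunk_time_s: float,
    transfer_s: float,
    prefill_s: float,
    min_itl_gain: float = 1.5,       # hypothetical threshold
    max_ttft_penalty: float = 0.10,  # hypothetical threshold
) -> bool:
    """Split only if the interference is worth curing AND the wire is cheap."""
    itl_gain = (decode_step_s + chunk_time_s) / decode_step_s
    ttft_penalty = transfer_s / prefill_s
    return itl_gain >= min_itl_gain and ttft_penalty <= max_ttft_penalty


# Heavy interference + fast link: split.
print(disagg_wins(0.010, 0.025, 0.0107, 0.205))  # True
# Heavy interference + slow link: the cure costs more than the disease.
print(disagg_wins(0.010, 0.025, 0.2147, 0.205))  # False
# Negligible interference: nothing to cure.
print(disagg_wins(0.010, 0.001, 0.0107, 0.205))  # False
```

Each "no" fails a different threshold, so it is clear which ledger disqualified the deployment.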

Hitchhiker's notes

  • GQA/MLA shrink the freight too: kv_bytes_per_token is Phase 0 lab-02's formula, so every KV-compression technique (Phase 6's FP8-KV included) is also a disaggregation enabler. DeepSeek's MLA (a svelte ≈ 70 KiB/token) makes P/D dramatically cheaper to feed. Architecture choices propagate into deployment topology, which is the kind of cross-layer effect staff engineers are paid to notice.
  • Overlap hides part of the toll: real connectors stream KV layer-by-layer while prefill still computes later layers, so the visible TTFT penalty can be a fraction of your transfer_seconds. The model is an upper bound with a known bias — the most useful kind (Phase 8 lab-04's phrasing, still true).
  • The hidden third ledger is utilization: separate fleets can each run their regime's optimal batch shape (prefill: few huge batches; decode: many small steady ones) instead of compromising — DistServe's "goodput" argument, which can dominate both latency ledgers at scale. Your model prices latency; remember the throughput term exists before declaring a verdict from latency alone.
  • The degenerate fallback matters: when the link is slow or the prompt short, routing the request colocated (no migration) costs nothing — P/D systems are hybrid by construction (lab-01's output-invariance is what makes per-request routing safe). The decision function runs per request class, not per cluster.
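To make the first note concrete, here is a sketch of the standard per-token KV size for GQA/MHA attention. The model shapes are assumptions (Llama-3-style 8B and 70B configurations) chosen because they reproduce the 128 KiB and 2.5× figures used in this lab. MLA stores a compressed latent instead and is not covered by this formula.

```python
KIB = 1024


def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Per-token KV size: K and V (factor 2) for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed shapes: 32 layers / 8 KV heads / head_dim 128 for the 8B,
# 80 layers with the same heads for the 70B.
kv_8b = kv_bytes_per_token(32, 8, 128)
kv_70b = kv_bytes_per_token(80, 8, 128)
print(kv_8b // KIB, kv_70b // KIB, kv_70b / kv_8b)  # 128 320 2.5

# FP8 KV halves the freight; full MHA (32 KV heads) would quadruple it.
kv_8b_fp8 = kv_bytes_per_token(32, 8, 128, dtype_bytes=1)
kv_8b_mha = kv_bytes_per_token(32, 32, 128)
print(kv_8b_fp8 // KIB, kv_8b_mha // KIB)  # 64 512
```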

Going further

  • Add overlap: effective_transfer(transfer_s, prefill_s, overlap_fraction) and find the overlap that makes 25 GbE viable for 2048-token prompts. You've priced what connector engineering is worth (compare Phase 10 lab-03's same move for all-reduce).
  • Sweep prompt length 128 → 32k and plot both ledgers: the freight grows with prompt length, but the penalty ratio stays flat while the ITL win grows (bigger chunks to dodge). Long-context workloads are disaggregation's home turf; the plot shows why in one figure.
  • Add the queueing term: P-fleet utilization → prefill queue wait → TTFT. At high load, disaggregation's pooling effect (any P serves any D) cuts queue waits — the goodput argument made visible with an M/M/1 sketch.
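As a starting point for the overlap exercise, one possible model treats overlap_fraction as the share of the transfer hidden behind prefill compute. The model, the required_overlap helper, and the 10% penalty budget are all assumptions for illustration, not the lab's definition of "viable".

```python
def effective_transfer(
    transfer_s: float, prefill_s: float, overlap_fraction: float
) -> float:
    """Visible transfer time after hiding part of it behind prefill.
    The hidden part can never exceed the prefill it overlaps with."""
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be in [0, 1]")
    hidden = min(overlap_fraction * transfer_s, prefill_s)
    return transfer_s - hidden


def required_overlap(transfer_s: float, prefill_s: float, budget: float) -> float:
    """Hypothetical helper: smallest overlap that keeps the visible
    transfer within budget * prefill_s (clamped to [0, 1])."""
    return min(1.0, max(0.0, 1.0 - budget * prefill_s / transfer_s))


payload_bits = 256 * 1024 * 1024 * 8   # 2048-token prompt on the 8B
transfer_25g = payload_bits / 25e9     # 25 GbE, decimal gigabits
prefill_s = 0.205                      # ~205 ms prefill from the text

needed = required_overlap(transfer_25g, prefill_s, budget=0.10)
print(f"{transfer_25g * 1e3:.1f} ms raw, overlap needed: {needed:.2f}")
# 85.9 ms raw, overlap needed: 0.76
```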

References

  • Zhong et al., DistServe (OSDI 2024) — the goodput argument and the interference/transfer trade formalized: https://arxiv.org/abs/2401.09670
  • Patel et al., Splitwise (2024) — the same split from the hardware-heterogeneity angle: https://arxiv.org/abs/2311.18677
  • upstream/vllm/distributed/kv_transfer/ — where the freight actually ships (lab-01's bookkeeping + transport).
  • Phase 3 lab-05 — the interference this architecture deletes; Phase 0 labs 02/04 — the per-token bytes and the regime split that make both ledgers computable.