Lab 10-03 — The TP Communication Bill [CPU-OK]
Lab-01 proved tensor parallelism is mathematically free — exact reconstruction, one all-reduce per block. This lab prices what "one all-reduce" costs physically, and the answer derives the most-quoted deployment rule in distributed inference from four multiplications: TP within a node, never across. Same model, same math, same code — on NVLink the communication is noise (<10% of a decode step), on 10 GbE it's fatal (>40%, latency alone). You'll also derive the subtler corollary most people miss: for decode, the bill is dominated by latency, not bandwidth — 64 tiny 8 KB messages per token — which is why fancy interconnect bandwidth numbers don't save cross-node TP and why prefill and decode want different links.
Contents
- Why this lab exists
- Background: what gets sent, how often, and how
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
"TP needs fast interconnect" is folklore until you can compute how fast, for which workload, and what happens if you ignore it. Those computations decide real money: whether a 70B model needs an NVLink-equipped node or can spread across two cheaper ones (it can't — not with TP; that's what PP is for, lab-04), whether TP=8 beats TP=4 for your latency target, whether a cloud's "high-bandwidth networking" claim is relevant (check the latency; for decode it usually matters more). This lab builds the five-function model that answers all of them on a napkin — the distributed sibling of Phase 0 lab-04's roofline, and like it, a model whose domain of validity you'll know because you built it.
It's also the quantitative half of a design story the phase tells in two parts: TP (this lab) pays communication per layer and demands fat links but splits every matrix; PP (lab-04) pays per stage boundary and tolerates thin links but idles GPUs in bubbles. Every real deployment of a big model is a negotiation between these two bills, and you're about to be able to compute both sides.
Background: what gets sent, how often, and how
What: after each RowParallelLinear (lab-01), every rank holds a partial sum of the activation; the all-reduce sums them. Payload = the activation tensor: batch_tokens × hidden × dtype_bytes. For one decode token of an 8B model (hidden 4096, 2-byte dtype): 1 × 4096 × 2 = 8 KB. Tiny. For a 2048-token prefill chunk: 2048 × 4096 × 2 = 16 MB. Not tiny. Same operation, a factor of 2048 (about three orders of magnitude) apart. Keep both numbers in mind; they split the analysis.
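The payload arithmetic is small enough to check directly. This is a minimal sketch, assuming hidden size 4096 and a 2-byte dtype as in the text. The function name matches the starter file, but the argument list is an assumption and may differ from the lab's.

```python
def allreduce_payload_bytes(batch_tokens: int, hidden: int, dtype_bytes: int = 2) -> int:
    """Bytes in the activation tensor that one all-reduce sums across ranks."""
    return batch_tokens * hidden * dtype_bytes


decode_payload = allreduce_payload_bytes(batch_tokens=1, hidden=4096)      # one decode token
prefill_payload = allreduce_payload_bytes(batch_tokens=2048, hidden=4096)  # one prefill chunk

print(decode_payload // 1024, "KiB")       # 8 KiB
print(prefill_payload // 1024**2, "MiB")   # 16 MiB
print(prefill_payload // decode_payload)   # 2048
```

The 2048× ratio is just the token count: hidden size and dtype cancel, so the gap between the decode and prefill regimes is set entirely by how many tokens ride in one step.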
How often: twice per layer (attention out-proj, MLP down-proj) × 32 layers = 64 all-reduces per step, every step, forever. Communication frequency is set by model depth, not by anything you can tune.
How: ring all-reduce, a reduce-scatter followed by an all-gather, in which each rank sends 2·(N−1)/N × payload in total across 2(N−1) sequential hops. That gives the time model t ≈ traffic / bandwidth + 2(N−1) × latency, and its two terms are the lab's two regimes: traffic / bandwidth dominates for big payloads (prefill), and 2(N−1) × latency dominates for small ones (decode). A 3 µs NVLink hop versus a 50 µs Ethernet hop is the ≈17× gap that, multiplied by 64 all-reduces, becomes the node boundary.
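The two-term cost model described above can be sketched as follows. The function names come from the starter file, but the signatures are assumptions. `latency_share` is a hypothetical helper added only for illustration. The link figures are the text's 10 GbE case (10 Gb/s, 50 µs per hop).

```python
def ring_allreduce_traffic_per_rank(payload_bytes: float, n_ranks: int) -> float:
    """Bytes each rank sends in a ring all-reduce: 2 * (N - 1) / N * payload."""
    if n_ranks < 2:
        return 0.0  # a single rank has nothing to exchange
    return 2 * (n_ranks - 1) / n_ranks * payload_bytes


def allreduce_time_s(payload_bytes: float, n_ranks: int,
                     bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Bandwidth term plus latency term for one ring all-reduce."""
    if n_ranks < 2:
        return 0.0
    bandwidth_term = ring_allreduce_traffic_per_rank(payload_bytes, n_ranks) / bandwidth_bytes_per_s
    latency_term = 2 * (n_ranks - 1) * latency_s
    return bandwidth_term + latency_term


def latency_share(payload_bytes: float, n_ranks: int,
                  bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Hypothetical helper: fraction of one all-reduce spent in the latency term."""
    total = allreduce_time_s(payload_bytes, n_ranks, bandwidth_bytes_per_s, latency_s)
    return 2 * (n_ranks - 1) * latency_s / total


ETH_BANDWIDTH = 10e9 / 8   # 10 Gb/s expressed in bytes per second
ETH_LATENCY = 50e-6        # per-hop latency used in the text

decode_share = latency_share(8 * 1024, 2, ETH_BANDWIDTH, ETH_LATENCY)
prefill_share = latency_share(16 * 1024**2, 2, ETH_BANDWIDTH, ETH_LATENCY)
print(f"decode:  {decode_share:.1%} of the all-reduce is latency")
print(f"prefill: {prefill_share:.1%} of the all-reduce is latency")
```

On these inputs the decode all-reduce is almost entirely latency and the prefill all-reduce is almost entirely bandwidth, which is the regime split the rest of the lab tests.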
Files
- starter.py — allreduce_payload_bytes, ring_allreduce_traffic_per_rank, allreduce_time_s, tp_comm_time_per_step, comm_fraction. Your work.
- solution.py — reference.
- test_lab.py — the formulas, the NVLink-vs-Ethernet verdict, both latency/bandwidth regimes, and the more-ranks-more-overhead direction.
Run
LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-03-tp-comm-cost -q
pytest phase-10-distributed-inference/labs/lab-03-tp-comm-cost -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_payload_is_one_activation_row_per_token | The 8 KB decode payload — memorize it; it's why decode TP is a latency problem |
| test_ring_traffic_formula | 2(N−1)/N: at N=2 each rank moves exactly one payload; as N grows it approaches 2× — traffic per rank is nearly constant in N (the ring's genius), it's the hop count that grows |
| test_decode_step_comm_on_nvlink_is_noise | 64 all-reduces on NVLink < 1 ms, < 10% of the 8 ms decode step (Phase 0 lab-04's number) — TP=2 in-node is nearly free |
| test_decode_step_comm_on_ethernet_is_fatal | Same step over 10 GbE: latency alone is 64 × 2 × 50 µs = 6.4 ms, > 40% of the step. The "never TP across nodes" rule, as an assert |
| test_latency_dominates_small_payloads / test_prefill_payloads_shift_the_balance_to_bandwidth | The regime split: for decode, halving latency beats doubling bandwidth; for prefill, the reverse. One model, two correct answers to "what should we buy?" |
| test_more_ranks_more_overhead | TP=8 > TP=2 in comm time: TP scaling is sub-linear by construction, before any software inefficiency |
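The verdicts in the table can be reproduced with a few lines. This is a sketch under stated assumptions, not the reference solution: the signatures are guesses, the NVLink bandwidth of 300 GB/s is an assumed figure (the text gives only the 3 µs latency), and `comm_fraction` is assumed to mean comm / (comm + compute) with the 8 ms decode compute time from the text.

```python
def allreduce_time_s(payload_bytes, n_ranks, bandwidth_bytes_per_s, latency_s):
    """Ring all-reduce: traffic / bandwidth + 2 * (N - 1) * latency."""
    if n_ranks < 2:
        return 0.0
    traffic = 2 * (n_ranks - 1) / n_ranks * payload_bytes
    return traffic / bandwidth_bytes_per_s + 2 * (n_ranks - 1) * latency_s


def tp_comm_time_per_step(payload_bytes, n_ranks, bandwidth_bytes_per_s, latency_s,
                          n_layers=32, allreduces_per_layer=2):
    """Total all-reduce time in one forward step (64 all-reduces by default)."""
    one = allreduce_time_s(payload_bytes, n_ranks, bandwidth_bytes_per_s, latency_s)
    return n_layers * allreduces_per_layer * one


def comm_fraction(comm_s, compute_s):
    """Assumed definition: share of the whole step spent communicating."""
    return comm_s / (comm_s + compute_s)


DECODE_PAYLOAD = 1 * 4096 * 2        # 8 KB, one decode token
PREFILL_PAYLOAD = 2048 * 4096 * 2    # 16 MB, one prefill chunk
COMPUTE_S = 8e-3                     # decode step compute time from the text
NVLINK = {"bandwidth_bytes_per_s": 300e9, "latency_s": 3e-6}      # bandwidth is assumed
ETHERNET = {"bandwidth_bytes_per_s": 10e9 / 8, "latency_s": 50e-6}

nvlink_comm = tp_comm_time_per_step(DECODE_PAYLOAD, 2, **NVLINK)
ethernet_comm = tp_comm_time_per_step(DECODE_PAYLOAD, 2, **ETHERNET)
print(f"NVLink: {nvlink_comm * 1e3:.2f} ms, fraction {comm_fraction(nvlink_comm, COMPUTE_S):.1%}")
print(f"10 GbE: {ethernet_comm * 1e3:.2f} ms, fraction {comm_fraction(ethernet_comm, COMPUTE_S):.1%}")


def with_half_latency(payload_bytes, link):
    """Step comm time if the link's per-hop latency were halved."""
    return tp_comm_time_per_step(payload_bytes, 2, **dict(link, latency_s=link["latency_s"] / 2))


def with_double_bandwidth(payload_bytes, link):
    """Step comm time if the link's bandwidth were doubled."""
    faster = dict(link, bandwidth_bytes_per_s=link["bandwidth_bytes_per_s"] * 2)
    return tp_comm_time_per_step(payload_bytes, 2, **faster)


# Regime split on the Ethernet link: which upgrade helps more?
decode_prefers_latency = with_half_latency(DECODE_PAYLOAD, ETHERNET) < with_double_bandwidth(DECODE_PAYLOAD, ETHERNET)
prefill_prefers_bandwidth = with_double_bandwidth(PREFILL_PAYLOAD, ETHERNET) < with_half_latency(PREFILL_PAYLOAD, ETHERNET)
print(decode_prefers_latency, prefill_prefers_bandwidth)

# More ranks means more hops, so more comm time on the same link.
nvlink_tp8 = tp_comm_time_per_step(DECODE_PAYLOAD, 8, **NVLINK)
print(nvlink_tp8 > nvlink_comm)
```

Changing the assumed NVLink bandwidth barely moves the decode result, because at 8 KB the latency term carries nearly all of the cost.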
Hitchhiker's notes
- Why TP at all, if it taxes every layer? Three reasons, in order of importance: the model doesn't fit on one GPU (the usual one); per-token latency — TP divides the weight-streaming time, so a bandwidth-bound decode step (Phase 0 lab-04's 8 ms) genuinely drops toward 8/N ms + comm, the only lever that shortens single-stream ITL on a too-slow GPU; and KV capacity — the cache splits across ranks too (lab-02's halved # GPU blocks per worker is per-rank; total capacity grows). The comm bill is what you pay for all three.
- vLLM's custom all-reduce: for small payloads (exactly the decode case), NCCL's general ring is beaten by a one-shot fused kernel over NVLink peer access — upstream/vllm/distributed/device_communicators/custom_all_reduce.py exists precisely because of the latency term you just modeled. When you read "custom allreduce disabled" in a startup log, you now know which workloads care.
- The model's omissions (know them before quoting it): overlap — real engines overlap some comm with compute, shaving the visible fraction; NVSwitch topology — 8-GPU nodes all-reduce at near-constant time rather than ring-scaling; and cross-node fabrics like InfiniBand (~2–5 µs, 50–400 Gb/s) sit between your NVLink and Ethernet endpoints — rerun the numbers for IB and you'll see why cross-node TP is merely painful rather than absurd on real clusters, and why it's still avoided when PP can serve.
- Hidden size moves the payload linearly, not the latency term: a 70B model (hidden 8192) doubles every payload while its compute per step is ~9× bigger, so comm fraction actually improves with model size. Small models are the worst TP candidates twice over (less to split, same hop count).
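Two of the notes above invite a rerun: the InfiniBand endpoints and the hidden-size scaling. The sketch below uses a hypothetical all-in-one helper, `decode_comm_fraction`, with the same ring model and the assumed comm / (comm + compute) definition. The InfiniBand ranges are the ones quoted in the notes. The 70B-class case holds the layer count at 32 purely for comparison, which is an assumption; deeper models add all-reduces in proportion, so pass a different `n_layers` to explore that.

```python
def decode_comm_fraction(hidden, compute_s, bandwidth_bytes_per_s, latency_s,
                         n_ranks=2, n_layers=32, dtype_bytes=2):
    """Comm share of one decode step under the ring model (one token per step)."""
    payload = hidden * dtype_bytes
    traffic = 2 * (n_ranks - 1) / n_ranks * payload
    one_allreduce = traffic / bandwidth_bytes_per_s + 2 * (n_ranks - 1) * latency_s
    comm = 2 * n_layers * one_allreduce  # two all-reduces per layer
    return comm / (comm + compute_s)


# (bandwidth in bytes/s, per-hop latency in s)
ETHERNET = (10e9 / 8, 50e-6)
IB_SLOW = (50e9 / 8, 5e-6)    # low end of the quoted InfiniBand range
IB_FAST = (400e9 / 8, 2e-6)   # high end of the quoted InfiniBand range

eth_8b = decode_comm_fraction(4096, 8e-3, *ETHERNET)
ib_slow_8b = decode_comm_fraction(4096, 8e-3, *IB_SLOW)
ib_fast_8b = decode_comm_fraction(4096, 8e-3, *IB_FAST)
print(f"8B over 10 GbE:        {eth_8b:.1%}")
print(f"8B over IB (low end):  {ib_slow_8b:.1%}")
print(f"8B over IB (high end): {ib_fast_8b:.1%}")

# 70B-class: hidden doubles, compute taken as 9x the 8 ms step (figures from the note).
eth_70b = decode_comm_fraction(8192, 9 * 8e-3, *ETHERNET)
print(f"70B-class over 10 GbE: {eth_70b:.1%}")
```

Remember that this model omits overlap and topology effects, so treat the InfiniBand fractions as a lower-fidelity estimate rather than a measurement.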
Going further
- Add an overlap_fraction parameter (comm hidden under compute) and find the break-even overlap that makes TP=2-over-IB match TP=2-over-NVLink for decode. You've quantified what async/overlapped all-reduce engineering is worth.
- Model TP × batch: comm payload grows with batch but compute grows too — plot comm fraction vs batch size for decode and find where Ethernet TP becomes tolerable (large-batch throughput serving — which is exactly when you didn't need TP's latency win anyway; the conclusion writes itself).
- Compute the bill for expert parallelism's all-to-all (Phase 7 lab-04's missing line): payload = routed tokens × hidden, frequency = 2 per MoE layer. Compare against TP's — you'll see why DeepSeek-scale MoE deployments obsess over network topology in a way dense-model TP never had to.
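As a starting point for the first exercise above, one possible shape for the overlap parameter is sketched here. Both function names are hypothetical, and the modeling choice (overlap hides a fixed fraction of comm time behind compute) is an assumption. The break-even search is left to the reader.

```python
def visible_comm_time_s(comm_s: float, overlap_fraction: float) -> float:
    """Comm time that is not hidden under compute."""
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be within [0, 1]")
    return comm_s * (1.0 - overlap_fraction)


def comm_fraction_with_overlap(comm_s: float, compute_s: float, overlap_fraction: float) -> float:
    """Share of the step spent on visible comm (assumed: visible / (visible + compute))."""
    visible = visible_comm_time_s(comm_s, overlap_fraction)
    return visible / (visible + compute_s)


# 6.4 ms is the Ethernet latency-only figure from the tests table; 8 ms is the decode step.
for overlap in (0.0, 0.5, 1.0):
    fraction = comm_fraction_with_overlap(6.4e-3, 8e-3, overlap)
    print(f"overlap {overlap:.0%}: comm fraction {fraction:.1%}")
```

Sweeping `overlap_fraction` for two links and finding where their fractions cross is then a short loop or a one-line algebraic solve.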
References
- upstream/vllm/distributed/device_communicators/custom_all_reduce.py — the latency-term workaround, in production.
- NVIDIA NCCL docs, Collective Operations — ring/tree algorithms and their cost models: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html
- Shoeybi et al., Megatron-LM (2019) — the column→row TP scheme and its two all-reduces per layer: https://arxiv.org/abs/1909.08053
- Pope et al., Efficiently Scaling Transformer Inference (2022) — §3's communication analysis, the rigorous version of this lab: https://arxiv.org/abs/2211.05102
- Lab-01 — the math being priced; lab-04 — the alternative with the opposite bill.