Lab 10-03 — The TP Communication Bill `[CPU-OK]`

Lab-01 proved tensor parallelism is mathematically free — exact reconstruction, one all-reduce per block. This lab prices what "one all-reduce" costs physically, and the answer derives the most-quoted deployment rule in distributed inference from four multiplications: TP within a node, never across. Same model, same math, same code — on NVLink the communication is noise (<10% of a decode step), on 10 GbE it's fatal (>40%, latency alone). You'll also derive the subtler corollary most people miss: for decode, the bill is dominated by latency, not bandwidth — 64 tiny 8 KB messages per token — which is why fancy interconnect bandwidth numbers don't save cross-node TP and why prefill and decode want different links.

Why this lab exists
Background: what gets sent, how often, and how
Files
Run
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

"TP needs fast interconnect" is folklore until you can compute how fast, for which workload, and what happens if you ignore it. Those computations decide real money: whether a 70B model needs an NVLink-equipped node or can spread across two cheaper ones (it can't — not with TP; that's what PP is for, lab-04), whether TP=8 beats TP=4 for your latency target, whether a cloud's "high-bandwidth networking" claim is relevant (check the latency; for decode it usually matters more). This lab builds the five-function model that answers all of them on a napkin — the distributed sibling of Phase 0 lab-04's roofline, and like it, a model whose domain of validity you'll know because you built it.

It's also the quantitative half of a design story the phase tells in two parts: TP (this lab) pays communication per layer and demands fat links but splits every matrix; PP (lab-04) pays per stage boundary and tolerates thin links but idles GPUs in bubbles. Every real deployment of a big model is a negotiation between these two bills, and you're about to be able to compute both sides.

Background: what gets sent, how often, and how

What: after each RowParallelLinear (lab-01), every rank holds a partial sum of the activation; the all-reduce sums them. Payload = the activation tensor: batch_tokens × hidden × dtype_bytes. For one decode token of an 8B model: 4096 × 2 = 8 KB. Tiny. For a 2048-token prefill chunk: 16 MB. Not tiny. Same operation, three orders of magnitude apart — keep both numbers in mind; they split the analysis.

How often: twice per layer (attention out-proj, MLP down-proj) × 32 layers = 64 all-reduces per step, every step, forever. Communication frequency is set by model depth, not by anything you can tune.

How: ring all-reduce — reduce-scatter then all-gather, each rank sending 2·(N−1)/N × payload total across 2(N−1) sequential hops. The formula's two terms are the lab's two regimes: traffic / bandwidth (dominates for big payloads: prefill) and 2(N−1) × latency (dominates for small ones: decode). A 3 µs NVLink hop vs a 50 µs Ethernet round-trip is the 17× that, multiplied by 64 all-reduces, becomes the node boundary.

Files

starter.py — allreduce_payload_bytes, ring_allreduce_traffic_per_rank, allreduce_time_s, tp_comm_time_per_step, comm_fraction. Your work.
solution.py — reference.
test_lab.py — the formulas, the NVLink-vs-Ethernet verdict, both latency/bandwidth regimes, and the more-ranks-more-overhead direction.

Run

LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-03-tp-comm-cost -q
pytest phase-10-distributed-inference/labs/lab-03-tp-comm-cost -q   # reference

What the tests prove

Test	What it pins
`test_payload_is_one_activation_row_per_token`	The 8 KB decode payload — memorize it; it's why decode TP is a latency problem
`test_ring_traffic_formula`	`2(N−1)/N`: at N=2 each rank moves exactly one payload; as N grows it approaches 2× — traffic per rank is nearly constant in N (the ring's genius), it's the hop count that grows
`test_decode_step_comm_on_nvlink_is_noise`	64 all-reduces on NVLink < 1 ms, < 10% of the 8 ms decode step (Phase 0 lab-04's number) — TP=2 in-node is nearly free
`test_decode_step_comm_on_ethernet_is_fatal`	Same step over 10 GbE: latency alone is 64 × 2 × 50 µs = 6.4 ms, > 40% of the step. The "never TP across nodes" rule, as an assert
`test_latency_dominates_small_payloads` / `test_prefill_payloads_shift_the_balance_to_bandwidth`	The regime split: for decode, halving latency beats doubling bandwidth; for prefill, the reverse. One model, two correct answers to "what should we buy?"
`test_more_ranks_more_overhead`	TP=8 > TP=2 in comm time: TP scaling is sub-linear by construction, before any software inefficiency

Hitchhiker's notes

Why TP at all, if it taxes every layer? Three reasons, in order of importance: the model doesn't fit on one GPU (the usual one); per-token latency — TP divides the weight-streaming time, so a bandwidth-bound decode step (Phase 0 lab-04's 8 ms) genuinely drops toward 8/N ms + comm, the only lever that shortens single-stream ITL on a too-slow GPU; and KV capacity — the cache splits across ranks too (lab-02's halved # GPU blocks per worker is per-rank; total capacity grows). The comm bill is what you pay for all three.
vLLM's custom all-reduce: for small payloads (exactly the decode case), NCCL's general ring is beaten by a one-shot fused kernel over NVLink peer access — upstream/vllm/distributed/device_communicators/custom_all_reduce.py exists precisely because of the latency term you just modeled. When you read "custom allreduce disabled" in a startup log, you now know which workloads care.
The model's omissions (know them before quoting it): overlap — real engines overlap some comm with compute, shaving the visible fraction; NVSwitch topology — 8-GPU nodes all-reduce at near-constant time rather than ring-scaling; and cross-node fabrics like InfiniBand (~2–5 µs, 50–400 Gb/s) sit between your NVLink and Ethernet endpoints — rerun the numbers for IB and you'll see why cross-node TP is merely painful rather than absurd on real clusters, and why it's still avoided when PP can serve.
Hidden size moves the bill linearly — a 70B model (hidden 8192) doubles every payload, and its compute per step is ~9× bigger; comm fraction actually improves with model size. Small models are the worst TP candidates twice over (less to split, same hop count).

Going further

Add an overlap_fraction parameter (comm hidden under compute) and find the break-even overlap that makes TP=2-over-IB match TP=2-over-NVLink for decode. You've quantified what async/overlapped all-reduce engineering is worth.
Model TP × batch: comm payload grows with batch but compute grows too — plot comm fraction vs batch size for decode and find where Ethernet TP becomes tolerable (large-batch throughput serving — which is exactly when you didn't need TP's latency win anyway; the conclusion writes itself).
Compute the bill for expert parallelism's all-to-all (Phase 7 lab-04's missing line): payload = routed tokens × hidden, frequency = 2 per MoE layer. Compare against TP's — you'll see why DeepSeek-scale MoE deployments obsess over network topology in a way dense-model TP never had to.

References

upstream/vllm/distributed/device_communicators/custom_all_reduce.py — the latency-term workaround, in production.
NVIDIA NCCL docs, Collective Operations — ring/tree algorithms and their cost models: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html
Shoeybi et al., Megatron-LM (2019) — the column→row TP scheme and its two all-reduces per layer: https://arxiv.org/abs/1909.08053
Pope et al., Efficiently Scaling Transformer Inference (2022) — §3's communication analysis, the rigorous version of this lab: https://arxiv.org/abs/2211.05102
Lab-01 — the math being priced; lab-04 — the alternative with the opposite bill.

vLLM Mastery — From Zero to Maintainer