Phase 10 Labs — Distributed Inference
Four labs on splitting one model across many GPUs. The arc: prove tensor parallelism's algebra and the one-all-reduce pairing (lab-01), price its communication and derive the within-a-node rule (lab-03), meet the cross-node alternative and its bubble (lab-04), then watch TP=2 split a real model's weights and KV on real hardware (lab-02).
Recommended order: 01 → 03 → 04 → 02. The directory numbers predate labs 03–04, so
they do not follow the arc: math, bill, alternative, demo. CPU labs follow the standard
contract: starter.py (your work), solution.py (reference), and test_lab.py (the spec).
By default the tests run the solution; LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-10-distributed-inference/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-01-tp-sharding-math -q
Contents
- lab-01-tp-sharding-math [CPU-OK]
- lab-02-two-way-tp [GPU-OPT]
- lab-03-tp-comm-cost [CPU-OK]
- lab-04-pipeline-bubble [CPU-OK]
- What you can do after this phase
Labs
lab-01-tp-sharding-math [CPU-OK]
Tensor parallelism as provable algebra: column-parallel (slice outputs, all-gather)
and row-parallel (slice inputs, all-reduce) reconstruct the dense result exactly, and
the Megatron column→row pairing makes a whole MLP cost one all-reduce — asserted by
a counter, not claimed. Includes the divisibility constraint that caps real TP sizes.
Skills: the two shardings; communication designed out, not optimized out; mapping to
ColumnParallelLinear/RowParallelLinear.
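The column→row pairing can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the lab's solution.py: the helper names (matmul, split_cols, split_rows, Comm, tp_mlp) are hypothetical, the activation is ReLU, and the matrices hold small integers so that "exactly" can be checked with ==.

```python
import random

def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Elementwise activation, so it can run on each shard without communication."""
    return [[max(v, 0) for v in row] for row in m]

def split_cols(m, tp):
    """Column-parallel shards: each rank owns a slice of the output dimension."""
    n = len(m[0])
    assert n % tp == 0, "sharded dimension must be divisible by the TP size"
    w = n // tp
    return [[row[r * w:(r + 1) * w] for row in m] for r in range(tp)]

def split_rows(m, tp):
    """Row-parallel shards: each rank owns a slice of the input dimension."""
    n = len(m)
    assert n % tp == 0, "sharded dimension must be divisible by the TP size"
    h = n // tp
    return [m[r * h:(r + 1) * h] for r in range(tp)]

class Comm:
    """Hypothetical stand-in for a collective library; it only counts all-reduces."""
    def __init__(self):
        self.all_reduces = 0

    def all_reduce(self, partials):
        self.all_reduces += 1
        # Elementwise sum of one partial matrix per rank.
        return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def tp_mlp(x, w1, w2, tp, comm):
    """MLP with W1 column-sharded and W2 row-sharded; ranks are simulated in a loop."""
    partials = []
    for w1_shard, w2_shard in zip(split_cols(w1, tp), split_rows(w2, tp)):
        hidden = relu(matmul(x, w1_shard))         # local: no all-gather needed
        partials.append(matmul(hidden, w2_shard))  # partial sum of the full output
    return comm.all_reduce(partials)               # the only collective in the block

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]

x, w1, w2 = rand_matrix(2, 4), rand_matrix(4, 8), rand_matrix(8, 4)
comm = Comm()
dense = matmul(relu(matmul(x, w1)), w2)
sharded = tp_mlp(x, w1, w2, tp=4, comm=comm)
assert sharded == dense        # exact reconstruction (integer arithmetic)
assert comm.all_reduces == 1   # counted, not claimed
```

The column shard's output slice is exactly the input slice the matching row shard needs, which is why no all-gather appears between the two matmuls. Asking for tp=3 on the 8-wide hidden dimension trips the divisibility assert.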
lab-02-two-way-tp [GPU-OPT]
tensor_parallel_size=2 live: two worker processes, 1.24 + 1.24 = 2.48 GiB of
weights, per-rank KV blocks, and output matching TP=1 (up to last-ulp rounding). The
observable surface of TP — and how to reconcile every log line against labs 01/03.
Annotated capture included. Skills: reading per-rank memory/block reports; lockstep
workers and the slowest-rank rule; when two TP=1 replicas beat TP=2.
lab-03-tp-comm-cost [CPU-OK]
The bill: 2 all-reduces × 32 layers × an 8 KB decode payload, priced with the ring formula on NVLink, PCIe, and Ethernet. Derives "TP within a node, never across" as an assert (>40% of the step lost to latency on 10 GbE) — and the subtler split: decode comm is latency-bound, prefill comm is bandwidth-bound, so the right interconnect depends on the workload. Skills: the ring all-reduce cost model; latency vs bandwidth regimes; pricing EP's all-to-all with the same tools.
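A minimal sketch of the ring cost model follows, using the standard form: 2(n−1) steps, each paying one link latency and moving payload/n bytes. The link latencies and bandwidths below are illustrative assumptions, not the lab's figures, so the sketch shows the latency-bound versus bandwidth-bound split rather than reproducing the lab's 40% result. It also assumes the 8 KB decode payload means 8 × 1024 bytes per token, and that a hypothetical 2048-token prefill scales the payload linearly.

```python
def ring_all_reduce_cost(payload_bytes, n_ranks, latency_s, bandwidth_bytes_per_s):
    """Return (latency_term, bandwidth_term) in seconds for one ring all-reduce."""
    steps = 2 * (n_ranks - 1)  # reduce-scatter phase plus all-gather phase
    latency_term = steps * latency_s
    bandwidth_term = steps * (payload_bytes / n_ranks) / bandwidth_bytes_per_s
    return latency_term, bandwidth_term

# ASSUMED link parameters: (per-hop latency in seconds, bandwidth in bytes/second).
LINKS = {
    "NVLink": (2e-6, 300e9),
    "PCIe": (5e-6, 25e9),
    "10GbE": (50e-6, 1.25e9),
}

ALL_REDUCES_PER_STEP = 2 * 32                      # 2 all-reduces x 32 layers
DECODE_PAYLOAD = 8 * 1024                          # assumed: 8 KB per decoded token
PREFILL_TOKENS = 2048                              # assumed prompt length
PREFILL_PAYLOAD = DECODE_PAYLOAD * PREFILL_TOKENS  # payload grows with token count

def regime(payload_bytes, link, n_ranks=2):
    """Name the term that dominates a single all-reduce on the given link."""
    latency_s, bandwidth = LINKS[link]
    lat, bw = ring_all_reduce_cost(payload_bytes, n_ranks, latency_s, bandwidth)
    return "latency-bound" if lat > bw else "bandwidth-bound"

for link, (latency_s, bandwidth) in LINKS.items():
    lat, bw = ring_all_reduce_cost(DECODE_PAYLOAD, 2, latency_s, bandwidth)
    step_comm_ms = ALL_REDUCES_PER_STEP * (lat + bw) * 1e3
    print(f"{link:>6}: decode comm per step = {step_comm_ms:.3f} ms; "
          f"decode is {regime(DECODE_PAYLOAD, link)}, "
          f"prefill is {regime(PREFILL_PAYLOAD, link)}")
```

With these assumed numbers the decode payload is so small that the latency term dominates on every link, while the prefill payload flips every link into the bandwidth regime. The same two-term shape can be reused to price other collectives by swapping the step count and per-step bytes.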
lab-04-pipeline-bubble [CPU-OK]
The cross-node alternative: stages by layer, one activation handoff per boundary —
and the bubble, (p−1)/(p+m−1), derived twice (algebra and a simulated schedule grid
that must reconcile exactly). p=8 under a 10% bubble budget needs 63 in-flight
microbatches: PP's economics are batch economics. Skills: fill-drain geometry; PP
buys throughput and nothing for latency; TP×PP composition; stragglers, third
appearance.
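The two derivations of the bubble can be sketched as follows. This is a simplified illustration, not the lab's reference code: the function names are hypothetical, and it assumes a plain fill-drain schedule in which every stage takes one equal time slot per microbatch.

```python
from fractions import Fraction
import math

def bubble_fraction(p, m):
    """Closed form: idle share of a fill-drain pipeline with p stages, m microbatches."""
    return Fraction(p - 1, p + m - 1)

def simulated_bubble(p, m):
    """Build the schedule grid (stage x time slot) and count the idle cells."""
    slots = p + m - 1                       # time until the last stage drains
    grid = [[None] * slots for _ in range(p)]
    for stage in range(p):
        for mb in range(m):
            grid[stage][stage + mb] = mb    # stage s starts microbatch j at slot s + j
    idle = sum(cell is None for row in grid for cell in row)
    return Fraction(idle, p * slots)

def min_microbatches(p, budget):
    """Smallest m with (p-1)/(p+m-1) <= budget, i.e. m >= (p-1)(1-budget)/budget."""
    budget = Fraction(budget)
    return math.ceil((p - 1) * (1 - budget) / budget)

# The algebra and the grid must reconcile exactly, hence Fraction rather than float.
assert simulated_bubble(8, 63) == bubble_fraction(8, 63) == Fraction(1, 10)
assert min_microbatches(8, Fraction(1, 10)) == 63
assert bubble_fraction(8, 62) > Fraction(1, 10)   # one fewer microbatch misses the budget
```

Each stage sits idle for p−1 slots regardless of m, so the only way to shrink the fraction is to grow the denominator with more in-flight microbatches.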
What you can do after this phase
Decide, from arithmetic, how to place a model on a cluster: minimum TP for fit, TP vs data-parallel replicas for throughput, TP×PP composition across nodes, and what each choice costs in collectives or bubbles; read a distributed deployment's startup logs as a checksum of the sharding; and debug the classics (slow rank drags the ensemble, cross-node TP melting p99, PP starving at low traffic) from models you built rather than lore. Phase 15 splits the workload (prefill from decode) where this phase split the model.