Phase 10 Labs — Distributed Inference

Four labs on splitting one model across many GPUs. The arc: prove tensor parallelism's algebra and the one-all-reduce pairing (lab-01), price its communication and derive the within-a-node rule (lab-03), meet the cross-node alternative and its bubble (lab-04), then watch TP=2 split a real model's weights and KV on real hardware (lab-02).

Recommended order: 01 → 03 → 04 → 02 (math, bill, alternative, demo). The directory numbers predate labs 03–04, so they do not match this order. CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). By default the tests run the solution; setting LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-10-distributed-inference/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-01-tp-sharding-math -q

Labs

lab-01-tp-sharding-math [CPU-OK]

Tensor parallelism as provable algebra: column-parallel (slice outputs, all-gather) and row-parallel (slice inputs, all-reduce) reconstruct the dense result exactly, and the Megatron column→row pairing makes a whole MLP cost one all-reduce — asserted by a counter, not claimed. Includes the divisibility constraint that caps real TP sizes. Skills: the two shardings; communication designed out, not optimized out; mapping to ColumnParallelLinear/RowParallelLinear.
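
The algebra described above can be checked in a few lines. This is a minimal sketch, not the lab's solution.py: it assumes pure-Python integer matrices (so equality is exact), a ReLU standing in for the MLP activation, and a hypothetical `Comm` class that simulates the collectives in-process and counts them.

```python
"""Minimal TP sketch: integer matrices as nested lists, simulated collectives."""
import random


def matmul(a, b):
    """Dense product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_cols(w, tp):
    """Column shards: each rank owns a contiguous slice of the columns."""
    n = len(w[0])
    if n % tp:
        raise ValueError(f"{n} columns not divisible by tp={tp}")
    step = n // tp
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp)]


def split_rows(w, tp):
    """Row shards: each rank owns a contiguous slice of the rows."""
    n = len(w)
    if n % tp:
        raise ValueError(f"{n} rows not divisible by tp={tp}")
    step = n // tp
    return [w[r * step:(r + 1) * step] for r in range(tp)]


class Comm:
    """Hypothetical in-process stand-in for collectives; counts each call."""

    def __init__(self):
        self.all_reduces = 0
        self.all_gathers = 0

    def all_reduce(self, parts):
        # Elementwise sum of same-shaped per-rank matrices.
        self.all_reduces += 1
        return [[sum(vals) for vals in zip(*rows)] for rows in zip(*parts)]

    def all_gather(self, parts):
        # Concatenate per-rank matrices along the column axis.
        self.all_gathers += 1
        return [sum(rows, []) for rows in zip(*parts)]


def column_parallel(x, w, tp, comm):
    """Slice outputs, then all-gather the column blocks."""
    return comm.all_gather([matmul(x, shard) for shard in split_cols(w, tp)])


def row_parallel(x, w, tp, comm):
    """Slice inputs (columns of x, rows of w), then all-reduce the partial sums."""
    pairs = zip(split_cols(x, tp), split_rows(w, tp))
    return comm.all_reduce([matmul(xs, ws) for xs, ws in pairs])


def tp_mlp(x, w1, w2, tp, comm):
    """Column-parallel w1 feeds row-parallel w2 directly: one all-reduce total."""
    pairs = zip(split_cols(w1, tp), split_rows(w2, tp))
    return comm.all_reduce([matmul(relu(matmul(x, a)), b) for a, b in pairs])


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]


x, w1, w2 = rand_matrix(2, 4), rand_matrix(4, 8), rand_matrix(8, 4)
comm = Comm()
assert tp_mlp(x, w1, w2, tp=2, comm=comm) == matmul(relu(matmul(x, w1)), w2)
assert (comm.all_reduces, comm.all_gathers) == (1, 0)
```

The pairing works because the activation is elementwise: each rank can apply it to its own column block of the hidden layer without seeing the others, so no all-gather is needed between the two matmuls. With floating-point weights the row-parallel sum changes the order of additions, so a real test would compare within a tolerance rather than with `==`.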

lab-02-two-way-tp [GPU-OPT]

tensor_parallel_size=2 live: two worker processes, 1.24 + 1.24 = 2.48 GiB of weights, per-rank KV blocks, and output matching TP=1 (to the last ulp's mercy). The observable surface of TP — and how to reconcile every log line against labs 01/03. Annotated capture included. Skills: reading per-rank memory/block reports; lockstep workers and the slowest-rank rule; when two TP=1 replicas beat TP=2.
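
Two of the reconciliations above reduce to one-line arithmetic. The sketch below is not taken from the lab's capture. It assumes weights shard evenly across ranks (as the 1.24 + 1.24 = 2.48 GiB split suggests) and that lockstep workers finish a step only when the slowest rank does. The function names and the timing values are hypothetical.

```python
def per_rank_weight_gib(total_gib, tp):
    """Weights held by each rank, assuming an even split across tp ranks."""
    return total_gib / tp


def lockstep_step_ms(rank_compute_ms, comm_ms=0.0):
    """Step time under lockstep: the slowest rank sets the pace, plus collective time."""
    return max(rank_compute_ms) + comm_ms


# Even split of the 2.48 GiB total reported for TP=2.
print(per_rank_weight_gib(2.48, 2))

# Hypothetical per-rank compute times (ms) and collective time (ms).
print(lockstep_step_ms([10.0, 12.5], comm_ms=0.4))
```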

lab-03-tp-comm-cost [CPU-OK]

The bill: 2 all-reduces × 32 layers × an 8 KB decode payload, priced with the ring formula on NVLink, PCIe, and Ethernet. Derives "TP within a node, never across" as an assert (>40% of the step lost to latency on 10 GbE) — and the subtler split: decode comm is latency-bound, prefill comm is bandwidth-bound, so the right interconnect depends on the workload. Skills: the ring all-reduce cost model; latency vs bandwidth regimes; pricing EP's all-to-all with the same tools.
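
The pricing above can be sketched with the usual ring all-reduce cost model: 2(N−1) steps, each paying one link latency and moving 1/N of the payload. The layer count, all-reduce count, and 8 KB payload come from the description above. The per-hop latencies and the NVLink and PCIe bandwidths in `LINKS` are placeholder assumptions, not measured values and not the lab's parameters, so this sketch does not attempt to reproduce the 40% figure.

```python
DECODE_PAYLOAD_BYTES = 8 * 1024  # 8 KB, taken here as 8192 bytes
ALL_REDUCES_PER_LAYER = 2
NUM_LAYERS = 32

# Hypothetical (per-hop latency in seconds, bandwidth in bytes/second).
# Only the 10 GbE bandwidth (10 Gbit/s = 1.25 GB/s) follows from the link name.
LINKS = {
    "nvlink": (3e-6, 300e9),
    "pcie": (5e-6, 25e9),
    "ethernet_10g": (50e-6, 1.25e9),
}


def ring_all_reduce(n_ranks, payload_bytes, latency_s, bandwidth_bps):
    """Return (latency term, bandwidth term) of one ring all-reduce, in seconds."""
    steps = 2 * (n_ranks - 1)  # reduce-scatter phase plus all-gather phase
    latency_term = steps * latency_s
    bandwidth_term = steps * (payload_bytes / n_ranks) / bandwidth_bps
    return latency_term, bandwidth_term


def step_comm(n_ranks, payload_bytes, link):
    """Communication per model step: every layer pays its all-reduces."""
    lat, bw = ring_all_reduce(n_ranks, payload_bytes, *LINKS[link])
    calls = ALL_REDUCES_PER_LAYER * NUM_LAYERS
    return calls * lat, calls * bw


PREFILL_TOKENS = 2048  # hypothetical prompt length; payload scales with tokens

for link in LINKS:
    d_lat, d_bw = step_comm(2, DECODE_PAYLOAD_BYTES, link)
    p_lat, p_bw = step_comm(2, PREFILL_TOKENS * DECODE_PAYLOAD_BYTES, link)
    print(f"{link:>13}: decode {1e3 * (d_lat + d_bw):.3f} ms "
          f"(latency share {d_lat / (d_lat + d_bw):.0%}), "
          f"prefill {1e3 * (p_lat + p_bw):.3f} ms "
          f"(bandwidth share {p_bw / (p_lat + p_bw):.0%})")
```

Under these placeholder numbers the decode payload is so small that the latency term dominates on every link, while the prefill payload flips the balance toward bandwidth. That is the regime split the lab asserts.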

lab-04-pipeline-bubble [CPU-OK]

The cross-node alternative: stages by layer, one activation handoff per boundary — and the bubble, (p−1)/(p+m−1), derived twice (algebra and a simulated schedule grid that must reconcile exactly). p=8 under a 10% bubble budget needs 63 in-flight microbatches: PP's economics are batch economics. Skills: fill-drain geometry; PP buys throughput and nothing for latency; TP×PP composition; stragglers, third appearance.
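
The bubble formula and the 63-microbatch figure above can be reproduced with exact rational arithmetic. This sketch is not the lab's reference solution. It assumes a forward-only fill-drain schedule in which every stage takes one equal time slot per microbatch, and the function names are illustrative.

```python
from fractions import Fraction


def bubble_fraction(p, m):
    """Closed form: idle share of a fill-drain schedule with p stages, m microbatches."""
    return Fraction(p - 1, p + m - 1)


def simulated_bubble(p, m):
    """Build the schedule grid and count idle cells directly."""
    slots = p + m - 1
    # grid[s][t] is True when stage s is busy in time slot t:
    # stage s handles microbatch j in slot s + j.
    grid = [[0 <= t - s < m for t in range(slots)] for s in range(p)]
    idle = sum(not busy for row in grid for busy in row)
    return Fraction(idle, p * slots)


def min_microbatches(p, budget):
    """Smallest m whose bubble fraction does not exceed the budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    m = 1
    while bubble_fraction(p, m) > budget:
        m += 1
    return m


# The two derivations must reconcile exactly.
assert simulated_bubble(8, 63) == bubble_fraction(8, 63) == Fraction(1, 10)
assert min_microbatches(8, Fraction(1, 10)) == 63
```

Using `Fraction` rather than floats keeps the comparison at the 10% boundary exact: at m = 63 the bubble is exactly 7/70.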

What you can do after this phase

Decide, from arithmetic, how to place a model on a cluster: minimum TP for fit, TP vs data-parallel replicas for throughput, TP×PP composition across nodes, and what each choice costs in collectives or bubbles; read a distributed deployment's startup logs as a checksum of the sharding; and debug the classics (slow rank drags the ensemble, cross-node TP melting p99, PP starving at low traffic) from models you built rather than lore. Phase 15 splits the workload (prefill from decode) where this phase split the model.
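
As a rough illustration of the "minimum TP for fit" decision, the sketch below searches for the smallest TP degree that satisfies two assumed constraints: the attention head count divides evenly, and the per-rank weight shard fits in a memory budget. The function name, the budget parameter (memory left for weights after reserving KV cache and activations), and the example numbers are all hypothetical.

```python
def min_tp_for_fit(weights_gib, weight_budget_gib, num_heads, max_tp=8):
    """Smallest TP degree whose even weight shard fits, or None if none does.

    Assumes weights split evenly and that TP must divide the head count.
    """
    for tp in range(1, max_tp + 1):
        if num_heads % tp:
            continue  # divisibility constraint rules this degree out
        if weights_gib / tp <= weight_budget_gib:
            return tp
    return None


# Hypothetical model: 140 GiB of weights, 64 attention heads.
print(min_tp_for_fit(140, 80, 64))
print(min_tp_for_fit(140, 40, 64))
```

Fit is only the floor. Whether to go beyond it, or to run replicas instead, is a throughput question that the communication and bubble models above are meant to answer.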