Phase 10 Labs — Distributed Inference

Four labs on splitting one model across many GPUs. The arc: prove tensor parallelism's algebra and the one-all-reduce pairing (lab-01), price its communication and derive the within-a-node rule (lab-03), meet the cross-node alternative and its bubble (lab-04), then watch TP=2 split a real model's weights and KV on real hardware (lab-02).

Recommended order: 01 → 03 → 04 → 02. The directory numbers predate labs 03 and 04, so they do not match the teaching order: math, bill, alternative, demo. CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). By default the tests run the solution; set LAB_IMPL=starter to grade yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-10-distributed-inference/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-01-tp-sharding-math -q


Labs

lab-01-tp-sharding-math `[CPU-OK]`

Tensor parallelism as provable algebra: column-parallel (slice outputs, all-gather) and row-parallel (slice inputs, all-reduce) reconstruct the dense result exactly, and the Megatron column→row pairing makes a whole MLP cost one all-reduce — asserted by a counter, not claimed. Includes the divisibility constraint that caps real TP sizes. Skills: the two shardings; communication designed out, not optimized out; mapping to ColumnParallelLinear/RowParallelLinear.
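The algebra above can be sketched in a few lines of plain Python. This is an illustration, not the lab's solution: `Comm`, `split_cols`, `split_rows`, and `tp_mlp` are hypothetical names, ReLU stands in for whatever elementwise activation the lab uses, and integer matrices are used so that "exactly" can be checked with `==` (with floats the match holds only up to rounding).

```python
import random

def matmul(a, b):
    """Dense matmul on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def split_cols(w, tp):
    """Column-parallel: each rank keeps a contiguous slice of output columns."""
    n = len(w[0])
    if n % tp:
        raise ValueError(f"{n} columns not divisible by tp={tp}")
    step = n // tp
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp)]

def split_rows(w, tp):
    """Row-parallel: each rank keeps a contiguous slice of input rows."""
    n = len(w)
    if n % tp:
        raise ValueError(f"{n} rows not divisible by tp={tp}")
    step = n // tp
    return [w[r * step:(r + 1) * step] for r in range(tp)]

class Comm:
    """Toy collectives that count how often they are called (hypothetical)."""
    def __init__(self):
        self.all_reduces = 0
        self.all_gathers = 0

    def all_reduce(self, partials):
        # Elementwise sum of same-shaped partial results from every rank.
        self.all_reduces += 1
        rows, cols = len(partials[0]), len(partials[0][0])
        return [[sum(p[i][j] for p in partials) for j in range(cols)]
                for i in range(rows)]

    def all_gather(self, shards):
        # Concatenate each rank's column slice back into full rows.
        self.all_gathers += 1
        return [sum((s[i] for s in shards), []) for i in range(len(shards[0]))]

def tp_mlp(x, w1, w2, tp, comm):
    """Column-parallel w1 feeding row-parallel w2: one all-reduce in total."""
    partials = []
    for w1_r, w2_r in zip(split_cols(w1, tp), split_rows(w2, tp)):
        h_r = relu(matmul(x, w1_r))       # local; elementwise op needs no comm
        partials.append(matmul(h_r, w2_r))
    return comm.all_reduce(partials)

rng = random.Random(0)

def rand(r, c):
    return [[rng.randint(-3, 3) for _ in range(c)] for _ in range(r)]

x, w1, w2 = rand(2, 4), rand(4, 8), rand(8, 4)
comm = Comm()
dense = matmul(relu(matmul(x, w1)), w2)
sharded = tp_mlp(x, w1, w2, tp=2, comm=comm)
print("exact match:", sharded == dense, "| all-reduces:", comm.all_reduces)
```

The pairing works because the column slice of `w1` and the row slice of `w2` cut the same hidden dimension, so each rank's intermediate never has to be gathered. The `ValueError` in the split helpers is the divisibility constraint in miniature: a hidden size of 8 cannot be split three ways.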

lab-02-two-way-tp `[GPU-OPT]`

tensor_parallel_size=2 live: two worker processes, 1.24 + 1.24 = 2.48 GiB of weights, per-rank KV blocks, and output matching TP=1 up to floating-point rounding, since sharding changes the order of the reductions. This is the observable surface of TP, and the lab shows how to reconcile every log line against labs 01 and 03. Annotated capture included. Skills: reading per-rank memory/block reports; lockstep workers and the slowest-rank rule; when two TP=1 replicas beat TP=2.

lab-03-tp-comm-cost `[CPU-OK]`

The bill: 2 all-reduces × 32 layers × an 8 KB decode payload, priced with the ring formula on NVLink, PCIe, and Ethernet. Derives "TP within a node, never across" as an assert (>40% of the step lost to latency on 10 GbE) — and the subtler split: decode comm is latency-bound, prefill comm is bandwidth-bound, so the right interconnect depends on the workload. Skills: the ring all-reduce cost model; latency vs bandwidth regimes; pricing EP's all-to-all with the same tools.
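A minimal sketch of the ring cost model described above, assuming the common form in which an n-rank ring all-reduce takes 2(n−1) steps and each step moves 1/n of the payload. Every link number below (latency and bandwidth for NVLink, PCIe, and 10 GbE) and the 2048-token prefill size are illustrative assumptions, not the lab's values, so this does not reproduce the lab's 40% figure; it only shows the latency/bandwidth split. `Link`, `ring_all_reduce`, and `step_comm` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    name: str
    latency_s: float       # per-step latency (ASSUMED, order of magnitude only)
    bandwidth_Bps: float   # bytes per second (ASSUMED, order of magnitude only)

def ring_all_reduce(payload_bytes, n, link):
    """Return (latency_term, bandwidth_term) in seconds for one all-reduce."""
    steps = 2 * (n - 1)                      # reduce-scatter + all-gather
    latency = steps * link.latency_s
    bandwidth = steps * (payload_bytes / n) / link.bandwidth_Bps
    return latency, bandwidth

def step_comm(payload_bytes, n, link, layers=32, all_reduces_per_layer=2):
    """Communication per forward step, split into latency and bandwidth terms."""
    lat, bw = ring_all_reduce(payload_bytes, n, link)
    calls = layers * all_reduces_per_layer
    return calls * lat, calls * bw

LINKS = [
    Link("NVLink", 2e-6, 300e9),    # assumed
    Link("PCIe", 5e-6, 25e9),       # assumed
    Link("10GbE", 50e-6, 1.25e9),   # assumed
]

DECODE_BYTES = 8 * 1024                   # the 8 KB decode payload from the text
PREFILL_BYTES = 2048 * DECODE_BYTES       # ASSUMED: 2048 prompt tokens at once

for link in LINKS:
    for label, size in (("decode", DECODE_BYTES), ("prefill", PREFILL_BYTES)):
        lat, bw = step_comm(size, n=2, link=link)
        regime = "latency-bound" if lat > bw else "bandwidth-bound"
        print(f"{link.name:7s} {label:8s} latency={lat * 1e3:9.4f} ms "
              f"bandwidth={bw * 1e3:9.4f} ms  {regime}")
```

Under these assumed numbers the decode payload is small enough that the latency term dominates on every link, while the prefill payload flips every link into the bandwidth-bound regime. Swapping in measured latencies and bandwidths changes the magnitudes, not the structure of the calculation.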

lab-04-pipeline-bubble `[CPU-OK]`

The cross-node alternative: stages by layer, one activation handoff per boundary — and the bubble, (p−1)/(p+m−1), derived twice (algebra and a simulated schedule grid that must reconcile exactly). p=8 under a 10% bubble budget needs 63 in-flight microbatches: PP's economics are batch economics. Skills: fill-drain geometry; PP buys throughput and nothing for latency; TP×PP composition; stragglers, third appearance.
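The bubble arithmetic above can be checked both ways in a short sketch. It assumes the simplest fill-drain schedule: every stage takes one time slot per microbatch, and stage s works on microbatch j in slot s + j. The lab's own grid may be laid out differently; `bubble_fraction`, `min_microbatches`, `schedule_grid`, and `simulated_bubble` are hypothetical names.

```python
import math
from fractions import Fraction

def bubble_fraction(p, m):
    """Closed form: idle share of a fill-drain pipeline, (p-1)/(p+m-1)."""
    return Fraction(p - 1, p + m - 1)

def min_microbatches(p, budget):
    """Smallest m with bubble <= budget, from m >= (p-1)(1-b)/b.

    `budget` should be a Fraction so the comparison stays exact.
    """
    return max(1, math.ceil((p - 1) * (1 - budget) / budget))

def schedule_grid(p, m):
    """p rows (stages) by p+m-1 columns (slots); None marks an idle cell."""
    slots = p + m - 1
    grid = [[None] * slots for _ in range(p)]
    for stage in range(p):
        for mb in range(m):
            grid[stage][stage + mb] = mb   # stage s sees microbatch j at slot s+j
    return grid

def simulated_bubble(p, m):
    """Count idle cells in the grid instead of trusting the algebra."""
    grid = schedule_grid(p, m)
    cells = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return Fraction(idle, cells)

p, budget = 8, Fraction(1, 10)
m = min_microbatches(p, budget)
print("microbatches needed:", m)
print("closed form:", bubble_fraction(p, m), "| simulated:", simulated_bubble(p, m))

# Small grid, printed so the fill and drain triangles are visible.
for row in schedule_grid(3, 4):
    print(" ".join("." if c is None else str(c) for c in row))
```

Using `Fraction` rather than floats lets the two derivations be compared with `==`, which is the "must reconcile exactly" requirement in miniature. The idle cells form two triangles of p(p−1)/2 cells each, which is where the p−1 in the numerator comes from.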

What you can do after this phase

Decide, from arithmetic, how to place a model on a cluster: minimum TP for fit, TP vs data-parallel replicas for throughput, TP×PP composition across nodes, and what each choice costs in collectives or bubbles; read a distributed deployment's startup logs as a checksum of the sharding; and debug the classics (slow rank drags the ensemble, cross-node TP melting p99, PP starving at low traffic) from models you built rather than lore. Phase 15 splits the workload (prefill from decode) where this phase split the model.

vLLM Mastery — From Zero to Maintainer