Phase 10 — Mini-Build: tensor parallelism in numpy

You'll implement column- and row-parallel matmuls and show that splitting a layer across "GPUs" and recombining the pieces (all-gather / all-reduce) reproduces the single-GPU result, up to floating-point rounding. No real GPUs are needed: we simulate num_ranks shards with array slicing. This makes TP concrete and settles the "is the math still correct?" worry.

The task (lab-01)

A linear layer is y = x @ W.T, with W shape (out, in). Implement:

  • column_parallel(x, W, num_ranks) — split W's rows (output dim) across ranks; each rank computes its slice y_r = x @ W_r.T; concatenate (the all-gather). Must equal x @ W.T.
  • row_parallel(x, W, num_ranks) — split W's columns (input dim) and x's columns across ranks; each rank computes a partial y_r = x_r @ W_r.T over the whole output; sum them (the all-reduce). Must equal x @ W.T.
  • mlp_tp(x, W1, W2, num_ranks) — the real transformer pattern: W1 column-parallel (keep the output sharded, no gather), apply the activation per shard, then W2 row-parallel (one all-reduce). With the same y = x @ W.T convention as above, this must equal the dense relu(x @ W1.T) @ W2.T, using exactly one all-reduce.
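The first two functions can be sketched as follows. This is one possible shape for a solution, not the lab's reference implementation: it assumes x has shape (batch, in), that the split dimension divides evenly by num_ranks, and it uses np.concatenate and np.sum as stand-ins for the all-gather and all-reduce.

```python
import numpy as np

def column_parallel(x, W, num_ranks):
    """Shard W along its rows (output dim), then concatenate the output slices."""
    # Assumption: W.shape[0] is divisible by num_ranks (np.split raises otherwise).
    W_shards = np.split(W, num_ranks, axis=0)        # each (out / num_ranks, in)
    y_shards = [x @ W_r.T for W_r in W_shards]       # each (batch, out / num_ranks)
    return np.concatenate(y_shards, axis=-1)         # simulated all-gather

def row_parallel(x, W, num_ranks):
    """Shard W and x along the input dim, then sum the partial outputs."""
    # Assumption: W.shape[1] is divisible by num_ranks.
    W_shards = np.split(W, num_ranks, axis=1)        # each (out, in / num_ranks)
    x_shards = np.split(x, num_ranks, axis=-1)       # each (batch, in / num_ranks)
    partials = [x_r @ W_r.T for x_r, W_r in zip(x_shards, W_shards)]  # each (batch, out)
    return np.sum(partials, axis=0)                  # simulated all-reduce

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # (batch, in)
W = rng.standard_normal((6, 8))    # (out, in)
dense = x @ W.T

assert np.allclose(column_parallel(x, W, 3), dense)
assert np.allclose(row_parallel(x, W, 4), dense)
```

Column-parallel works because each output column depends on only one row of W; row-parallel works because a dot product over the input dimension can be split into chunks and summed.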

The point (the invariant)

x @ W.T == all_reduce(x_shard @ W_shard.T) for row-parallel, and the column→row pairing needs only one all-reduce per block. Your tests assert that the reconstruction matches the unsharded result to within floating-point tolerance, which demonstrates that TP is correct, not just plausible.
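A minimal sketch of the column→row pairing, under stated assumptions: W1 has shape (hidden, in) and W2 has shape (out, hidden), both following the y = x @ W.T convention from the task; the hidden dimension divides evenly by num_ranks; and the all_reduce helper and COMM_CALLS counter are hypothetical names introduced here only to make the "one all-reduce" claim checkable. The key observation is that ReLU is elementwise, so applying it to each hidden shard gives the same values as applying it to the full hidden activation, and no gather is needed between the two matmuls.

```python
import numpy as np

# Hypothetical bookkeeping, not part of the lab's API: counts simulated collectives.
COMM_CALLS = {"all_reduce": 0}

def all_reduce(partials):
    """Simulated all-reduce: sum the per-rank partial outputs."""
    COMM_CALLS["all_reduce"] += 1
    return np.sum(partials, axis=0)

def relu(a):
    return np.maximum(a, 0.0)

def mlp_tp(x, W1, W2, num_ranks):
    """Column-parallel W1, per-shard ReLU, row-parallel W2, one all-reduce."""
    W1_shards = np.split(W1, num_ranks, axis=0)   # split W1's output (hidden) dim
    W2_shards = np.split(W2, num_ranks, axis=1)   # split W2's input (hidden) dim
    partials = []
    for W1_r, W2_r in zip(W1_shards, W2_shards):
        h_r = relu(x @ W1_r.T)                    # (batch, hidden / num_ranks), stays sharded
        partials.append(h_r @ W2_r.T)             # (batch, out) partial sum
    return all_reduce(partials)                   # the only communication step

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))      # (batch, in)
W1 = rng.standard_normal((12, 8))    # (hidden, in)
W2 = rng.standard_normal((8, 12))    # (out, hidden)

dense = relu(x @ W1.T) @ W2.T
sharded = mlp_tp(x, W1, W2, num_ranks=4)

assert np.allclose(sharded, dense)
assert COMM_CALLS["all_reduce"] == 1
```

The comparison uses np.allclose rather than exact equality because summing partial products in a different order can change the last few bits of a floating-point result.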

Definition of done

pytest phase-10-distributed-inference/labs -q

Map to the real engine

| your numpy | real vLLM |
|---|---|
| column_parallel | ColumnParallelLinear (linear.py:410) |
| row_parallel + sum | RowParallelLinear + tensor_model_parallel_all_reduce (linear.py:1392, communication_op.py:12) |
| mlp_tp (col→row, one all-reduce) | the MLP/attention block's TP pattern |
| num_ranks, rank slicing | parallel_state.py world size / rank (:1849/:1854) |
| (running it for real) | MultiprocExecutor + N workers (multiproc_executor.py:102) |