Phase 10 — Mini-Build: tensor parallelism in numpy
You'll implement column- and row-parallel matmuls and verify that splitting a layer across "GPUs"
and recombining the pieces (all-gather / all-reduce) reproduces the single-GPU result to
floating-point precision. No real GPUs: we simulate num_ranks shards with array slicing. This
makes TP concrete and settles the "is the math still correct?" worry.
The task (lab-01)
A linear layer is `y = x @ W.T`, with `W` of shape `(out, in)`. Implement:

- `column_parallel(x, W, num_ranks)`: split `W`'s rows (output dim) across ranks; each rank computes its slice `y_r = x @ W_r.T`; concatenate (the all-gather). Must equal `x @ W.T`.
- `row_parallel(x, W, num_ranks)`: split `W`'s columns (input dim) and `x`'s columns across ranks; each rank computes a partial `y_r = x_r @ W_r.T` over the whole output; sum them (the all-reduce). Must equal `x @ W.T`.
- `mlp_tp(x, W1, W2, num_ranks)`: the real transformer pattern. `W1` is column-parallel (keep output sharded), the activation is applied per shard, and `W2` is row-parallel (one all-reduce). Must equal the dense `relu(x @ W1) @ W2`, with exactly one all-reduce.
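The first two functions can be sketched as below. This is one possible shape for a solution, not the lab's reference implementation: it assumes `np.array_split` for sharding (so uneven splits are tolerated) and uses `np.allclose` for the comparison; the random test shapes are arbitrary.

```python
import numpy as np


def column_parallel(x, W, num_ranks):
    """Shard W's rows (output dim); each rank produces a slice of y."""
    # W has shape (out, in); splitting along axis 0 hands each rank
    # a subset of the output features.
    W_shards = np.array_split(W, num_ranks, axis=0)
    # Every simulated rank sees the full input and computes its output slice.
    y_shards = [x @ W_r.T for W_r in W_shards]
    # Concatenating along the feature axis plays the role of the all-gather.
    return np.concatenate(y_shards, axis=-1)


def row_parallel(x, W, num_ranks):
    """Shard W's columns (input dim) and x's columns; sum the partials."""
    W_shards = np.array_split(W, num_ranks, axis=1)
    x_shards = np.array_split(x, num_ranks, axis=-1)
    # Each rank computes a full-size output from its slice of the input dim.
    partials = [x_r @ W_r.T for x_r, W_r in zip(x_shards, W_shards)]
    # Summing the partial outputs plays the role of the all-reduce.
    return np.sum(partials, axis=0)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # (batch, in)
W = rng.standard_normal((6, 8))   # (out, in)
dense = x @ W.T

assert np.allclose(column_parallel(x, W, num_ranks=2), dense)
assert np.allclose(row_parallel(x, W, num_ranks=4), dense)
```

Note that the row-parallel sum adds terms in a different order than the dense matmul, so the comparison uses a tolerance rather than bitwise equality.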
The point (the invariant)
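`x @ W == all_reduce(x_shard @ W_shard)` for row-parallel: summing the per-rank partial products
over the sharded inner dimension recovers the full matmul. The column→row pairing needs only one
all-reduce per block, because an elementwise activation can be applied to each column shard
independently. Your tests assert that the reconstruction equals the unsharded result to
floating-point tolerance, which shows TP is correct, not just plausible.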
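A minimal sketch of the column→row pairing follows. It assumes the shapes implied by the dense reference `relu(x @ W1) @ W2`, that is `W1` of shape `(d_in, d_hidden)` and `W2` of shape `(d_hidden, d_out)`. The `all_reduce` helper and the `comm_log` list are hypothetical names introduced here only to count collectives; they are not part of the lab or of vLLM.

```python
import numpy as np

comm_log = []  # hypothetical: records every simulated collective


def all_reduce(partials):
    """Simulated all-reduce: sum full-size partial outputs across ranks."""
    comm_log.append("all_reduce")
    return np.sum(partials, axis=0)


def relu(a):
    return np.maximum(a, 0.0)


def mlp_tp(x, W1, W2, num_ranks):
    """Column-parallel W1, per-shard ReLU, row-parallel W2, one all-reduce."""
    # Column-parallel: each rank owns a slice of W1's hidden (column) dim.
    W1_shards = np.array_split(W1, num_ranks, axis=1)
    # Row-parallel: W2 is split along the same hidden dim (its rows).
    W2_shards = np.array_split(W2, num_ranks, axis=0)

    partials = []
    for W1_r, W2_r in zip(W1_shards, W2_shards):
        # ReLU is elementwise, so it can run on the shard with no all-gather.
        h_r = relu(x @ W1_r)
        # Each rank's partial already has the full output shape.
        partials.append(h_r @ W2_r)
    return all_reduce(partials)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W1 = rng.standard_normal((8, 16))
W2 = rng.standard_normal((16, 8))

out = mlp_tp(x, W1, W2, num_ranks=4)
assert np.allclose(out, relu(x @ W1) @ W2)
assert comm_log == ["all_reduce"]  # exactly one collective for the block
```

The hidden activation never leaves its rank: the only communication is the final sum, which is what "one all-reduce per block" means here.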
Definition of done
pytest phase-10-distributed-inference/labs -q
Map to the real engine
| your numpy | real vLLM |
|---|---|
| column_parallel | ColumnParallelLinear (linear.py:410) |
| row_parallel + sum | RowParallelLinear + tensor_model_parallel_all_reduce (linear.py:1392, communication_op.py:12) |
| mlp_tp (col→row, one all-reduce) | the MLP/attention block's TP pattern |
| num_ranks, rank slicing | parallel_state.py world size / rank (:1849/:1854) |
| (running it for real) | MultiprocExecutor + N workers (multiproc_executor.py:102) |