Phase 10 — Mini-Build: tensor parallelism in numpy

You'll implement column- and row-parallel matmuls and verify that splitting a layer across "GPUs" and recombining the pieces (all-gather for column-parallel, all-reduce for row-parallel) reproduces the single-GPU result up to floating-point rounding. No real GPUs are needed: we simulate num_ranks shards with array slicing. This makes TP concrete and settles the "is the math still correct?" worry for good.

The task (lab-01)
The point (the invariant)
Definition of done
Map to the real engine

The task (lab-01)

A linear layer is y = x @ W.T, with W shape (out, in). Implement:
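As a quick shape check, the convention above can be sketched in numpy. The helper name dense_linear and the sizes are illustrative assumptions, not part of the lab.

```python
import numpy as np

def dense_linear(x, W):
    """Unsharded reference: x is (batch, in), W is (out, in), result is (batch, out)."""
    return x @ W.T

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))   # batch=2, in=8
W = rng.standard_normal((6, 8))   # out=6, in=8
y = dense_linear(x, W)
print(y.shape)  # (2, 6)
```

Every sharded variant in this lab is compared against this dense result.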

column_parallel(x, W, num_ranks) — split W's rows (output dim) across ranks; each rank computes its slice y_r = x @ W_r.T; concatenate (the all-gather). Must equal x @ W.T.
row_parallel(x, W, num_ranks) — split W's columns (input dim) and x's columns across ranks; each rank computes a partial y_r = x_r @ W_r.T over the whole output; sum them (the all-reduce). Must equal x @ W.T.
mlp_tp(x, W1, W2, num_ranks) — the real transformer pattern: W1 column-parallel (keep output sharded), apply the activation per shard, W2 row-parallel (one all-reduce). Must equal the dense relu(x @ W1) @ W2, with exactly one all-reduce.
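One possible shape for the first two functions is sketched below. It assumes the sharded dimension divides evenly by num_ranks (np.split raises ValueError otherwise) and simulates the collectives with np.concatenate and a plain sum; the lab's reference solution may be organized differently.

```python
import numpy as np

def column_parallel(x, W, num_ranks):
    # Split W's rows (the output dim) into one shard per simulated rank.
    W_shards = np.split(W, num_ranks, axis=0)
    # Each rank computes its own slice of the output columns.
    outs = [x @ W_r.T for W_r in W_shards]
    # Simulated all-gather: concatenate the slices along the output dim.
    return np.concatenate(outs, axis=-1)

def row_parallel(x, W, num_ranks):
    # Split W's columns and x's columns (the input dim) the same way.
    W_shards = np.split(W, num_ranks, axis=1)
    x_shards = np.split(x, num_ranks, axis=-1)
    # Each rank computes a full-width partial output from its input slice.
    partials = [x_r @ W_r.T for x_r, W_r in zip(x_shards, W_shards)]
    # Simulated all-reduce: elementwise sum of the partials.
    return np.sum(partials, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # batch=4, in=8
W = rng.standard_normal((6, 8))   # out=6, in=8
dense = x @ W.T
print(np.allclose(column_parallel(x, W, 2), dense))
print(np.allclose(row_parallel(x, W, 2), dense))
```

Note the asymmetry: column-parallel only moves data (each rank owns distinct output columns), while row-parallel adds partial sums, so its result can differ from the dense one in the last few bits.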

The point (the invariant)

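x @ W.T == all_reduce(x_r @ W_r.T) for row-parallel, because the dot product over the input dimension splits into per-rank partial sums. The column→row pairing needs only one all-reduce per block: the activation is elementwise, so it applies to each output shard independently, and that shard is exactly the input slice the row-parallel layer expects. Your tests assert that the reconstruction equals the unsharded result to machine precision, confirming numerically what the algebra guarantees: TP is correct, not just plausible.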
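The one-all-reduce claim can be checked by counting. The sketch below assumes both weights use the (out, in) layout from the task, so the dense reference is relu(x @ W1.T) @ W2.T. CountingComm and the extra comm parameter are hypothetical additions for illustration; the lab's mlp_tp takes only (x, W1, W2, num_ranks).

```python
import numpy as np

class CountingComm:
    """Hypothetical stand-in for a communicator; counts simulated all-reduces."""
    def __init__(self):
        self.all_reduces = 0

    def all_reduce(self, partials):
        self.all_reduces += 1
        return np.sum(partials, axis=0)

def relu(a):
    return np.maximum(a, 0.0)

def mlp_tp(x, W1, W2, num_ranks, comm):
    # W1 is (hidden, in): column-parallel shards its rows (the hidden dim).
    W1_shards = np.split(W1, num_ranks, axis=0)
    # W2 is (out, hidden): row-parallel shards its columns (the same hidden dim).
    W2_shards = np.split(W2, num_ranks, axis=1)
    partials = []
    for W1_r, W2_r in zip(W1_shards, W2_shards):
        # The hidden activation stays sharded; relu is elementwise, so no communication.
        h_r = relu(x @ W1_r.T)
        partials.append(h_r @ W2_r.T)
    # The only collective in the block.
    return comm.all_reduce(partials)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
W1 = rng.standard_normal((16, 8))   # hidden=16, in=8
W2 = rng.standard_normal((8, 16))   # out=8, hidden=16
comm = CountingComm()
y = mlp_tp(x, W1, W2, num_ranks=4, comm=comm)
dense = relu(x @ W1.T) @ W2.T
print(np.allclose(y, dense), comm.all_reduces)
```

No all-gather is needed between the two layers because rank r's slice of the hidden activation is exactly the input slice that rank r's shard of W2 consumes.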

Definition of done

pytest phase-10-distributed-inference/labs -q

Map to the real engine

| Your numpy | Real vLLM |
| --- | --- |
| `column_parallel` | `ColumnParallelLinear` (`linear.py:410`) |
| `row_parallel` + sum | `RowParallelLinear` + `tensor_model_parallel_all_reduce` (`linear.py:1392`, `communication_op.py:12`) |
| `mlp_tp` (col→row, one all-reduce) | the MLP/attention block's TP pattern |
| `num_ranks`, rank slicing | `parallel_state.py` world size / rank (`:1849`/`:1854`) |
| (running it for real) | `MultiprocExecutor` + N workers (`multiproc_executor.py:102`) |
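In a real engine each worker knows only its own rank and the world size, and takes its shard from those two numbers instead of looping over all ranks. A minimal numpy sketch of that view follows; shard_bounds, column_shard, and row_shard are hypothetical helper names, not vLLM's API.

```python
import numpy as np

def shard_bounds(dim_size, rank, world_size):
    """Half-open [start, stop) range owned by `rank` along a sharded dim."""
    if dim_size % world_size != 0:
        raise ValueError("dim_size must be divisible by world_size")
    per_rank = dim_size // world_size
    start = rank * per_rank
    return start, start + per_rank

def column_shard(W, rank, world_size):
    # Column-parallel: this rank owns a block of W's rows (the output dim).
    lo, hi = shard_bounds(W.shape[0], rank, world_size)
    return W[lo:hi, :]

def row_shard(W, rank, world_size):
    # Row-parallel: this rank owns a block of W's columns (the input dim).
    lo, hi = shard_bounds(W.shape[1], rank, world_size)
    return W[:, lo:hi]

W = np.arange(8 * 12, dtype=np.float64).reshape(8, 12)   # out=8, in=12
world_size = 4
print(shard_bounds(8, 1, world_size))        # (2, 4)
print(column_shard(W, 1, world_size).shape)  # (2, 12)
print(row_shard(W, 1, world_size).shape)     # (8, 3)
```

Looping rank over range(world_size) and recombining the shards recovers the num_ranks simulation used in the lab.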

vLLM Mastery — From Zero to Maintainer
