Phase 10 — Deep Dive: distributed inference in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0.

vllm/distributed/parallel_state.py     the source of truth for all parallel groups (TP/PP/DP/EP/CP)
vllm/distributed/communication_op.py   tensor_model_parallel_all_reduce / all_gather
vllm/model_executor/layers/linear.py   Column/Row/QKV ParallelLinear (TP in the layers)
vllm/v1/executor/multiproc_executor.py the N-worker executor
vllm/v1/worker/gpu_worker.py           one worker = one GPU = one model shard

1. Who's in which group: parallel_state.py
2. The collectives: communication_op.py
3. TP in the layers: linear.py
4. The executor and workers
5. PP, DP, EP, CP pointers
Reading checklist

1. Who's in which group: `parallel_state.py`

This file owns every process group. Key functions:

init_distributed_environment (:1370) and initialize_model_parallel (:1506) — set up the TP/PP/DP/EP/CP groups at startup from the configured sizes.
get_tp_group (:1241), get_pp_group (:1260) — the group a worker uses to communicate.
get_tensor_model_parallel_world_size (:1849) / _rank (:1854) — "how many TP peers, which one am I." The parallel layers read these to know how to shard.

Mental model: this module answers, for each worker, "who are my teammates for each kind of parallelism, and what's my index?" Everything else (the layers, the executor) consults it.

2. The collectives: `communication_op.py`

tensor_model_parallel_all_reduce (:12) and tensor_model_parallel_all_gather (:17) are the two operations TP needs (Step 2 of the guide). They wrap NCCL (the NVIDIA collective library) via device communicators (distributed/device_communicators/). An all-reduce sums a tensor across all TP ranks and gives everyone the result; an all-gather concatenates each rank's piece. These are the "combine notes" steps from the analogy.

3. TP in the layers: `linear.py`

This is where TP actually happens — and notice the model never calls a collective directly; the layers do.

ColumnParallelLinear (:410) — shards the weight by output dimension; each rank computes part of the output. Used for the first matmul in a block (QKV, gate/up). QKVParallelLinear (:975) and MergedColumnParallelLinear (:607) are specializations (they pack Q/K/V or gate/up into one sharded matmul).
RowParallelLinear (:1392) — shards by input dimension; each rank computes a partial of the full output, then all-reduces. Used for the second matmul (attention o_proj, MLP down).

The pairing (column then row) means the column output stays sharded and feeds the row input with no intervening communication — one all-reduce per block (guide Step 2). Read RowParallelLinear.forward and find the tensor_model_parallel_all_reduce call: that's the one communication. Your lab-01 reproduces this exact pattern and proves the result equals the unsharded matmul.

4. The executor and workers

vllm/v1/executor/multiproc_executor.py: class MultiprocExecutor(Executor) (:102), execute_model (:306). It spawns one worker process per GPU and broadcasts each step's SchedulerOutput to all of them; they run the forward in lockstep, exchanging collectives inside the layers, and rank 0 returns the sampled tokens. vllm/v1/worker/gpu_worker.py: class Worker (:109) holds one GPU's device, model shard, and KV cache; execute_model (:781) runs the shard. So the Phase 1 chain (EngineCore → Executor → Worker → ModelRunner) just widens to N workers for parallelism — the engine logic above it is unchanged.

5. PP, DP, EP, CP pointers

PP: get_pp_group + the model splitting layers across ranks; activations are sent rank→rank (point-to-point) between pipeline stages, with micro-batching to fill the bubble.
DP: replicas with request routing; also DP-attention for MoE models (attention DP while experts are EP).
EP: fused_moe/all2all_utils.py (Phase 7) + distributed/eplb/ (expert load balancing).
CP: context-parallel groups in parallel_state.py split one sequence's KV across ranks.

Reading checklist

initialize_model_parallel — what groups does it create, and from what sizes?
ColumnParallelLinear vs RowParallelLinear — what does each shard, and which all-reduces?
Find the single all_reduce in RowParallelLinear.forward.
MultiprocExecutor — what does it broadcast, and how many worker processes for TP=4?
Why is the model code unchanged whether TP=1 or TP=8?

Now build it: 02-mini-build.md, then the labs.

vLLM Mastery — From Zero to Maintainer