Phase 10 — Deep Dive: distributed inference in real vLLM
Paths relative to
upstream/atv0.22.1 @ 0decac0.vllm/distributed/parallel_state.py the source of truth for all parallel groups (TP/PP/DP/EP/CP) vllm/distributed/communication_op.py tensor_model_parallel_all_reduce / all_gather vllm/model_executor/layers/linear.py Column/Row/QKV ParallelLinear (TP in the layers) vllm/v1/executor/multiproc_executor.py the N-worker executor vllm/v1/worker/gpu_worker.py one worker = one GPU = one model shard
Contents
- 1. Who's in which group:
parallel_state.py - 2. The collectives:
communication_op.py - 3. TP in the layers:
linear.py - 4. The executor and workers
- 5. PP, DP, EP, CP pointers
- Reading checklist
1. Who's in which group: parallel_state.py
This file owns every process group. Key functions:
init_distributed_environment(:1370) andinitialize_model_parallel(:1506) — set up the TP/PP/DP/EP/CP groups at startup from the configured sizes.get_tp_group(:1241),get_pp_group(:1260) — the group a worker uses to communicate.get_tensor_model_parallel_world_size(:1849) /_rank(:1854) — "how many TP peers, which one am I." The parallel layers read these to know how to shard.
Mental model: this module answers, for each worker, "who are my teammates for each kind of parallelism, and what's my index?" Everything else (the layers, the executor) consults it.
2. The collectives: communication_op.py
tensor_model_parallel_all_reduce (:12) and tensor_model_parallel_all_gather (:17) are the
two operations TP needs (Step 2 of the guide). They wrap NCCL (the NVIDIA collective library) via
device communicators (distributed/device_communicators/). An all-reduce sums a tensor across
all TP ranks and gives everyone the result; an all-gather concatenates each rank's piece. These
are the "combine notes" steps from the analogy.
3. TP in the layers: linear.py
This is where TP actually happens — and notice the model never calls a collective directly; the layers do.
ColumnParallelLinear(:410) — shards the weight by output dimension; each rank computes part of the output. Used for the first matmul in a block (QKV, gate/up).QKVParallelLinear(:975) andMergedColumnParallelLinear(:607) are specializations (they pack Q/K/V or gate/up into one sharded matmul).RowParallelLinear(:1392) — shards by input dimension; each rank computes a partial of the full output, then all-reduces. Used for the second matmul (attentiono_proj, MLPdown).
The pairing (column then row) means the column output stays sharded and feeds the row input with no
intervening communication — one all-reduce per block (guide Step 2). Read RowParallelLinear.forward
and find the tensor_model_parallel_all_reduce call: that's the one communication. Your lab-01
reproduces this exact pattern and proves the result equals the unsharded matmul.
4. The executor and workers
vllm/v1/executor/multiproc_executor.py: class MultiprocExecutor(Executor) (:102),
execute_model (:306). It spawns one worker process per GPU and broadcasts each step's
SchedulerOutput to all of them; they run the forward in lockstep, exchanging collectives inside
the layers, and rank 0 returns the sampled tokens. vllm/v1/worker/gpu_worker.py: class Worker
(:109) holds one GPU's device, model shard, and KV cache; execute_model (:781) runs the
shard. So the Phase 1 chain (EngineCore → Executor → Worker → ModelRunner) just widens to N
workers for parallelism — the engine logic above it is unchanged.
5. PP, DP, EP, CP pointers
- PP:
get_pp_group+ the model splitting layers across ranks; activations are sent rank→rank (point-to-point) between pipeline stages, with micro-batching to fill the bubble. - DP: replicas with request routing; also DP-attention for MoE models (attention DP while experts are EP).
- EP:
fused_moe/all2all_utils.py(Phase 7) +distributed/eplb/(expert load balancing). - CP: context-parallel groups in
parallel_state.pysplit one sequence's KV across ranks.
Reading checklist
-
initialize_model_parallel— what groups does it create, and from what sizes? -
ColumnParallelLinearvsRowParallelLinear— what does each shard, and which all-reduces? -
Find the single
all_reduceinRowParallelLinear.forward. -
MultiprocExecutor— what does it broadcast, and how many worker processes for TP=4? - Why is the model code unchanged whether TP=1 or TP=8?
Now build it: 02-mini-build.md, then the labs.