Phase 10 — Cheatsheet: Distributed Inference

The five splits
TP math (the one to know)
Who runs it
Combine for scale
Key upstream

The five splits

parallelism	splits	comms	where
TP (tensor)	each layer's weights	all-reduce every layer	within a node (NVLink)
PP (pipeline)	layers across GPUs	point-to-point + bubbles	across nodes
DP (data)	full replicas	none on the work; route requests	model must fit
EP (expert)	MoE experts across GPUs	all-to-all	MoE layers
CP (context)	one sequence's KV	along the sequence	ultra-long context
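The "bubbles" in the PP row can be made concrete with a small calculation. This is a minimal sketch, assuming every stage takes the same time per micro-batch and transfers are free; `bubble_fraction` is a hypothetical helper name, not something from vLLM.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle share of a forward-only pipeline under the equal-stage-time assumption.

    The whole batch takes (microbatches + stages - 1) time slots, and each
    stage is busy for only `microbatches` of them.
    """
    total_slots = microbatches + stages - 1
    return (stages - 1) / total_slots


# One request at a time through 2 stages: each GPU idles half the time.
print(bubble_fraction(stages=2, microbatches=1))
# Keeping 8 micro-batches in flight shrinks the idle share.
print(bubble_fraction(stages=2, microbatches=8))
```

Under these assumptions the bubble shrinks as more micro-batches are kept in flight, which is why PP favors throughput over single-request latency.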

TP math (the one to know)

column-parallel: split W by output cols → all-gather. row-parallel: split W by input rows + split x → all-reduce (sum).
block pattern: column then row → one all-reduce per block; result identical to single-GPU.
TP all-reduces every layer → needs fast links → TP within a node, PP across nodes.
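The column-then-row pattern above can be checked with a toy example. This is a pure-Python sketch, not vLLM code: ranks are simulated as list entries, the all-reduce is a plain elementwise sum, and all helper names (`split_cols`, `split_rows`, `all_reduce_sum`, `tp_block`) are made up for illustration. The elementwise activation that normally sits between the two matmuls is omitted; it would act on each rank's column slice independently.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def split_cols(w, parts):
    """Split a matrix into `parts` equal column shards (column-parallel)."""
    step = len(w[0]) // parts
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(parts)]


def split_rows(w, parts):
    """Split a matrix into `parts` equal row shards (row-parallel)."""
    step = len(w) // parts
    return [w[r * step:(r + 1) * step] for r in range(parts)]


def all_reduce_sum(partials):
    """Simulated all-reduce: elementwise sum of every rank's partial output."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]


def tp_block(x, w1, w2, tp):
    """Column-parallel w1, then row-parallel w2, with one all-reduce at the end."""
    w1_shards = split_cols(w1, tp)  # each rank owns some output columns of w1
    w2_shards = split_rows(w2, tp)  # each rank owns the matching input rows of w2
    partials = []
    for rank in range(tp):
        hidden = matmul(x, w1_shards[rank])               # local slice, no comms needed
        partials.append(matmul(hidden, w2_shards[rank]))  # partial sum of the output
    return all_reduce_sum(partials)


x = [[1, 2], [3, 4]]                    # 2 tokens, hidden size 2
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]       # 2 x 4 up-projection
w2 = [[1, 2], [0, 1], [2, 0], [1, 1]]   # 4 x 2 down-projection

reference = matmul(matmul(x, w1), w2)   # what a single GPU would compute
sharded = tp_block(x, w1, w2, tp=2)
assert sharded == reference
```

Because the column shards of the first weight line up with the row shards of the second, no all-gather is needed between them; the only collective is the final sum.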

Who runs it

EngineCore → MultiprocExecutor → N Worker processes (1/GPU) → ModelRunner. Collectives happen inside the parallel Linear layers; groups/ranks live in parallel_state.py. The model code is the same for any parallel size.

Combine for scale

e.g. TP=8 in-node + PP=2 across nodes + DP replicas + EP for MoE. Choosing the mix for a given model and SLA is the staff-level decision.
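A quick way to sanity-check a mix is to count GPUs and see where a given rank lands. The sketch below assumes a layout where TP varies fastest, then PP, then DP; that ordering and the helper names are illustrative assumptions, not a statement of how any particular engine numbers its ranks. EP is left out because how expert groups map onto ranks varies by engine.

```python
def world_size(tp: int, pp: int, dp: int) -> int:
    """Total GPUs needed: one per (dp replica, pp stage, tp shard)."""
    return tp * pp * dp


def rank_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    """Map a global rank to (dp_rank, pp_rank, tp_rank).

    Assumed layout: TP is the fastest-varying dimension, so the tp ranks of
    one stage are contiguous and can sit on the same node.
    """
    if not 0 <= rank < world_size(tp, pp, dp):
        raise ValueError("rank out of range")
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank


# TP=8 in-node, PP=2 across nodes, 2 DP replicas.
print(world_size(tp=8, pp=2, dp=2))        # 32
print(rank_coords(13, tp=8, pp=2, dp=2))   # (0, 1, 5)
```

With TP=8 and 8 GPUs per node, each pipeline stage fills exactly one node under this layout, which matches the "TP within a node, PP across nodes" rule.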

Key upstream

distributed/parallel_state.py:1370 init :1506 initialize_model_parallel :1241 get_tp_group :1849 tp_world_size
distributed/communication_op.py:12 all_reduce :17 all_gather
layers/linear.py:410 ColumnParallelLinear :975 QKVParallelLinear :1392 RowParallelLinear
v1/executor/multiproc_executor.py:102 MultiprocExecutor · v1/worker/gpu_worker.py:109 Worker

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

vLLM Mastery — From Zero to Maintainer
