Phase 10 — Cheatsheet: Distributed Inference
The five splits
| parallelism | splits | comms | where |
|---|---|---|---|
| TP tensor | each layer's weights | all-reduce every layer | within a node (NVLink) |
| PP pipeline | layers across GPUs | point-to-point + bubbles | across nodes |
| DP data | full replicas | none on the work; route requests | model must fit |
| EP expert | MoE experts across GPUs | all-to-all | MoE layers |
| CP context | one sequence's KV | along the sequence | ultra-long context |
TP math (the one to know)
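- column-parallel: split W by output cols → each rank holds a column slice of the output; all-gather only if the full output is needed. row-parallel: split W by input rows + split x to match → all-reduce (sum) of the partial outputs.
- block pattern: column then row → the column-sharded activation feeds the row-parallel layer directly, so one all-reduce per block; result matches single-GPU up to floating-point summation order.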
- TP all-reduces every layer → needs fast links → TP within a node, PP across nodes.
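The column-then-row pattern above can be checked numerically. The following is a minimal NumPy sketch that simulates the ranks in a single process; the function name `tp_mlp_block`, the ReLU activation, and the shapes are illustrative assumptions, not taken from any engine's code.

```python
import numpy as np

def tp_mlp_block(x, w1, w2, tp_size):
    """Simulate a column-parallel then row-parallel MLP block on tp_size ranks.

    Hypothetical helper for illustration; real engines run one process per GPU
    and use a collective library for the final sum.
    """
    # Column-parallel: each rank holds a slice of W1's output columns.
    w1_shards = np.split(w1, tp_size, axis=1)
    # Row-parallel: each rank holds the matching slice of W2's input rows.
    w2_shards = np.split(w2, tp_size, axis=0)

    partials = []
    for rank in range(tp_size):
        # The activation is elementwise, so it runs on the local column slice
        # with no communication.
        h_local = np.maximum(x @ w1_shards[rank], 0.0)
        # Each rank produces a full-shape partial sum of the block output.
        partials.append(h_local @ w2_shards[rank])

    # The single all-reduce (sum) of the block, simulated as a plain sum.
    return np.sum(partials, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w1 = rng.standard_normal((16, 64))
w2 = rng.standard_normal((64, 16))

reference = np.maximum(x @ w1, 0.0) @ w2        # single-GPU result
sharded = tp_mlp_block(x, w1, w2, tp_size=4)    # 4 simulated ranks
print(np.allclose(reference, sharded))
```

The comparison uses `np.allclose` rather than exact equality because the sharded version sums the partial products in a different order.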
Who runs it
EngineCore → MultiprocExecutor → N Worker processes (1/GPU) → ModelRunner. Collectives happen
inside the parallel Linear layers; groups/ranks in parallel_state.py. Model code unchanged for any
parallel size.
Combine for scale
e.g. TP=8 in-node + PP=2 across nodes + DP replicas + EP for MoE. Choosing the mix for a given model and SLA is the staff-level decision.
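To make the combination concrete, here is a small sketch of how a global rank could map to (DP, PP, TP) coordinates. It assumes TP is the innermost (fastest-varying) dimension and 8 GPUs per node, so each TP group stays on one node; the helper names `rank_coords` and `tp_groups` are hypothetical, and the real group construction in parallel_state.py may differ in detail.

```python
def rank_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_rank, pp_rank, tp_rank), assuming TP is innermost."""
    world_size = tp * pp * dp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank

def tp_groups(tp, pp, dp):
    """List the global ranks in each TP group (consecutive ranks under this layout)."""
    world_size = tp * pp * dp
    return [list(range(start, start + tp)) for start in range(0, world_size, tp)]

# Example: TP=8, PP=2, two DP replicas -> 32 GPUs on 4 nodes (assumed 8 GPUs/node).
TP, PP, DP, GPUS_PER_NODE = 8, 2, 2, 8
groups = tp_groups(TP, PP, DP)

# Every TP group lands on a single node, so its all-reduces stay on the fast links.
single_node = all(len({r // GPUS_PER_NODE for r in g}) == 1 for g in groups)
print(len(groups), single_node)
print(rank_coords(13, TP, PP, DP))
```

Under this layout the pipeline stages of one replica sit on different nodes, which matches the rule of thumb above: only the lighter point-to-point traffic crosses the node boundary.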
Key upstream
- distributed/parallel_state.py :1370 init · :1506 initialize_model_parallel · :1241 get_tp_group · :1849 tp_world_size
- distributed/communication_op.py :12 all_reduce · :17 all_gather
- layers/linear.py :410 ColumnParallelLinear · :975 QKVParallelLinear · :1392 RowParallelLinear
- v1/executor/multiproc_executor.py :102 MultiprocExecutor
- v1/worker/gpu_worker.py :109 Worker
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md