Phase 10 — Cheatsheet: Distributed Inference

The five splits

| | splits | comms | where |
|---|---|---|---|
| TP tensor | each layer's weights | all-reduce every layer | within a node (NVLink) |
| PP pipeline | layers across GPUs | point-to-point + bubbles | across nodes |
| DP data | full replicas | none on the work; route requests | model must fit |
| EP expert | MoE experts across GPUs | all-to-all | MoE layers |
| CP context | one sequence's KV | along the sequence | ultra-long context |

TP math (the one to know)

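  • column-parallel: split W by output columns; each GPU holds a slice of the output → all-gather to reassemble.
  • row-parallel: split W by input rows and split x to match; each GPU holds a partial sum → all-reduce (sum).
  • block pattern: column then row, so the column output feeds the row input without an all-gather → one all-reduce per block; result matches single-GPU up to floating-point summation order.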
  • TP all-reduces every layer → needs fast links → TP within a node, PP across nodes.
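
The column-then-row pattern can be checked on paper-sized matrices. The sketch below is a pure-Python simulation, not vLLM code: the "GPUs" are loop iterations, and every helper name (`split_cols`, `split_rows`, `all_gather_cols`, `all_reduce_sum`, `tp_block`) is hypothetical. It assumes an MLP-style block with an elementwise ReLU between the two matmuls and uses integers so the comparison is exact; with floats the results would agree only up to summation order.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def split_cols(w, n):
    """Column-parallel sharding: each rank gets a slice of the output columns."""
    step = len(w[0]) // n
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(n)]

def split_rows(w, n):
    """Row-parallel sharding: each rank gets a slice of the input rows."""
    step = len(w) // n
    return [w[i * step:(i + 1) * step] for i in range(n)]

def all_gather_cols(shards):
    """Simulated all-gather: concatenate per-rank outputs along the column axis."""
    return [sum((s[r] for s in shards), []) for r in range(len(shards[0]))]

def all_reduce_sum(partials):
    """Simulated all-reduce: elementwise sum of per-rank partial results."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def tp_block(x, a, b, tp):
    """Column-parallel A, then row-parallel B, with a single all-reduce at the end."""
    partials = []
    for a_i, b_i in zip(split_cols(a, tp), split_rows(b, tp)):
        h_i = relu(matmul(x, a_i))         # stays sharded: no communication here
        partials.append(matmul(h_i, b_i))  # each rank holds a partial sum
    return all_reduce_sum(partials)

def reference_block(x, a, b):
    """Single-GPU version of the same block."""
    return matmul(relu(matmul(x, a)), b)

X = [[1, -2, 3, 0], [2, 1, -1, 4]]
A = [[1, 0, -1, 2], [0, 1, 2, -1], [3, -2, 1, 0], [-1, 1, 0, 1]]
B = [[1, 2], [0, -1], [2, 1], [-1, 3]]

for tp in (1, 2, 4):
    assert tp_block(X, A, B, tp) == reference_block(X, A, B)
```

The elementwise activation is what makes column-first the right order: it can be applied to each column shard independently, so the only communication in the block is the final sum.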

Who runs it

EngineCore → MultiprocExecutor → N Worker processes (1/GPU) → ModelRunner. Collectives happen inside the parallel Linear layers; groups/ranks in parallel_state.py. Model code unchanged for any parallel size.
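
A toy version of that layering shows why the model code never has to mention ranks. This is a sketch under loud assumptions: threads stand in for the worker processes, `ToyGroup` stands in for a TP group, and `ToyRowParallelLinear`, `model_forward`, and `run` are hypothetical names, not the vLLM classes.

```python
import threading

class ToyGroup:
    """Hypothetical stand-in for a TP group: sum all-reduce across worker threads."""
    def __init__(self, world_size):
        self.world_size = world_size
        self._slots = [None] * world_size
        self._barrier = threading.Barrier(world_size)

    def all_reduce(self, rank, vec):
        self._slots[rank] = vec
        self._barrier.wait()  # every rank has written its contribution
        total = [sum(col) for col in zip(*self._slots)]
        self._barrier.wait()  # every rank has read before slots are reused
        return total

class ToyRowParallelLinear:
    """Holds one slice of W's input rows; the collective lives inside forward()."""
    def __init__(self, weight, rank, group):
        rows = len(weight) // group.world_size
        self.lo = rank * rows
        self.w = weight[self.lo:self.lo + rows]
        self.rank, self.group = rank, group

    def forward(self, x):
        x_shard = x[self.lo:self.lo + len(self.w)]
        partial = [sum(xi * row[j] for xi, row in zip(x_shard, self.w))
                   for j in range(len(self.w[0]))]
        return self.group.all_reduce(self.rank, partial)

def model_forward(layer, x):
    """The 'model code': identical for any parallel size."""
    return layer.forward(x)

def run(world_size, weight, x):
    """Hypothetical executor: one worker per rank; returns every rank's output."""
    group = ToyGroup(world_size)
    out = [None] * world_size

    def worker(rank):
        layer = ToyRowParallelLinear(weight, rank, group)
        out[rank] = model_forward(layer, x)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 0, 2, 1]
print(run(2, W, x))  # [[18, 22], [18, 22]]: both ranks end with the full result
```

Changing `world_size` changes only how the weight is sliced and how many workers join the collective; `model_forward` is untouched.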

Combine for scale

e.g. TP=8 within a node + PP=2 across nodes + DP replicas on top + EP for MoE layers. Choosing the mix for a given model and SLA is the staff-level decision.
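
The arithmetic behind such a mix is short enough to sketch. The helpers below are hypothetical and rest on stated assumptions: ranks are laid out with TP varying fastest, then PP, then DP (so a TP group is a run of consecutive ranks), each node has 8 GPUs, and the memory figure counts weights only, ignoring KV cache and activations. EP is left out for brevity.

```python
def world_size(tp, pp, dp):
    """Total GPUs needed for a TP x PP x DP layout."""
    return tp * pp * dp

def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_rank, pp_rank, tp_rank), assuming TP is innermost."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range")
    return rank // (tp * pp), (rank // tp) % pp, rank % tp

def tp_groups(tp, pp, dp):
    """Ranks that all-reduce together: consecutive runs of length tp."""
    n = world_size(tp, pp, dp)
    return [list(range(start, start + tp)) for start in range(0, n, tp)]

def weight_gib_per_gpu(params_billion, bytes_per_param, tp, pp):
    """Rough weight-only footprint per GPU; DP replicas do not shrink it."""
    return params_billion * 1e9 * bytes_per_param / (tp * pp) / 2**30

TP, PP, DP, GPUS_PER_NODE = 8, 2, 2, 8  # illustrative values, not a recommendation

print(world_size(TP, PP, DP))          # 32
print(rank_to_coords(13, TP, PP, DP))  # (0, 1, 5)

# With TP innermost and TP == GPUs per node, every TP group sits on one node.
for group in tp_groups(TP, PP, DP):
    assert len({r // GPUS_PER_NODE for r in group}) == 1
```

Under this layout the all-reduce traffic stays on the fast in-node links and only the PP point-to-point sends cross nodes, which is the placement the TP and PP rows of the table call for.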

Key upstream

  • distributed/parallel_state.py:1370 init :1506 initialize_model_parallel :1241 get_tp_group :1849 tp_world_size
  • distributed/communication_op.py:12 all_reduce :17 all_gather
  • layers/linear.py:410 ColumnParallelLinear :975 QKVParallelLinear :1392 RowParallelLinear
  • v1/executor/multiproc_executor.py:102 MultiprocExecutor · v1/worker/gpu_worker.py:109 Worker

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md