Phase 10 — Interview Questions: Distributed Inference

Q1. TP vs PP — when do you reach for each?

Model answer

TP splits every layer's math across GPUs, so all GPUs work on each token — great for latency, but it all-reduces every layer, so it needs fast intra-node links (NVLink). PP splits the layers across GPUs with cheap point-to-point handoffs, so it scales across nodes/memory, but adds pipeline bubbles (mitigated by micro-batching) and a bit of latency. Rule of thumb: TP within a node, PP across nodes; combine them for very large models.
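A rough back-of-the-envelope sketch of why the rule of thumb holds. It assumes a ring all-reduce (each GPU sends 2(n-1)/n times the payload), two all-reduces per transformer layer under TP, and one hidden-size activation handed across each PP stage boundary. The model sizes and helper names are illustrative, not taken from vLLM.

```python
def ring_all_reduce_bytes(payload_bytes: int, world_size: int) -> float:
    """Bytes each GPU sends in a ring all-reduce of `payload_bytes`."""
    if world_size == 1:
        return 0.0
    return 2 * (world_size - 1) / world_size * payload_bytes


def tp_bytes_per_token(hidden: int, layers: int, tp: int, dtype_bytes: int = 2) -> float:
    # Assumption: two all-reduces per layer (attention + MLP),
    # each over one hidden-size activation vector per token.
    per_all_reduce = ring_all_reduce_bytes(hidden * dtype_bytes, tp)
    return 2 * layers * per_all_reduce


def pp_bytes_per_token(hidden: int, pp: int, dtype_bytes: int = 2) -> int:
    # One point-to-point send of a hidden-size activation per stage boundary,
    # summed over all boundaries.
    return (pp - 1) * hidden * dtype_bytes


hidden, layers = 8192, 80  # illustrative sizes, not tied to a specific model
tp_cost = tp_bytes_per_token(hidden, layers, tp=8)
pp_cost = pp_bytes_per_token(hidden, pp=8)
print(f"TP=8: {tp_cost / 1e6:.2f} MB sent by each GPU per token")
print(f"PP=8: {pp_cost / 1e6:.2f} MB sent across all stage boundaries per token")
```

Under these assumptions TP moves well over an order of magnitude more data per token than PP, and it does so on the critical path of every layer, which is why it wants NVLink-class bandwidth.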

Q2. Walk me through tensor-parallel matmuls.

Model answer

Column-parallel splits the weight by output columns: each GPU computes a slice of the output, and the slices are combined by all-gather. Row-parallel splits the weight by input rows (and the input along the matching dimension): each GPU computes a partial sum of the whole output, and the partials are combined by all-reduce (sum). vLLM does the first matmul in a sub-block (attention or MLP) column-parallel and the second row-parallel, so the column output stays sharded and feeds the row input directly, which means one all-reduce per sub-block. The combined result matches single-GPU up to floating-point rounding, since the all-reduce can change the summation order (lab-01 checks the equivalence).
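The column-then-row pattern can be checked on paper-sized matrices. The sketch below is a plain-Python simulation, not vLLM code: the "GPUs" are loop iterations, `all_reduce_sum` and `all_gather_cols` are stand-ins for the real collectives, and all helper names are made up for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(v, 0) for v in row] for row in m]


def split_cols(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Element-wise sum of same-shaped matrices (stand-in for all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def all_gather_cols(shards):
    """Concatenate column shards (stand-in for all-gather)."""
    return [[v for shard_row in rows for v in shard_row] for rows in zip(*shards)]


TP = 2
# Integer inputs keep the equality checks exact.
x = [[1, 2, 3, 4], [0, -1, 2, 1]]                            # 2 x 4 activations
A = [[(i + j) % 5 - 2 for j in range(6)] for i in range(4)]  # 4 x 6, first matmul
B = [[(i * j) % 3 - 1 for j in range(4)] for i in range(6)]  # 6 x 4, second matmul

reference = matmul(relu(matmul(x, A)), B)  # "single-GPU" result

A_shards = split_cols(A, TP)  # column-parallel weight shards
B_shards = split_rows(B, TP)  # row-parallel weight shards
partials = []
for rank in range(TP):
    hidden_shard = relu(matmul(x, A_shards[rank]))         # stays sharded, no comms
    partials.append(matmul(hidden_shard, B_shards[rank]))  # partial of full output
output = all_reduce_sum(partials)                          # the single all-reduce

gathered = all_gather_cols([matmul(x, shard) for shard in A_shards])
print(output == reference, gathered == matmul(x, A))
```

The element-wise activation is what makes column-first convenient: each rank can apply it to its own slice without seeing the others.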

Q3. What's a pipeline bubble and how is it reduced?

Model answer

In PP, downstream stages sit idle while the first micro-batch works its way through the earlier stages, and upstream stages sit idle while the last one drains; that wasted GPU time is the bubble. Splitting the work into many micro-batches keeps the pipeline full: once it's primed, every stage is always working on some micro-batch. The bubble shrinks as the number of micro-batches grows but never fully disappears.
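To put numbers on it, here is a minimal sketch of a forward-only pipeline, assuming every stage takes one time step per micro-batch and communication is free. Under those assumptions the idle fraction is (p - 1) / (m + p - 1) for p stages and m micro-batches. The function names are illustrative.

```python
from fractions import Fraction


def bubble_fraction(stages: int, micro_batches: int) -> Fraction:
    """Idle fraction of the schedule: (p - 1) / (m + p - 1)."""
    return Fraction(stages - 1, micro_batches + stages - 1)


def simulate(stages: int, micro_batches: int):
    """Timeline per stage: micro-batch index when busy, '.' when idle."""
    total_steps = micro_batches + stages - 1
    timeline = []
    for stage in range(stages):
        row = []
        for step in range(total_steps):
            batch = step - stage  # stage s starts micro-batch b at step s + b
            row.append(str(batch) if 0 <= batch < micro_batches else ".")
        timeline.append(row)
    return timeline


def idle_fraction(timeline) -> Fraction:
    """Count idle slots directly from a simulated timeline."""
    slots = [cell for row in timeline for cell in row]
    return Fraction(slots.count("."), len(slots))


for row in simulate(stages=4, micro_batches=4):
    print(" ".join(row))

for m in (1, 4, 16, 64):
    print(f"micro-batches={m:3d}  bubble={float(bubble_fraction(4, m)):.2f}")
```

The printed timeline shows the idle triangles at the start (fill) and end (drain) of the schedule; the table shows them shrinking relative to useful work as the micro-batch count grows.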

Q4. Why does MoE motivate expert parallelism + data-parallel attention?

Model answer

Experts are independent FFNs, so placing whole experts on different GPUs (EP) scales expert capacity with just an all-to-all to route tokens. Attention has different parallelism economics, so it's often run data-parallel across the same GPUs to balance work. Mixing EP (experts) with DP/TP (attention) is common for large MoE models; the main risks are all-to-all cost and expert load imbalance.
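The two risks named above can be seen in a toy dispatch step. This sketch assumes top-1 routing and a contiguous expert-to-rank placement; the router output is hard-coded and hypothetical, and `dispatch` only models which tokens the all-to-all would deliver to each rank.

```python
NUM_EXPERTS = 8
EP_SIZE = 4                                # GPUs in the expert-parallel group
EXPERTS_PER_RANK = NUM_EXPERTS // EP_SIZE  # whole experts per GPU


def owner_rank(expert_id: int) -> int:
    """Assumed placement: experts 0-1 on rank 0, 2-3 on rank 1, and so on."""
    return expert_id // EXPERTS_PER_RANK


def dispatch(token_experts):
    """Group token indices by destination rank (what the all-to-all delivers)."""
    buckets = {rank: [] for rank in range(EP_SIZE)}
    for token_idx, expert_id in enumerate(token_experts):
        buckets[owner_rank(expert_id)].append(token_idx)
    return buckets


def imbalance(buckets) -> float:
    """Max per-rank load divided by mean load; 1.0 means perfectly balanced."""
    loads = [len(tokens) for tokens in buckets.values()]
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical router output: the chosen expert for each of 12 tokens.
token_experts = [0, 1, 1, 0, 5, 1, 0, 7, 1, 0, 2, 1]
buckets = dispatch(token_experts)
for rank, tokens in buckets.items():
    print(f"rank {rank}: {len(tokens)} tokens {tokens}")
print(f"imbalance factor: {imbalance(buckets):.1f}")
```

With a skewed router like this one, a single rank receives most of the tokens while the others wait, which is the load-imbalance problem in miniature.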

Q5. How does vLLM run the same model on 1 or 64 GPUs unchanged?

Model answer

The model uses parallel layers (ColumnParallelLinear/RowParallelLinear) that internally do the collectives, and parallel_state.py holds the group/rank bookkeeping. For multi-GPU the Executor becomes a MultiprocExecutor that spawns one worker process per GPU, each holding a shard, running in lockstep. The engine logic above (scheduler, sampler) and the model code are identical — only the executor fans out.
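The separation of concerns can be sketched in a few lines. This is a toy, not vLLM's implementation: `ToyGroup`, `model_forward`, and `execute` are hypothetical names, and threads stand in for the per-GPU worker processes so the example stays in one file. The point is that `model_forward` never changes; only the fan-out does.

```python
import threading


class ToyGroup:
    """Hypothetical in-process stand-in for a tensor-parallel group."""

    def __init__(self, world_size: int):
        self.world_size = world_size
        self._slots = [None] * world_size
        self._barrier = threading.Barrier(world_size)

    def all_reduce(self, rank: int, value):
        self._slots[rank] = value
        self._barrier.wait()  # every rank has written its partial
        total = sum(self._slots)
        self._barrier.wait()  # every rank has read before slots are reused
        return total


def model_forward(x, weight_shard, rank, group):
    """'Model code': a row-parallel dot product, identical for any world size."""
    start = rank * len(weight_shard)
    x_shard = x[start:start + len(weight_shard)]
    partial = sum(xi * wi for xi, wi in zip(x_shard, weight_shard))
    return group.all_reduce(rank, partial)


def execute(x, weight, world_size: int):
    """'Executor': shards the weight and fans out one worker per rank."""
    group = ToyGroup(world_size)
    shard_len = len(weight) // world_size
    results = [None] * world_size

    def worker(rank):
        shard = weight[rank * shard_len:(rank + 1) * shard_len]
        results[rank] = model_forward(x, shard, rank, group)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


x, weight = [1, 2, 3, 4], [5, 6, 7, 8]
for world_size in (1, 2, 4):
    print(world_size, execute(x, weight, world_size))
```

Every rank ends up with the same full result regardless of world size, which is what lets the layers above the executor stay oblivious to how many workers exist.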

Rapid-fire

  • Row-parallel combine? all-reduce (sum). Column-parallel combine? all-gather (concat).
  • All-reduces per transformer block under TP? ~2 (one each for attention and MLP), each following the col→row pattern.
  • Collective library? NCCL. Group bookkeeping? parallel_state.py.
  • Workers for TP=4? 4 processes, one per GPU.
  • EP shards? whole experts (all-to-all). CP shards? one sequence's context/KV.