Phase 10 — Exercises: Distributed Inference

Warm-up (explain)

  1. One line each: TP, PP, DP, EP, CP — what gets split?
  2. What's an all-reduce vs an all-gather? Which does row-parallel use?
  3. Why "TP within a node, PP across nodes"?
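
For question 2, a minimal sketch of the two collectives may help. It simulates ranks as plain Python lists instead of calling a real communication library, and the function names are illustrative only.

```python
def all_reduce_sum(per_rank):
    """Every rank ends up with the elementwise sum of all ranks' vectors."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def all_gather(per_rank):
    """Every rank ends up with the concatenation of all ranks' shards."""
    gathered = [value for shard in per_rank for value in shard]
    return [list(gathered) for _ in per_rank]


# Three simulated ranks, each holding a length-2 vector.
ranks = [[1, 2], [10, 20], [100, 200]]

reduced = all_reduce_sum(ranks)  # per-rank size unchanged: 2 elements
gathered = all_gather(ranks)     # per-rank size grows to world_size * 2

print(reduced[0])   # [111, 222]
print(gathered[0])  # [1, 2, 10, 20, 100, 200]
```

All-reduce keeps the tensor shape and combines values; all-gather keeps the values and grows the shape.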

Core (trace the code)

  4. In linear.py, what does ColumnParallelLinear shard vs RowParallelLinear? Where is the single all-reduce (linear.py:1392)?
  5. Why does the column→row pairing need only one all-reduce per transformer block?
  6. In MultiprocExecutor (multiproc_executor.py:102), how many worker processes are there for TP=4, and what does it broadcast each step?
  7. Why is the model code identical for TP=1 and TP=8?
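
The column→row pairing can be sketched on a toy MLP. This is a pure-Python simulation under stated assumptions, not the code in linear.py: it uses the math convention y = x·A (not the x@W.T layout), ReLU as the activation, and hypothetical helper names.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_cols(m, n):
    step = len(m[0]) // n
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(n)]


def split_rows(m, n):
    step = len(m) // n
    return [m[i * step:(i + 1) * step] for i in range(n)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1, -2, 3], [0, 4, -1]]                          # [batch=2, hidden=3]
A = [[1, 0, -1, 2], [2, 1, 0, -1], [0, -2, 1, 1]]     # up-projection   [3, 4]
B = [[1, 2, 0], [0, -1, 1], [3, 0, -2], [-1, 1, 1]]   # down-projection [4, 3]

reference = matmul(relu(matmul(x, A)), B)             # unsharded result

world_size = 2
A_shards = split_cols(A, world_size)  # column-parallel: split the output dim
B_shards = split_rows(B, world_size)  # row-parallel: split the input dim

# Each rank runs both matmuls locally. The activation is elementwise, so it
# applies to the column shard without any communication in between.
partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(A_shards, B_shards)]

# The only communication: sum the partial outputs (the all-reduce).
tp_output = partials[0]
for partial in partials[1:]:
    tp_output = add(tp_output, partial)

print(tp_output == reference)  # True
```

Because the column shard's output is exactly the input slice the row shard expects, no gather is needed between the two layers.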

Build (your lab)

  8. In lab-01, prove row_parallel reconstructs x@W.T for num_ranks=8. Why is summing partials the correct combine (not concatenation)?
  9. Add a qkv_parallel that column-shards a fused QKV weight; verify it equals the unsharded QKV.
  10. Count communications for a full transformer block (attention + MLP) under your TP impl. Is it 2 all-reduces? Why?
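
As a hint for the row-parallel proof, the identity can be written out as below. The notation is an assumption for illustration: x has shape [b, d_in], W has shape [d_out, d_in], and both are split into R chunks along d_in.

```latex
x = \begin{bmatrix} x_1 & x_2 & \cdots & x_R \end{bmatrix},
\qquad
W = \begin{bmatrix} W_1 & W_2 & \cdots & W_R \end{bmatrix},
\qquad
x_r \in \mathbb{R}^{b \times d_{\text{in}}/R},\;
W_r \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}/R}

x W^{\top} = \sum_{r=1}^{R} x_r W_r^{\top},
\qquad
x_r W_r^{\top} \in \mathbb{R}^{b \times d_{\text{out}}}
```

Every partial already has the full output shape, so the ranks hold R partial sums of the same entries rather than R disjoint slices. Concatenation is the right combine only when the weight is split along d_out, as in the column-parallel case.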

Design (staff-level)

  11. Serve a 70B model on 8×A100-80GB for (a) lowest latency, (b) highest throughput. Pick TP/PP/DP for each and justify with the communication patterns.
  12. You scale TP from 2 to 8 and throughput barely improves. Diagnose (communication-bound) and propose alternatives.
  13. For a 256-expert MoE on 16 GPUs, how would you combine EP (experts) with DP/TP (attention), and what's the main risk (load imbalance, all-to-all cost)?
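
For the 70B sizing question, a back-of-envelope calculator is a useful starting point. This sketch assumes 2 bytes per parameter (fp16/bf16), evenly sharded weights, and decimal GB. It ignores KV cache, activations, and runtime overhead, so the headroom figures are upper bounds. The function names are hypothetical.

```python
GPU_MEMORY_GB = 80  # A100-80GB
NUM_GPUS = 8


def weights_per_gpu_gb(params_billion, bytes_per_param, tp, pp):
    """Weight memory per GPU in GB. DP replicates weights, so it is absent here."""
    return params_billion * bytes_per_param / (tp * pp)


def describe_layouts(params_billion=70, bytes_per_param=2):
    """Return (tp, pp, dp, weights_gb, headroom_gb) for a few 8-GPU layouts."""
    rows = []
    for tp, pp in [(8, 1), (4, 2), (4, 1), (2, 1)]:
        dp = NUM_GPUS // (tp * pp)  # remaining GPUs become data-parallel replicas
        weights = weights_per_gpu_gb(params_billion, bytes_per_param, tp, pp)
        rows.append((tp, pp, dp, weights, GPU_MEMORY_GB - weights))
    return rows


for tp, pp, dp, weights, headroom in describe_layouts():
    print(f"TP={tp} PP={pp} DP={dp}: "
          f"weights {weights:.1f} GB/GPU, headroom {headroom:.1f} GB")
```

Under these assumptions TP=2 with DP=4 leaves only about 10 GB per GPU for KV cache, which is the kind of constraint the justification should weigh against the communication cost of a larger TP degree.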

Self-grading

Questions 4–7 (Core) and 11–13 (Design) are interview-grade. Could you draw the col→row TP pattern and the worker fan-out? If not, re-read 01-deep-dive.md.