Phase 10 — Exercises: Distributed Inference
Warm-up (explain)
- One line each: TP, PP, DP, EP, CP — what gets split?
- What's an all-reduce vs an all-gather? Which does row-parallel use?
- Why "TP within a node, PP across nodes"?
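As a reference point for the collectives question above, here is a minimal single-process sketch of what each operation leaves on every rank. The helper names `all_reduce_sum` and `all_gather` are illustrative only; they are plain Python stand-ins for the real collectives, with each "rank" modelled as one list.

```python
def all_reduce_sum(per_rank):
    """Every rank ends up with the elementwise sum of all ranks' vectors."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def all_gather(per_rank):
    """Every rank ends up with the concatenation of all ranks' vectors."""
    gathered = [v for shard in per_rank for v in shard]
    return [list(gathered) for _ in per_rank]


# Three simulated ranks, each holding a length-2 vector.
shards = [[1, 2], [10, 20], [100, 200]]

reduced = all_reduce_sum(shards)   # output size == input size per rank
gathered = all_gather(shards)      # output size == num_ranks * input size

print(reduced[0])   # [111, 222]
print(gathered[0])  # [1, 2, 10, 20, 100, 200]
```

The thing to notice is the shape: a sum-reduce keeps the per-rank size, while a gather multiplies it by the number of ranks.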
Core (trace the code)
- In `linear.py`, what does `ColumnParallelLinear` shard vs `RowParallelLinear`? Where's the one all-reduce (:1392)?
- Why does each column→row pairing (attention or MLP) need only one all-reduce?
- In `MultiprocExecutor` (`multiproc_executor.py:102`), how many worker processes for TP=4, and what does it broadcast each step?
- Why is the model code identical for TP=1 and TP=8?
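To make the column→row pairing concrete before tracing the real code, here is a toy two-layer MLP split across simulated ranks. It is a sketch under assumptions, not the code in `linear.py`: weights are written in `X @ A` form with integer entries so the comparison is exact, ReLU stands in for the activation, and all helper names are made up for this example.

```python
def matmul(a, b):
    """Plain-Python matrix product of a [m, k] and b [k, n]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


X = [[1, -2, 3], [0, 4, -1]]                        # [tokens=2, hidden=3]
A = [[1, 0, -1, 2], [2, 1, 0, -1], [-1, 3, 1, 0]]   # [hidden=3, ffn=4]
B = [[1, 2], [0, -1], [3, 1], [-2, 0]]              # [ffn=4, out=2]

reference = matmul(relu(matmul(X, A)), B)           # unsharded result

num_ranks = 2
shard = len(A[0]) // num_ranks
partials = []
for r in range(num_ranks):
    lo, hi = r * shard, (r + 1) * shard
    A_r = [row[lo:hi] for row in A]   # column shard of the first weight
    B_r = B[lo:hi]                    # matching row shard of the second weight
    h_r = relu(matmul(X, A_r))        # purely local: no communication needed
    partials.append(matmul(h_r, B_r)) # full-shape output, but only a partial sum

# The single communication step: sum the partial outputs (an all-reduce).
out = partials[0]
for p in partials[1:]:
    out = add(out, p)

assert out == reference
```

The elementwise activation sits between the two matmuls without any communication because the column shard gives each rank complete columns of the intermediate tensor.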
Build (your lab)
- In lab-01, prove `row_parallel` reconstructs `x @ W.T` for `num_ranks=8`. Why is summing partials the correct combine (not concatenation)?
- Add a `qkv_parallel` that column-shards a fused QKV weight; verify it equals the unsharded QKV.
- Count communications for a full transformer block (attention + MLP) under your TP impl. Is it 2 all-reduces? Why?
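One possible shape for the `qkv_parallel` exercise is sketched below. It assumes weights are stored as `[out, in]` and applied as `x @ W.T`, and that the fused weight is laid out as Q rows, then K rows, then V rows; lab-01's actual helpers and layout may differ, and `matmul_t` is a hypothetical helper written for this example. The point it illustrates is that each rank takes its slice of Q, of K, and of V separately, since slicing the fused matrix into contiguous chunks would hand one rank only Q rows.

```python
def matmul_t(x, w):
    """Compute x @ w.T for x: [tokens, in] and w: [out, in]."""
    return [[sum(a * b for a, b in zip(row, wrow)) for wrow in w] for row in x]


def qkv_parallel(x, wq, wk, wv, num_ranks):
    """Hypothetical helper: rank r holds its row slice of Q, K and V, fused locally."""
    per_rank = []
    for r in range(num_ranks):
        local_w = []
        for w in (wq, wk, wv):
            n = len(w) // num_ranks
            local_w.extend(w[r * n:(r + 1) * n])
        per_rank.append(matmul_t(x, local_w))   # local output: [q_r | k_r | v_r]
    return per_rank


x = [[1, 2], [3, -1]]                       # [tokens=2, in=2]
wq = [[1, 0], [0, 1], [1, 1], [2, -1]]      # 4 output rows
wk = [[1, 2], [0, 3]]                       # 2 output rows
wv = [[-1, 0], [2, 2]]                      # 2 output rows

num_ranks = 2
full = matmul_t(x, wq + wk + wv)            # unsharded fused QKV output
outs = qkv_parallel(x, wq, wk, wv, num_ranks)

# Reassemble for verification only: gather Q shards, then K shards, then V shards.
rebuilt = []
for t in range(len(x)):
    row, offset = [], 0
    for size in (len(wq), len(wk), len(wv)):
        n = size // num_ranks
        for r in range(num_ranks):
            row.extend(outs[r][t][offset:offset + n])
        offset += n
    rebuilt.append(row)

assert rebuilt == full
```

The gather here exists only to check equality against the unsharded result; in a tensor-parallel attention layer each rank would typically keep its own heads local.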
Design (staff-level)
- Serve a 70B model on 8×A100-80GB for (a) lowest latency, (b) highest throughput. Pick TP/PP/DP for each and justify with the communication patterns.
- You scale TP from 2 to 8 and throughput barely improves. Diagnose (communication-bound) and propose alternatives.
- For a 256-expert MoE on 16 GPUs, how would you combine EP (experts) with DP/TP (attention), and what's the main risk (load imbalance, all-to-all cost)?
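For the 70B sizing question, a quick back-of-the-envelope check of weight memory per GPU can narrow the candidate layouts before arguing about communication. This sketch assumes 2 bytes per parameter (fp16/bf16), 1 GB = 1e9 bytes, and weights split evenly across TP × PP; it ignores KV cache, activations, and runtime overhead, so the headroom figures are upper bounds. The `plan` helper is hypothetical.

```python
PARAMS = 70e9          # 70B parameters
BYTES_PER_PARAM = 2    # assumption: fp16/bf16 weights, no quantization
GPU_MEM_GB = 80        # A100-80GB
NUM_GPUS = 8


def plan(tp, pp):
    """Weight footprint per GPU and replica count for a TP x PP layout."""
    gpus_per_replica = tp * pp
    weights_gb = PARAMS * BYTES_PER_PARAM / gpus_per_replica / 1e9
    return {
        "tp": tp,
        "pp": pp,
        "dp": NUM_GPUS // gpus_per_replica,   # replicas that fit on 8 GPUs
        "weights_gb": weights_gb,
        "headroom_gb": GPU_MEM_GB - weights_gb,
        "fits": weights_gb < GPU_MEM_GB,
    }


plans = [plan(tp, pp) for tp, pp in [(8, 1), (4, 2), (4, 1), (2, 1), (1, 1)]]
for p in plans:
    print(
        f"TP={p['tp']} PP={p['pp']} DP={p['dp']}: "
        f"{p['weights_gb']:.1f} GB weights/GPU, "
        f"{p['headroom_gb']:.1f} GB left, fits={p['fits']}"
    )
```

Under these assumptions a single GPU cannot hold the weights, and TP=2 leaves roughly 10 GB per GPU for everything else, which is worth keeping in mind when weighing replica count against KV-cache capacity.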
Self-grading
Counting the exercises above from 1, exercises 4–7 (Core) and 11–13 (Design) are interview-grade. Could you draw the col→row TP pattern and the worker fan-out? If not, re-read 01-deep-dive.md.