Phase 10 — Exercises: Distributed Inference

Contents

Warm-up (explain)
Core (trace the code)
Build (your lab)
Design (staff-level)
Self-grading

Warm-up (explain)

1. In one line each, say what gets split under TP, PP, DP, EP, and CP.
2. What is the difference between an all-reduce and an all-gather? Which one does row-parallel use?
3. Why is the rule of thumb "TP within a node, PP across nodes"?
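
To check your answer on the two collectives, a minimal single-process sketch may help. It simulates ranks as plain Python lists rather than calling a real communication library; the helper names all_reduce and all_gather are illustrative and are not taken from the lab code.

```python
def all_reduce(per_rank):
    """Every rank ends up with the elementwise sum of all ranks' vectors."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def all_gather(per_rank):
    """Every rank ends up with the concatenation of all ranks' vectors."""
    gathered = [v for shard in per_rank for v in shard]
    return [list(gathered) for _ in per_rank]


# Three simulated ranks, each holding a length-2 vector.
ranks = [[1, 2], [10, 20], [100, 200]]

reduced = all_reduce(ranks)    # output has the same length as each input
gathered = all_gather(ranks)   # output is num_ranks times longer

print(reduced[0])   # [111, 222]
print(gathered[0])  # [1, 2, 10, 20, 100, 200]
```

Note the shapes: all-reduce keeps the per-rank size and combines values, while all-gather keeps the values and grows the size. Which of the two matches a layer whose ranks each hold a partial sum is left as the exercise.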

Core (trace the code)

4. In linear.py, what does ColumnParallelLinear shard, and what does RowParallelLinear shard? Where is the one all-reduce (linear.py:1392)?
5. Why does each column→row pair (attention, MLP) need only one all-reduce?
6. In MultiprocExecutor (multiproc_executor.py:102), how many worker processes are there for TP=4, and what does the executor broadcast to them each step?
7. Why is the model code identical for TP=1 and TP=8?
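
Before tracing the real classes, it may help to see the column→row pairing in a toy setting. The sketch below is not the linear.py implementation: it simulates ranks with a loop in one process, uses small integer matrices stored as lists, and assumes weights are laid out as [out, in] so that a layer computes x @ W.T. All function names are made up for illustration.

```python
def matmul_t(x, w):
    """Return x @ w.T for x: [n, in] and w: [out, in], as lists of lists."""
    return [[sum(a * b for a, b in zip(xrow, wrow)) for wrow in w] for xrow in x]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def mlp_reference(x, w1, w2):
    """Unsharded two-layer MLP: relu(x @ w1.T) @ w2.T."""
    return matmul_t(relu(matmul_t(x, w1)), w2)


def mlp_tensor_parallel(x, w1, w2, num_ranks):
    """Column-parallel first layer, row-parallel second layer, one final sum."""
    k = len(w1) // num_ranks  # hidden features owned by each rank
    partials = []
    for r in range(num_ranks):
        w1_r = w1[r * k:(r + 1) * k]                   # shard output features of layer 1
        w2_r = [row[r * k:(r + 1) * k] for row in w2]  # shard input features of layer 2
        h_r = relu(matmul_t(x, w1_r))                  # elementwise, so it stays local
        partials.append(matmul_t(h_r, w2_r))           # full-shape partial sum
    # The only communication step: an elementwise sum over ranks (the all-reduce).
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, -2], [3, 1]]
w1 = [[1, 0], [0, 1], [-1, 2], [2, -1]]   # [hidden=4, d=2]
w2 = [[1, 2, 3, 4], [0, -1, 1, 2]]        # [d=2, hidden=4]

print(mlp_reference(x, w1, w2))            # [[17, 8], [25, 9]]
print(mlp_tensor_parallel(x, w1, w2, 2))   # [[17, 8], [25, 9]]
```

The point to notice is that nothing is exchanged between the two layers: each rank feeds its own slice of the hidden activations straight into its own slice of the second weight.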

Build (your lab)

8. In lab-01, show that row_parallel reconstructs x @ W.T for num_ranks=8. Why is summing the partial outputs the correct combine, rather than concatenating them?
9. Add a qkv_parallel that column-shards a fused QKV weight, and verify that its output equals the unsharded QKV projection.
10. Count the communication ops for a full transformer block (attention + MLP) under your TP implementation. Is it 2 all-reduces? Why?
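
One detail that is easy to miss in the fused-QKV exercise: if the weight is stacked as [Wq; Wk; Wv], slicing it into contiguous chunks gives the first rank mostly Q and the last rank mostly V. A possible shape for the check is sketched below. It is not the lab-01 solution: the signatures are hypothetical, Q, K and V are assumed to have the same width d, and the concatenation at the end exists only to compare against the unsharded result.

```python
def matmul_t(x, w):
    """Return x @ w.T for x: [n, in] and w: [out, in], as lists of lists."""
    return [[sum(a * b for a, b in zip(xrow, wrow)) for wrow in w] for xrow in x]


def shard_fused_qkv(w_qkv, d, num_ranks):
    """Give each rank its own slice of Q, of K and of V (not a contiguous chunk)."""
    k = d // num_ranks
    blocks = (w_qkv[:d], w_qkv[d:2 * d], w_qkv[2 * d:])  # Wq, Wk, Wv
    return [
        [row for block in blocks for row in block[r * k:(r + 1) * k]]
        for r in range(num_ranks)
    ]


def qkv_unsharded(x, w_qkv, d):
    out = matmul_t(x, w_qkv)
    return ([row[:d] for row in out],
            [row[d:2 * d] for row in out],
            [row[2 * d:] for row in out])


def qkv_parallel(x, w_qkv, d, num_ranks):
    """Run each rank's shard, then stitch q, k, v back together for checking."""
    k = d // num_ranks
    q = [[] for _ in x]
    key = [[] for _ in x]
    v = [[] for _ in x]
    for shard in shard_fused_qkv(w_qkv, d, num_ranks):
        out = matmul_t(x, shard)  # [n, 3k], laid out as [q_r | k_r | v_r]
        for i, row in enumerate(out):
            q[i].extend(row[:k])
            key[i].extend(row[k:2 * k])
            v[i].extend(row[2 * k:])
    return q, key, v


d = 4
w_qkv = [[r, 1 - r] for r in range(3 * d)]  # toy fused weight, [3d, in=2]
x = [[1, 2], [3, -1]]

assert qkv_parallel(x, w_qkv, d, 2) == qkv_unsharded(x, w_qkv, d)
# A naive contiguous split puts different rows on rank 0 than the per-block split.
assert w_qkv[:3 * d // 2] != shard_fused_qkv(w_qkv, d, 2)[0]
```

In a real tensor-parallel attention layer the per-rank q, k and v would normally stay on their rank and feed that rank's heads directly, so the stitching step here should not be counted as communication when you answer the last question.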

Design (staff-level)

11. Serve a 70B model on 8×A100-80GB for (a) lowest latency and (b) highest throughput. Pick TP/PP/DP degrees for each and justify them with the communication patterns.
12. You scale TP from 2 to 8 and throughput barely improves. Diagnose the cause (hint: communication-bound) and propose alternatives.
13. For a 256-expert MoE on 16 GPUs, how would you combine EP (for the experts) with DP/TP (for attention), and what are the main risks (load imbalance, all-to-all cost)?
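
For the 70B sizing question, a quick feasibility pass narrows the options before you argue about communication. The sketch below is back-of-the-envelope only. It assumes 2 bytes per parameter (FP16 or BF16 weights), counts 1 GB as 10^9 bytes, and ignores KV cache, activations and runtime overhead, all of which need headroom in practice.

```python
def weight_gb_per_gpu(params_billion, bytes_per_param, tp, pp):
    """Weights are split across TP and PP ranks; each DP replica holds a full copy."""
    return params_billion * bytes_per_param / (tp * pp)


def layouts(num_gpus):
    """All (tp, pp, dp) triples whose product is num_gpus."""
    degrees = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    return [
        (tp, pp, dp)
        for tp in degrees
        for pp in degrees
        for dp in degrees
        if tp * pp * dp == num_gpus
    ]


GPU_MEMORY_GB = 80  # A100-80GB

for tp, pp, dp in layouts(8):
    gb = weight_gb_per_gpu(70, 2, tp, pp)
    verdict = "fits" if gb < GPU_MEMORY_GB else "does not fit"
    print(f"TP={tp} PP={pp} DP={dp}: {gb:6.1f} GB of weights per GPU ({verdict})")
```

Under these assumptions the weights total 140 GB, so any layout with TP×PP = 1 is ruled out, and TP×PP = 2 leaves only about 10 GB per GPU for everything else. The remaining layouts are the ones worth comparing on latency and throughput.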

Self-grading

Questions 4–7 (Core) and 11–13 (Design) are interview-grade. Could you draw the column→row TP pattern and the worker fan-out? If not, re-read 01-deep-dive.md.