Phase 07 — Exercises: GEMM & MoE Kernels
Warm-up (explain)
- What is a GEMM and why is "most of a transformer is GEMMs" true?
- In one sentence each: router, top-k, expert, combine.
- Why does MoE give "huge capacity, cheap compute"? What's actually cheap?
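
As a concrete anchor for the four terms above, here is a minimal NumPy sketch of an unoptimized MoE forward. It is an illustration under stated assumptions, not the lab's or any library's implementation: the name `moe_forward_reference`, the single-matrix "expert", and the renormalised top-k weights are all hypothetical choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward_reference(x, w_router, w_experts, k=2):
    """Hypothetical reference. x: (T, D), w_router: (D, E), w_experts: (E, D, D)."""
    # Router: one small GEMM that scores every (token, expert) pair.
    probs = softmax(x @ w_router)                        # (T, E)
    # Top-k: keep only the k best experts per token.
    topk_ids = np.argsort(-probs, axis=-1)[:, :k]        # (T, k)
    topk_w = np.take_along_axis(probs, topk_ids, axis=-1)
    topk_w = topk_w / topk_w.sum(axis=-1, keepdims=True)  # assumed renormalisation
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            e = topk_ids[t, j]
            # Expert: a single linear layer here; real experts are usually MLPs.
            # Combine: weighted sum of the k expert outputs for this token.
            out[t] += topk_w[t, j] * (x[t] @ w_experts[e])
    return out, topk_ids, topk_w

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
w_router = rng.standard_normal((4, 8))
w_experts = rng.standard_normal((8, 4, 4))
out, topk_ids, topk_w = moe_forward_reference(x, w_router, w_experts, k=2)
```

With 8 experts and `k=2`, every token touches only 2 of the 8 expert matrices, yet all 8 must be stored. That gap between stored parameters and per-token FLOPs is what the third warm-up question is probing.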
Core (trace the code)
- List the 6 steps of an MoE forward and which fused_moe/ file implements each.
- Why do permute + un-permute exist (moe_align_block_size.py)? What goes wrong without them?
- In MixtralMoE (mixtral.py:77), how few lines is the MoE block once FusedMoE exists, and why?
- What does a fused MoE kernel fuse, and which Phase-5 problem does that address?
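
To make the permute / un-permute question tangible, the sketch below sorts (token, slot) pairs by expert so that each expert's rows are contiguous, then builds the inverse permutation that restores the original order. It is a simplified NumPy illustration, not the code in moe_align_block_size.py; the function name `permute_by_expert` and the round-up-to-`block_size` padding rule are assumptions based on what the file name suggests.

```python
import numpy as np

def permute_by_expert(topk_ids, num_experts, block_size):
    """Hypothetical helper: group (token, slot) rows by expert id."""
    flat = topk_ids.reshape(-1)                     # expert id per (token, slot)
    order = np.argsort(flat, kind="stable")         # permutation: rows grouped by expert
    counts = np.bincount(flat, minlength=num_experts)
    # Assumed padding rule: round each expert's row count up to a multiple
    # of block_size so every GEMM tile belongs to exactly one expert.
    padded = -(-counts // block_size) * block_size
    return order, counts, padded

topk_ids = np.array([[2, 0], [1, 2], [0, 2]])       # 3 tokens, top-2
order, counts, padded = permute_by_expert(topk_ids, num_experts=3, block_size=4)

# Un-permute: inverse[order[i]] = i, so gathering with `inverse` undoes `order`.
inverse = np.empty_like(order)
inverse[order] = np.arange(order.size)

flat = topk_ids.reshape(-1)
grouped = flat[order]          # expert ids, now contiguous per expert
restored = grouped[inverse]    # back in original (token, slot) order
```

Here `order` is `[1, 4, 2, 0, 3, 5]` and `counts` is `[2, 1, 3]`. Without the grouping step, each expert's rows would be scattered through the batch and could not be fed to one dense GEMM per expert.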
Build (your lab)
- In lab-01, why must moe_forward_grouped use np.add.at (not out[toks] += ...)? (Hint: a token can route to two experts.)
- Add expert load metrics: count tokens per expert; construct an input that overloads one expert and explain the throughput impact (load imbalance).
- Add a shared expert (always-on, added to every token, DeepSeek-style) and keep grouped == reference.
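
Two of the lab tasks above lean on NumPy behavior that is easy to check in isolation. The sketch below is independent of lab-01 (the array names and the max/mean imbalance ratio are illustrative assumptions): it shows what happens when an index array contains the same token twice, and one simple way to count tokens per expert.

```python
import numpy as np

# A scatter where token 1 appears twice (e.g. it routed to two experts).
toks = np.array([0, 1, 1, 2])
contrib = np.ones((4, 3))

buffered = np.zeros((3, 3))
buffered[toks] += contrib            # duplicate index: one contribution is lost

unbuffered = np.zeros((3, 3))
np.add.at(unbuffered, toks, contrib)  # accumulates every occurrence

print(buffered[1])    # [1. 1. 1.]
print(unbuffered[1])  # [2. 2. 2.]

# Expert load metric: tokens assigned to each expert.
num_experts = 4
topk_ids = np.array([[0, 1], [0, 2], [0, 3], [0, 1]])   # expert 0 is "hot"
load = np.bincount(topk_ids.ravel(), minlength=num_experts)
imbalance = load.max() / load.mean()   # 1.0 would be perfectly balanced

print(load.tolist())  # [4, 2, 1, 1]
print(imbalance)      # 2.0
```

A ratio well above 1 suggests that the step time is set by the busiest expert while the others sit partly idle, which is the throughput effect the load-metrics task asks you to explain.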
Design (staff-level)
- EP vs TP for a 256-expert MoE on 8 GPUs: what does each shard, what comms does each add, and when would you combine them?
- Your MoE serving is throughput-bound, and a profile shows the grouped GEMM taking 45% of the time but with low tensor-core utilization. Name two likely causes and fixes.
- Expert load is skewed (a few hot experts). What mitigations exist (capacity, aux loss at train time, routing tweaks), and which are available at serving time?
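
For the EP vs TP question, it can help to write down the per-GPU weight shapes before reasoning about communication. The sketch below does only that arithmetic. The `hidden` and `intermediate` sizes are hypothetical, only one weight matrix per expert is counted, and the comments on communication describe the patterns commonly paired with each scheme rather than any specific framework.

```python
def shard_shapes(num_experts, hidden, intermediate, num_gpus):
    """Hypothetical helper: per-GPU view of one expert weight (hidden x intermediate)."""
    # Expert parallelism: each GPU holds whole experts, but only a subset.
    # Tokens usually have to be exchanged between GPUs (all-to-all) to
    # reach their expert and to come back.
    ep = {
        "experts_per_gpu": num_experts // num_gpus,
        "w_shape": (hidden, intermediate),
    }
    # Tensor parallelism: each GPU holds a slice of every expert.
    # Partial outputs usually have to be summed across GPUs (all-reduce).
    tp = {
        "experts_per_gpu": num_experts,
        "w_shape": (hidden, intermediate // num_gpus),
    }
    return ep, tp

def params_per_gpu(cfg):
    rows, cols = cfg["w_shape"]
    return cfg["experts_per_gpu"] * rows * cols

# 256 experts on 8 GPUs, with assumed sizes hidden=4096, intermediate=2048.
ep, tp = shard_shapes(num_experts=256, hidden=4096, intermediate=2048, num_gpus=8)
print(ep["experts_per_gpu"], ep["w_shape"])   # 32 (4096, 2048)
print(tp["experts_per_gpu"], tp["w_shape"])   # 256 (4096, 256)
print(params_per_gpu(ep) == params_per_gpu(tp))  # True
```

Under these assumptions both schemes store the same number of parameters per GPU, so the choice turns on GEMM shape (32 full-width GEMMs versus 256 narrow ones) and on which communication pattern the interconnect handles better.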
Self-grading
Counting the exercises in order from the top, questions 4–7 (Core) and 11–13 (Design) are interview-grade. Could you draw the MoE forward and name the files? If not, re-read 01-deep-dive.md.