Phase 07 — Exercises: GEMM & MoE Kernels

Contents

Warm-up (explain)
Core (trace the code)
Build (your lab)
Design (staff-level)
Self-grading

Warm-up (explain)

1. What is a GEMM and why is "most of a transformer is GEMMs" true?
2. In one sentence each: router, top-k, expert, combine.
3. Why does MoE give "huge capacity, cheap compute"? What's actually cheap?
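
As a concrete anchor for the four terms above, here is a minimal NumPy sketch of a reference MoE forward. It is an illustration under stated assumptions, not the lab's or vLLM's code: each expert is a single hypothetical (D, D) matrix rather than a full MLP, and the routing weights are a softmax over the selected logits only, which is one common convention among several.

```python
import numpy as np

def moe_forward_reference(x, w_router, experts, top_k):
    """x: (T, D) tokens; w_router: (D, E); experts: list of E (D, D) matrices (hypothetical)."""
    logits = x @ w_router                                  # router: one score per (token, expert)
    topk_ids = np.argsort(-logits, axis=1)[:, :top_k]      # top-k: experts chosen per token
    topk_logits = np.take_along_axis(logits, topk_ids, axis=1)
    # Assumption: softmax over the selected logits only.
    w = np.exp(topk_logits - topk_logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(top_k):
            e = topk_ids[t, j]
            out[t] += w[t, j] * (x[t] @ experts[e])        # expert GEMM, then weighted combine
    return out, topk_ids, w

rng = np.random.default_rng(0)
T, D, E, K = 6, 8, 4, 2
x = rng.standard_normal((T, D))
w_router = rng.standard_normal((D, E))
experts = [rng.standard_normal((D, D)) for _ in range(E)]
out, topk_ids, w = moe_forward_reference(x, w_router, experts, K)

# Capacity vs. compute: all E experts hold weights, but each token touches only K of them.
total_params = E * D * D
active_params = K * D * D
```

With these toy sizes, `active_params` is half of `total_params`; the ratio K/E is what shrinks as the expert count grows while per-token compute stays fixed.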

Core (trace the code)

4. List the 6 steps of an MoE forward and which fused_moe/ file implements each.
5. Why do the permute and un-permute steps exist (moe_align_block_size.py)? What goes wrong without them?
6. In MixtralMoE (mixtral.py:77), how few lines does the MoE block take once FusedMoE exists, and why so few?
7. What does a fused MoE kernel fuse, and which Phase-5 problem does that address?
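
The permute and un-permute idea can be sketched in NumPy as follows. This is a simplified analogue, not the logic in moe_align_block_size.py: it sorts (token, expert) pairs by expert so each expert sees one contiguous slab of rows, runs one GEMM per expert, then scatters results back to token order. The function names and the one-matrix experts are hypothetical.

```python
import numpy as np

def moe_reference(x, topk_ids, topk_w, experts):
    """Per-token loop: correct but issues T*K tiny matmuls."""
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(topk_ids.shape[1]):
            out[t] += topk_w[t, j] * (x[t] @ experts[topk_ids[t, j]])
    return out

def moe_grouped(x, topk_ids, topk_w, experts):
    """Permute pairs by expert, one GEMM per expert, un-permute with a scatter-add."""
    T, K = topk_ids.shape
    flat_expert = topk_ids.reshape(-1)               # expert of each (token, slot) pair
    flat_token = np.repeat(np.arange(T), K)          # token of each pair
    flat_w = topk_w.reshape(-1)
    order = np.argsort(flat_expert, kind="stable")   # permute: group pairs by expert
    sorted_token = flat_token[order]
    sorted_w = flat_w[order]
    xs = x[sorted_token]                             # gathered rows, contiguous per expert
    ys = np.empty_like(xs)
    counts = np.bincount(flat_expert, minlength=len(experts))
    start = 0
    for e, n in enumerate(counts):
        ys[start:start + n] = xs[start:start + n] @ experts[e]   # one GEMM per expert
        start += n
    out = np.zeros_like(x)
    # Un-permute and combine; a token appears once per selected expert, so indices repeat.
    np.add.at(out, sorted_token, sorted_w[:, None] * ys)
    return out

rng = np.random.default_rng(0)
T, D, E, K = 5, 4, 3, 2
x = rng.standard_normal((T, D))
experts = [rng.standard_normal((D, D)) for _ in range(E)]
topk_ids = np.array([rng.choice(E, size=K, replace=False) for _ in range(T)])
topk_w = rng.random((T, K))
out_ref = moe_reference(x, topk_ids, topk_w, experts)
out_grouped = moe_grouped(x, topk_ids, topk_w, experts)
```

Without the sort, rows for one expert are scattered through the batch, so there is no contiguous matrix to hand to a GEMM; the grouping is what turns many vector-matrix products into a few matrix-matrix ones.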

Build (your lab)

8. In lab-01, why must moe_forward_grouped use np.add.at (not out[toks] += ...)? (Hint: a token can route to two experts.)
9. Add expert load metrics: count tokens per expert; construct an input that overloads one expert and explain the throughput impact (load imbalance).
10. Add a shared expert (always-on, added to every token, DeepSeek-style) and keep the grouped output equal to the reference.
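
A small standalone demonstration of the hint about repeated indices, plus one possible way to count expert load. The arrays here are made-up toy data, and `imbalance` (max load over mean load) is just one illustrative metric, not one the lab prescribes.

```python
import numpy as np

# Token 1 appears twice, as it would if it routed to two experts.
toks = np.array([0, 1, 1, 2])
contrib = np.ones((4, 3))

buffered = np.zeros((3, 3))
buffered[toks] += contrib            # fancy-index +=: repeated index is applied only once

unbuffered = np.zeros((3, 3))
np.add.at(unbuffered, toks, contrib) # unbuffered: every occurrence accumulates

print(buffered[1])    # [1. 1. 1.]
print(unbuffered[1])  # [2. 2. 2.]

# Toy routing table: 4 tokens, top-2, expert 0 is selected by every token.
expert_ids = np.array([[0, 1], [0, 2], [0, 3], [0, 1]])
load = np.bincount(expert_ids.reshape(-1), minlength=4)   # tokens per expert
imbalance = load.max() / load.mean()                      # 1.0 means perfectly even
print(load, imbalance)  # [4 2 1 1] 2.0
```

The `+=` form reads the rows, adds, and writes back once per index position, so the second write to row 1 overwrites the first instead of accumulating.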

Design (staff-level)

11. EP vs TP for a 256-expert MoE on 8 GPUs: what does each shard, what comms does each add, and when would you combine them?
12. Your MoE serving is throughput-bound, and a profile shows the grouped GEMM at 45% of runtime but with low tensor-core utilization. Name two likely causes and fixes.
13. Expert load is skewed (a few hot experts). What mitigations exist (capacity, aux loss at train time, routing tweaks), and which are available at serving time?
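
When reasoning about utilization and skew, some back-of-envelope arithmetic can help. The sketch below assumes a block-aligned grouped GEMM in which each expert's rows are padded up to a multiple of a tile height `block_m`; that padding model, the function names, and the token counts are illustrative assumptions, not measurements of any real kernel.

```python
import math

def padded_rows(tokens_per_expert, block_m):
    """Rows the kernel processes if each expert is padded to a multiple of block_m."""
    return sum(math.ceil(n / block_m) * block_m for n in tokens_per_expert)

def useful_fraction(tokens_per_expert, block_m):
    """Share of processed rows that carry real tokens (1.0 means no padding waste)."""
    total = sum(tokens_per_expert)
    padded = padded_rows(tokens_per_expert, block_m)
    return total / padded if padded else 0.0

decode_like = [2] * 256      # few tokens per expert, e.g. a small decode batch
prefill_like = [128] * 256   # many tokens per expert

print(useful_fraction(decode_like, 64))    # 0.03125
print(useful_fraction(decode_like, 16))    # 0.125
print(useful_fraction(prefill_like, 64))   # 1.0

# Expert parallelism on 8 GPUs: experts held per GPU with an even split.
experts_per_gpu = 256 // 8
print(experts_per_gpu)                     # 32
```

Under this model, a small batch spread over many experts leaves most of each tile empty, which is one way a GEMM can dominate the profile while tensor cores stay underused; a smaller tile or a larger batch both raise the useful fraction.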

Self-grading

Questions 4–7 and 11–13 are interview-grade. Could you draw the MoE forward pass and name the files? If not, re-read 01-deep-dive.md.