Phase 07 — Interview Questions: GEMM & MoE Kernels
Q1. What is MoE and why is it attractive?
Model answer
A Mixture-of-Experts layer replaces the dense MLP with many expert MLPs and a router that sends each token to its top-k experts (e.g. 2 of 8 in Mixtral). The model therefore has a very large total parameter count (capacity/quality) but activates only a few experts per token (cheap compute). DeepSeek-V3 has 256 routed experts with 8 active per token, plus one shared expert. The cost shifts from FLOPs to routing tokens to the right experts and holding all those weights in memory.
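The total-versus-active trade-off is simple arithmetic, sketched below. The function name and the per-expert parameter count are hypothetical values chosen only for illustration; only the 256-expert, 8-active configuration comes from the answer above.

```python
def moe_expert_params(num_experts: int, top_k: int, params_per_expert: int) -> tuple[int, int]:
    """Return (total, active-per-token) expert parameter counts for one MoE layer."""
    total = num_experts * params_per_expert   # weights that must sit in memory
    active = top_k * params_per_expert        # weights a single token actually uses
    return total, active


# params_per_expert is an assumed round number, not a real model's size.
total, active = moe_expert_params(num_experts=256, top_k=8, params_per_expert=1_000_000)
fraction = active / total  # 8 / 256
print(total, active, fraction)
```

Per token, compute scales with `top_k`, while memory footprint scales with `num_experts`.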
Q2. Walk through the MoE forward.
Model answer
router (small linear) → top-k expert selection + softmax weights → permute tokens so each expert's tokens are contiguous → grouped GEMM (run each expert's MLP on its block) → un-permute back to token order → weighted combine of each token's k expert outputs. Permute/un-permute exist so the GPU does a few big matmuls instead of many scattered tiny ones. (See fused_moe/fused_moe.py, moe_align_block_size.py, layer.py.)
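The six steps can be sketched in NumPy as below. This is a minimal illustration, not the code in the files named above. It assumes each expert is a single square weight matrix rather than a full MLP, and that softmax is taken over the selected top-k logits only (implementations differ on this). All function and variable names are hypothetical.

```python
import numpy as np


def moe_forward(x, w_router, w_experts, top_k):
    """x: (T, D), w_router: (D, E), w_experts: (E, D, D). Returns (T, D)."""
    num_tokens = x.shape[0]
    # 1. Router: a small linear layer producing one logit per expert.
    logits = x @ w_router                                      # (T, E)
    # 2. Top-k selection, then softmax over the selected logits (assumption).
    topk_ids = np.argsort(-logits, axis=1)[:, :top_k]          # (T, k)
    topk_logits = np.take_along_axis(logits, topk_ids, axis=1)
    topk_logits = topk_logits - topk_logits.max(axis=1, keepdims=True)
    weights = np.exp(topk_logits)
    weights /= weights.sum(axis=1, keepdims=True)              # (T, k)
    # 3. Permute: sort the T*k (token, expert) pairs by expert id.
    flat_expert = topk_ids.reshape(-1)
    flat_token = np.repeat(np.arange(num_tokens), top_k)
    order = np.argsort(flat_expert, kind="stable")
    permuted_x = x[flat_token[order]]                          # (T*k, D)
    # 4. Grouped GEMM: one matmul per expert over its contiguous block.
    counts = np.bincount(flat_expert, minlength=w_experts.shape[0])
    permuted_out = np.empty_like(permuted_x)
    start = 0
    for e, n in enumerate(counts):
        permuted_out[start:start + n] = permuted_x[start:start + n] @ w_experts[e]
        start += n
    # 5. Un-permute back to (token, slot) order.
    out_pairs = np.empty_like(permuted_out)
    out_pairs[order] = permuted_out
    out_pairs = out_pairs.reshape(num_tokens, top_k, -1)
    # 6. Weighted combine of each token's k expert outputs.
    return (out_pairs * weights[:, :, None]).sum(axis=1)


def moe_reference(x, w_router, w_experts, top_k):
    """Naive per-token loop, used only to check the permuted version."""
    out = np.zeros_like(x)
    logits = x @ w_router
    for t in range(x.shape[0]):
        ids = np.argsort(-logits[t])[:top_k]
        sel = logits[t, ids]
        w = np.exp(sel - sel.max())
        w /= w.sum()
        for e, wt in zip(ids, w):
            out[t] += wt * (x[t] @ w_experts[e])
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
w_router = rng.standard_normal((8, 4))
w_experts = rng.standard_normal((4, 8, 8))
y = moe_forward(x, w_router, w_experts, top_k=2)
y_ref = moe_reference(x, w_router, w_experts, top_k=2)
```

The reference does `T * k` tiny vector-matrix products. The permuted version does at most one matmul per expert, which is the point of the permute step.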
Q3. Why fused MoE kernels?
Model answer
Naive MoE is a gather, many small per-expert GEMMs, and a scatter, which makes it launch-bound and memory-bound. A fused kernel does routing, grouped GEMM, and combine in one (or a few) kernels indexed by a sorted token→expert map, keeping tensor cores busy and removing launch overhead. This is decisive for MoE throughput (the profile in lab-02 shows the grouped GEMM dominating).
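One way to build a sorted token→expert map that a tiled kernel can index is to pad each expert's segment to a multiple of the tile size, so every tile belongs to exactly one expert. The pure-Python sketch below shows that idea under assumed names and conventions (`align_block_size`, the padding sentinel). It is not the actual contents of moe_align_block_size.py.

```python
def align_block_size(expert_ids, num_experts, block_size):
    """Group (token, slot) indices by expert and pad each group to a block multiple.

    expert_ids[i] is the expert chosen for flattened (token, slot) index i.
    Returns (sorted_ids, block_expert): sorted_ids holds indices grouped by
    expert, padded with a sentinel; block_expert[b] is the expert for block b.
    """
    pad = len(expert_ids)  # sentinel index meaning "padding, no real token"
    buckets = [[] for _ in range(num_experts)]
    for idx, e in enumerate(expert_ids):
        buckets[e].append(idx)

    sorted_ids, block_expert = [], []
    for e, bucket in enumerate(buckets):
        num_blocks = -(-len(bucket) // block_size)  # ceiling division
        padding = [pad] * (num_blocks * block_size - len(bucket))
        sorted_ids.extend(bucket + padding)
        block_expert.extend([e] * num_blocks)
    return sorted_ids, block_expert


sorted_ids, block_expert = align_block_size([1, 0, 1, 2, 1, 0], num_experts=3, block_size=4)
print(sorted_ids)    # [1, 5, 6, 6, 0, 2, 4, 6, 3, 6, 6, 6]
print(block_expert)  # [0, 1, 2]
```

A kernel working on block `b` can then load a single expert's weights (`block_expert[b]`) and skip rows whose index equals the sentinel.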
Q4. Expert parallelism vs tensor parallelism for MoE?
Model answer
EP places whole experts on different GPUs; tokens are shipped to their expert's GPU via all-to-all and back. It scales expert count cheaply but adds communication and load-balancing risk (a hot expert bottlenecks its GPU). TP shards each expert's weights across GPUs (per-layer all-reduce). Real deployments often use EP for the MoE layers and DP/TP for attention, since the two have different parallelism sweet spots.
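The load-balancing risk under EP can be made concrete by counting how many routed tokens land on each GPU. The sketch below assumes experts are placed in contiguous, equally sized groups per rank. The function name and the routing list are made up for illustration.

```python
def tokens_per_rank(expert_ids, num_experts, num_ranks):
    """Count routed tokens per GPU when experts are split contiguously across ranks."""
    experts_per_rank = num_experts // num_ranks  # assumes an even split
    load = [0] * num_ranks
    for e in expert_ids:
        load[e // experts_per_rank] += 1
    return load


# Hypothetical routing decisions: expert 0 is "hot".
routed = [0, 0, 0, 0, 0, 1, 2, 3, 5, 7]
load = tokens_per_rank(routed, num_experts=8, num_ranks=4)
imbalance = max(load) / (sum(load) / len(load))  # max load over mean load
print(load, imbalance)
```

Because each rank waits for the all-to-all and for the slowest rank's experts, step time tracks `max(load)` rather than the mean, so an imbalance well above 1 means most GPUs sit idle.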
Q5. What sets which GEMM kernel runs?
Model answer
The dtype/quant format (Phase 6) and the hardware. CUTLASS/TRTLLM-GEN/CuTeDSL provide kernels specialized per precision (fp16/fp8/int4) and tiled to the GPU's memory hierarchy; a quantized weight needs the matching kernel (e.g. Marlin for INT4, scaled-mm for FP8). A mismatched kernel either gives wrong results or runs slowly, which is why quant format and kernel are chosen together.
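The idea that format and kernel are chosen together can be pictured as a lookup that refuses unknown formats instead of falling back silently. The table below is purely hypothetical. The strings are labels taken from the answer above (plus an assumed "dense-gemm" label for fp16), not function names from any real library, and a real dispatcher would also key on the GPU architecture.

```python
# Hypothetical format-to-kernel table; the values are descriptive labels only.
KERNEL_FOR_FORMAT = {
    "fp16": "dense-gemm",  # assumed label for an unquantized GEMM
    "fp8": "scaled-mm",
    "int4": "marlin",
}


def select_kernel(quant_format: str) -> str:
    """Return the kernel label for a weight format, failing loudly on a mismatch."""
    try:
        return KERNEL_FOR_FORMAT[quant_format]
    except KeyError:
        raise ValueError(f"no GEMM kernel registered for format {quant_format!r}") from None


print(select_kernel("int4"))
```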
Rapid-fire
- MoE step order? router → top-k → permute → grouped GEMM → un-permute → combine.
- Why permute? contiguous per-expert work → big matmuls, not scattered tiny ones.
- EP shards? whole experts (all-to-all). TP shards? each expert's weights.
- Router cost? tiny; the experts (GEMM) dominate.
- Famous open MoEs? Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS.