Phase 07 — Cheatsheet: GEMM & MoE Kernels
The one-liner
GEMM = where the FLOPs are (every linear layer). MoE = many expert MLPs + a router; each token uses only its top-k experts → large parameter capacity at the per-token compute of just k experts. The work becomes a routed, grouped GEMM.
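The capacity-versus-compute trade can be checked with quick arithmetic. This is a minimal sketch: the helper `moe_param_counts` is hypothetical, the layer sizes are illustrative rather than taken from any specific model, and it assumes each expert is a gated MLP with three weight matrices. Router and attention weights are ignored.

```python
def moe_param_counts(num_experts, top_k, d_model, d_ff, mats_per_expert=3):
    """Return (stored, active) expert parameters for one MoE layer.

    Assumption: every expert holds `mats_per_expert` matrices of shape
    (d_model, d_ff), as in a gated MLP. Router weights are not counted.
    """
    per_expert = mats_per_expert * d_model * d_ff
    stored = num_experts * per_expert   # what must sit in GPU memory
    active = top_k * per_expert         # what a single token actually multiplies
    return stored, active


# Illustrative sizes only.
stored, active = moe_param_counts(num_experts=8, top_k=2, d_model=4096, d_ff=14336)
print(f"stored={stored / 1e9:.2f}B  active/token={active / 1e9:.2f}B  "
      f"ratio={stored / active:.0f}x")
```

Under these assumptions the ratio of stored to active parameters is simply num_experts / top_k, which is why capacity grows with the expert count while per-token FLOPs stay fixed.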
GEMM kernels
cuBLAS (baseline) · CUTLASS (quantized/composable) · TRTLLM-GEN / CuTeDSL (generated, per-GPU). Specialized per dtype (fp16/fp8/int4); the quant format (Phase 6) picks the kernel.
MoE forward (6 steps)
router (gate linear) → top-k experts + softmax weights → permute (group tokens by expert) → grouped GEMM (each expert on its block) → un-permute → combine (weighted sum). Permute/un-permute = make per-expert work contiguous (big matmuls, not scattered tiny ones).
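The six steps above can be sketched in NumPy. This is an unfused reference under stated assumptions, not the upstream kernel: each expert is reduced to a single linear layer (real experts are MLPs), the softmax is taken over the selected top-k logits only (models differ on this), and the names `moe_forward` and `moe_reference` are hypothetical.

```python
import numpy as np


def moe_forward(x, w_gate, w_experts, top_k):
    """x: (T, D), w_gate: (D, E), w_experts: (E, D, D_out) -> (T, D_out)."""
    num_tokens, num_experts = x.shape[0], w_gate.shape[1]

    # 1. Router: one logit per (token, expert).
    logits = x @ w_gate

    # 2. Top-k experts per token, softmax over the selected logits.
    topk_ids = np.argsort(-logits, axis=1)[:, :top_k]
    topk_logits = np.take_along_axis(logits, topk_ids, axis=1)
    weights = np.exp(topk_logits - topk_logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)

    # 3. Permute: sort the T*k (token, expert) pairs by expert id so that
    #    each expert's rows are contiguous.
    flat_expert = topk_ids.reshape(-1)
    order = np.argsort(flat_expert, kind="stable")
    x_perm = x[order // top_k]          # flat row i belongs to token i // top_k

    # 4. Grouped GEMM: one matmul per expert on its contiguous block.
    counts = np.bincount(flat_expert, minlength=num_experts)
    offsets = np.concatenate(([0], np.cumsum(counts)))
    y_perm = np.empty((x_perm.shape[0], w_experts.shape[2]), dtype=x.dtype)
    for e in range(num_experts):
        lo, hi = offsets[e], offsets[e + 1]
        y_perm[lo:hi] = x_perm[lo:hi] @ w_experts[e]

    # 5. Un-permute: scatter rows back to (token, slot) order.
    y_flat = np.empty_like(y_perm)
    y_flat[order] = y_perm
    y = y_flat.reshape(num_tokens, top_k, -1)

    # 6. Combine: weighted sum over each token's k expert outputs.
    return (weights[:, :, None] * y).sum(axis=1)


def moe_reference(x, w_gate, w_experts, top_k):
    """Naive per-token loop (scattered tiny matmuls), used only as a check."""
    logits = x @ w_gate
    out = np.zeros((x.shape[0], w_experts.shape[2]))
    for t in range(x.shape[0]):
        ids = np.argsort(-logits[t])[:top_k]
        w = np.exp(logits[t, ids] - logits[t, ids].max())
        w /= w.sum()
        for j, e in enumerate(ids):
            out[t] += w[j] * (x[t] @ w_experts[e])
    return out
```

The loop in step 4 is the part a grouped-GEMM kernel replaces with a single launch; the permutation exists only so that loop sees one dense block per expert.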
Fused MoE
One/few kernels do routing+grouped-GEMM+combine → removes gather/scatter launch + memory overhead. The experts (grouped GEMM) dominate step time; the router is cheap.
Parallelism
- EP (expert parallel): whole experts on different GPUs; all-to-all ships tokens; watch load balance (hot experts).
- TP: shard each expert's weights across GPUs (per-layer all-reduce).
- Often EP for MoE + DP/TP for attention.
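To make the EP load-balance concern concrete, here is a small sketch that counts how many (token, expert) rows the all-to-all would send to each GPU. It assumes experts are placed in contiguous, equal-sized blocks per GPU; the helpers `ep_token_counts` and `imbalance` are hypothetical and not part of any library.

```python
from collections import Counter


def ep_token_counts(topk_ids, num_experts, num_gpus):
    """Rows received by each GPU, given each token's chosen expert ids.

    Assumption: expert e lives on GPU e // (num_experts // num_gpus).
    """
    experts_per_gpu = num_experts // num_gpus
    counts = Counter()
    for token_experts in topk_ids:
        for e in token_experts:
            counts[e // experts_per_gpu] += 1
    return [counts[g] for g in range(num_gpus)]


def imbalance(per_gpu):
    """Max load over mean load; 1.0 means perfectly balanced."""
    mean = sum(per_gpu) / len(per_gpu)
    return max(per_gpu) / mean


# Expert 0 is "hot": every token picks it.
topk_ids = [[0, 1], [0, 2], [0, 3], [0, 1]]
per_gpu = ep_token_counts(topk_ids, num_experts=4, num_gpus=2)
print(per_gpu, imbalance(per_gpu))  # [6, 2] 1.5
```

Because every GPU waits for the slowest one, the step time tracks the maximum load rather than the mean, so an imbalance of 1.5 here means roughly 50% more expert work on the critical path than a balanced routing would need.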
Key upstream
- fused_moe/layer.py:73 FusedMoE · :1306 forward
- fused_moe/fused_moe.py:295 fused_moe_kernel · :1587 fused_experts · :1664 fused_experts_impl
- fused_moe/moe_align_block_size.py · moe_permute_unpermute.py · all2all_utils.py (EP)
- models/mixtral.py:77 MixtralMoE · models/deepseek_v2.py (shared experts + MLA)
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md