Phase 07 — Cheatsheet: GEMM & MoE Kernels

The one-liner

GEMM = where the FLOPs go (every linear layer). MoE = many expert MLPs + a router; each token runs only its top-k experts → parameter count grows with the number of experts, per-token compute only with k. The work becomes a routed, grouped GEMM.
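The capacity-versus-compute trade can be checked with a few lines of arithmetic. This is a rough sketch, not a model of any specific architecture: the function name `moe_mlp_params`, the layer sizes, and the assumption of a gated MLP with three weight matrices per expert are all illustrative.

```python
def moe_mlp_params(hidden, intermediate, num_experts, top_k, mats_per_expert=3):
    """Return (total, active_per_token) parameter counts for one MoE MLP layer.

    Assumes every expert is a gated MLP with `mats_per_expert` matrices of
    shape hidden x intermediate (gate, up, down) and a linear router.
    """
    per_expert = mats_per_expert * hidden * intermediate
    router = hidden * num_experts
    total = num_experts * per_expert + router   # what must be stored
    active = top_k * per_expert + router        # what one token actually touches
    return total, active


# Hypothetical sizes, chosen only to make the ratio visible.
total, active = moe_mlp_params(hidden=4096, intermediate=14336, num_experts=8, top_k=2)
print(f"total={total:,}  active per token={active:,}  ratio={total / active:.2f}x")
```

With 8 experts and top-2 routing the layer stores roughly four times the weights that any single token multiplies against, which is the "huge capacity, cheap compute" point.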

GEMM kernels

cuBLAS (baseline) · CUTLASS (quantized/composable) · TRTLLM-GEN / CuTeDSL (generated, per-GPU). Specialized per dtype (fp16/fp8/int4); the quant format (Phase 6) picks the kernel.

MoE forward (6 steps)

router (gate linear) → top-k experts + softmax weights → permute (group tokens by expert) → grouped GEMM (each expert on its block) → un-permute → combine (weighted sum). Permute/un-permute = make per-expert work contiguous (big matmuls, not scattered tiny ones).
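The six steps can be traced in a dependency-free sketch. Several things here are simplifying assumptions rather than how any real kernel works: each expert is a single linear layer instead of a full MLP, the softmax is taken over the selected top-k logits only (some models softmax over all experts first), and the "grouped GEMM" is a plain Python loop over per-expert blocks. The names `matvec` and `moe_forward` are illustrative.

```python
import math
import random


def matvec(W, x):
    """Multiply a [out][in] matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def moe_forward(tokens, gate_w, experts, top_k):
    num_experts = len(experts)
    # 1) router: one gate logit per expert for every token
    logits = [matvec(gate_w, x) for x in tokens]
    # 2) top-k experts + softmax weights over the selected logits
    routes = []  # one (token_idx, expert_idx, weight) entry per token/slot pair
    for t, row in enumerate(logits):
        top = sorted(range(num_experts), key=lambda e: row[e], reverse=True)[:top_k]
        m = max(row[e] for e in top)
        exps = [math.exp(row[e] - m) for e in top]
        z = sum(exps)
        routes += [(t, e, p / z) for e, p in zip(top, exps)]
    # 3) permute: stable sort so each expert's entries are contiguous
    order = sorted(range(len(routes)), key=lambda i: routes[i][1])
    # 4) grouped GEMM: each expert processes its own contiguous block
    permuted_out = []
    start = 0
    while start < len(order):
        e = routes[order[start]][1]
        end = start
        while end < len(order) and routes[order[end]][1] == e:
            end += 1
        block = [tokens[routes[i][0]] for i in order[start:end]]
        permuted_out += [matvec(experts[e], x) for x in block]
        start = end
    # 5) un-permute: put results back in original (token, slot) order
    expert_out = [None] * len(routes)
    for pos, i in enumerate(order):
        expert_out[i] = permuted_out[pos]
    # 6) combine: weighted sum of each token's top-k expert outputs
    out = [[0.0] * len(experts[0]) for _ in tokens]
    for (t, _, w), y in zip(routes, expert_out):
        for j, v in enumerate(y):
            out[t][j] += w * v
    return out


# Tiny usage example with random weights (sizes are arbitrary).
random.seed(0)
H, E, T = 4, 3, 5
rand_mat = lambda rows, cols: [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
out = moe_forward(rand_mat(T, H), rand_mat(E, H), [rand_mat(H, H) for _ in range(E)], top_k=2)
print(len(out), len(out[0]))
```

The sort in step 3 is what turns scattered per-token work into one block per expert; a real implementation would run step 4 as a single grouped GEMM launch instead of a loop.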

Fused MoE

One/few kernels do routing+grouped-GEMM+combine → removes gather/scatter launch + memory overhead. The experts (grouped GEMM) dominate step time; the router is cheap.

Parallelism

  • EP (expert parallel): whole experts on different GPUs; all-to-all ships tokens; watch load balance (hot experts).
  • TP: shard each expert's weights across GPUs (per-layer all-reduce).
  • Often EP for MoE + DP/TP for attention.
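The load-balance concern under EP can be made concrete by counting how many routed token slots land on each GPU. This is a minimal sketch under stated assumptions: experts are placed on ranks in contiguous, equal-sized groups, and `ep_token_load` is a hypothetical helper, not an upstream function.

```python
from collections import Counter


def ep_token_load(topk_ids, num_experts, num_ranks):
    """Count routed token slots per rank and report max/mean imbalance.

    topk_ids: for each token, the list of expert ids it was routed to.
    Assumes expert e lives on rank e // (num_experts // num_ranks).
    """
    assert num_experts % num_ranks == 0, "sketch assumes an even split"
    experts_per_rank = num_experts // num_ranks
    per_expert = Counter(e for ids in topk_ids for e in ids)
    per_rank = [0] * num_ranks
    for e, n in per_expert.items():
        per_rank[e // experts_per_rank] += n
    mean = sum(per_rank) / num_ranks
    imbalance = max(per_rank) / mean if mean else 0.0
    return per_rank, imbalance


# Expert 0 is "hot": every token picks it as one of its two experts.
topk_ids = [[0, 1], [0, 5], [0, 2], [1, 0], [6, 0], [0, 3]]
per_rank, imbalance = ep_token_load(topk_ids, num_experts=8, num_ranks=4)
print(per_rank, round(imbalance, 2))
```

Because the all-to-all and the grouped GEMM both wait on the busiest rank, an imbalance well above 1.0 means the other GPUs sit idle for part of the step.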

Key upstream

  • fused_moe/layer.py:73 FusedMoE · :1306 forward
  • fused_moe/fused_moe.py:295 fused_moe_kernel · :1587 fused_experts · :1664 fused_experts_impl
  • fused_moe/moe_align_block_size.py · moe_permute_unpermute.py · all2all_utils.py (EP)
  • models/mixtral.py:77 MixtralMoE · models/deepseek_v2.py (shared experts + MLA)
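As a reading aid for moe_align_block_size.py, the idea of block alignment can be sketched as follows: group the routed slots by expert and pad each group to a multiple of the kernel's block size, so that every block belongs to exactly one expert. This is an illustration of the concept only; the function name, signature, and the `-1` padding sentinel are assumptions and do not mirror the upstream code.

```python
def align_block_size(expert_ids, num_experts, block_size, pad_id=-1):
    """Group slot indices by expert and pad each group to a block multiple.

    expert_ids: flat list with one expert id per (token, slot) pair.
    Returns (sorted_slots, block_expert), where block_expert[b] is the
    expert that owns block b of sorted_slots.
    """
    buckets = [[] for _ in range(num_experts)]
    for slot, e in enumerate(expert_ids):
        buckets[e].append(slot)
    sorted_slots, block_expert = [], []
    for e, slots in enumerate(buckets):
        pad = (-len(slots)) % block_size          # slots needed to reach a multiple
        padded = slots + [pad_id] * pad
        sorted_slots += padded
        block_expert += [e] * (len(padded) // block_size)
    return sorted_slots, block_expert


sorted_slots, block_expert = align_block_size([2, 0, 2, 1, 2, 0], num_experts=3, block_size=4)
print(sorted_slots)
print(block_expert)
```

Padding wastes a little compute on sentinel rows, but it lets each thread block load one expert's weights and run a dense tile without per-row branching.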

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md