Phase 07 — Cheatsheet: GEMM & MoE Kernels

The one-liner
GEMM kernels
MoE forward (6 steps)
Fused MoE
Parallelism
Key upstream

The one-liner

GEMM = where the FLOPs go (every linear layer is a matrix multiply). MoE = many expert MLPs + a router; each token runs only its top-k experts → parameter capacity of all experts at the compute cost of k. The expert work becomes a routed, grouped GEMM.
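The capacity-versus-compute trade can be checked with a few lines of arithmetic. This is a minimal sketch with made-up sizes (`hidden`, `ffn`, `num_experts`, `top_k` are illustrative, not taken from any real model), and it assumes a simple two-matmul expert MLP; gated experts with three matmuls change the constants but not the ratio.

```python
# Illustrative sizes only (hypothetical config, not a real model).
hidden = 1024        # model width
ffn = 4096           # expert MLP inner width
num_experts = 8
top_k = 2


def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs of an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


# Assumed expert shape: up-projection (hidden -> ffn), then down-projection (ffn -> hidden).
params_per_expert = hidden * ffn + ffn * hidden
flops_per_token_per_expert = gemm_flops(1, ffn, hidden) + gemm_flops(1, hidden, ffn)

total_params = num_experts * params_per_expert   # capacity: every expert is stored
active_params = top_k * params_per_expert        # compute: only top-k run per token
flops_per_token = top_k * flops_per_token_per_expert

print(f"stored expert params : {total_params:,}")
print(f"active per token     : {active_params:,}")
print(f"capacity / compute   : {total_params / active_params:.1f}x")
```

With these sizes the ratio is simply `num_experts / top_k`: adding experts grows capacity without growing per-token FLOPs.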

GEMM kernels

cuBLAS (baseline) · CUTLASS (quantized/composable) · TRTLLM-GEN / CuTeDSL (generated, per-GPU). Specialized per dtype (fp16/fp8/int4); the quant format (Phase 6) picks the kernel.

MoE forward (6 steps)

router (gate linear) → top-k experts + softmax weights → permute (group tokens by expert) → grouped GEMM (each expert on its block) → un-permute → combine (weighted sum). Permute/un-permute = make per-expert work contiguous (big matmuls, not scattered tiny ones).
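The six steps can be sketched end to end in plain Python. This is a toy reference under stated assumptions, not the upstream kernel: the helper names (`matvec`, `expert_mlp`, `moe_forward`) are hypothetical, the expert is a ReLU MLP rather than a gated one, and the softmax is taken over the selected top-k logits only (one common variant).

```python
import math
import random


def matvec(w, x):
    """y = W @ x for a row-major list-of-lists W."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def expert_mlp(w_up, w_down, x):
    """Toy expert: down(relu(up(x))). Real experts usually use a gated activation."""
    h = [max(0.0, v) for v in matvec(w_up, x)]
    return matvec(w_down, h)


def moe_forward(tokens, w_gate, experts, top_k):
    num_experts = len(experts)
    # 1. Router: one gate logit per expert for each token.
    logits = [matvec(w_gate, x) for x in tokens]
    # 2. Top-k experts per token, softmax over the selected logits only.
    assignments = []  # (expert_id, token_id, weight)
    for t, row in enumerate(logits):
        top = sorted(range(num_experts), key=lambda e: row[e], reverse=True)[:top_k]
        peak = max(row[e] for e in top)
        exps = [math.exp(row[e] - peak) for e in top]
        z = sum(exps)
        assignments += [(e, t, p / z) for e, p in zip(top, exps)]
    # 3. Permute: sort by expert id so each expert's tokens are contiguous.
    assignments.sort(key=lambda a: a[0])
    # 4. Grouped GEMM: each expert processes its own contiguous block.
    outputs = [expert_mlp(*experts[e], tokens[t]) for e, t, _ in assignments]
    # 5 and 6. Un-permute and combine: weighted sum back into each token's slot.
    dim = len(tokens[0])
    result = [[0.0] * dim for _ in tokens]
    for (_, t, w), y in zip(assignments, outputs):
        for i, v in enumerate(y):
            result[t][i] += w * v
    return result


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
hidden, ffn, num_experts, top_k = 4, 8, 4, 2
w_gate = rand_matrix(num_experts, hidden)
experts = [(rand_matrix(ffn, hidden), rand_matrix(hidden, ffn)) for _ in range(num_experts)]
tokens = rand_matrix(5, hidden)

out = moe_forward(tokens, w_gate, experts, top_k)
print(len(out), len(out[0]))  # one output row per token, same width as the input
```

In a real kernel step 4 is a single grouped GEMM over the sorted buffer; the per-assignment loop here only mirrors the data movement.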

Fused MoE

One or a few kernels do routing + grouped GEMM + combine → removes the separate gather/scatter kernel launches and the extra memory traffic they cause. The experts (grouped GEMM) dominate step time; the router is cheap.
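One way a fused kernel can walk the grouped GEMM in fixed-size tiles is to pad each expert's token group to a multiple of the tile size, so every tile belongs to exactly one expert. The sketch below shows only that bookkeeping; `align_to_blocks`, the `pad` sentinel, and the block size are hypothetical names and values, not the upstream API.

```python
def align_to_blocks(expert_ids, num_experts, block_size, pad=-1):
    """Group token slots by expert and pad each group to a multiple of block_size.

    Returns (sorted_slots, block_expert): sorted_slots holds slot indices (or
    `pad`), and block_expert[b] is the single expert that owns block b.
    """
    groups = [[] for _ in range(num_experts)]
    for slot, e in enumerate(expert_ids):
        groups[e].append(slot)

    sorted_slots, block_expert = [], []
    for e, slots in enumerate(groups):
        num_blocks = -(-len(slots) // block_size)  # ceiling division
        padding = [pad] * (num_blocks * block_size - len(slots))
        sorted_slots += slots + padding
        block_expert += [e] * num_blocks
    return sorted_slots, block_expert


# Flattened top-k assignments: one expert id per (token, k) slot.
expert_ids = [2, 0, 2, 1, 2, 0, 2]
slots, owners = align_to_blocks(expert_ids, num_experts=4, block_size=4)
print(slots)   # slot indices grouped by expert, padded with -1
print(owners)  # expert that owns each block of 4 slots
```

Padding wastes some compute on lightly used experts, which is the price of keeping every tile single-expert; experts with no tokens get no blocks at all.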

Parallelism

EP (expert parallel): whole experts on different GPUs; all-to-all ships tokens; watch load balance (hot experts).
TP: shard each expert's weights across GPUs (per-layer all-reduce).
Common combination: EP for the MoE layers + DP/TP for attention.
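The EP load-balance concern can be made concrete by counting what the all-to-all would ship. This is a minimal sketch, assuming experts are placed in contiguous equal-sized chunks (expert `e` on rank `e // per_rank`); the function names and the routing data are hypothetical.

```python
def ep_dispatch_counts(expert_ids_per_rank, num_experts, world_size):
    """Count how many token slots each rank sends to each rank.

    expert_ids_per_rank[src] lists the expert id chosen for every (token, k)
    slot that currently lives on rank `src`.
    """
    per_rank = num_experts // world_size  # assumed placement: contiguous chunks
    send = [[0] * world_size for _ in range(world_size)]
    for src, ids in enumerate(expert_ids_per_rank):
        for e in ids:
            send[src][e // per_rank] += 1
    return send


def imbalance(send):
    """Max received load divided by mean received load (1.0 = perfectly even)."""
    received = [sum(row[dst] for row in send) for dst in range(len(send))]
    return max(received) / (sum(received) / len(received))


# Made-up routing where expert 0 is hot.
routing = [
    [0, 1, 0, 5, 0, 2],  # slots on rank 0
    [0, 0, 7, 1, 0, 3],  # slots on rank 1
]
send = ep_dispatch_counts(routing, num_experts=8, world_size=2)
print(send)             # send[src][dst] = slots shipped from src to dst
print(imbalance(send))  # > 1.0 means some rank is doing more than its share
```

Because every rank waits for the slowest one, the step time tracks the maximum received load rather than the mean, which is why hot experts matter.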

Key upstream

fused_moe/layer.py:73 FusedMoE · :1306 forward
fused_moe/fused_moe.py:295 fused_moe_kernel · :1587 fused_experts · :1664 fused_experts_impl
fused_moe/moe_align_block_size.py · moe_permute_unpermute.py · all2all_utils.py (EP)
models/mixtral.py:77 MixtralMoE · models/deepseek_v2.py (shared experts + MLA)

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

vLLM Mastery — From Zero to Maintainer