Phase 07 — The Hitchhiker's Guide to GEMM & MoE Kernels


Don't Panic
Step 1: GEMM — the workhorse and its kernels
Step 2: MoE — sparse compute, dense capacity
Step 3: Why fused MoE kernels matter
Step 4: Expert parallelism (EP) — experts across GPUs
The invariants to memorize
What you'll do

Don't Panic

GEMM = General Matrix-Matrix Multiply. It accounts for nearly all the FLOPs in a transformer, because every linear layer is a GEMM. MoE (Mixture of Experts) makes it weirder: instead of one big MLP, there are many expert MLPs, and each token is routed to only a few of them. That turns the dense GEMM into a routed, grouped GEMM. This is the design behind frontier open models (Mixtral, DeepSeek-V3, GPT-OSS) and where a lot of vLLM's current performance work lives. This phase covers the kernels and the MoE machinery that make big sparse models fast.

Dense MLP:  x ─► [ one big W ] ─► y                       (every token, all weights)

MoE MLP:    x ─► gate ─► top-k experts ─► run only those ─► weighted combine
                                 │
                 token 1 → experts {3, 7}     "sparse": each token uses few of many experts
                 token 2 → experts {0, 7}

Step 1: GEMM — the workhorse and its kernels

A transformer is mostly matmuls: QKV projection, attention output, the two MLP matrices, the LM head. Making these fast is the job of GEMM libraries:

cuBLAS: NVIDIA's closed-source BLAS library, the default baseline for dense GEMMs.
CUTLASS: NVIDIA's open, composable GEMM templates; vLLM uses it heavily for quantized GEMMs (FP8/INT8, Phase 6).
TRTLLM-GEN / CuTeDSL: generated or DSL-written kernels tuned per GPU and precision.

The reason there are so many: a GEMM kernel must be tiled to fit the GPU's memory hierarchy and specialized per dtype (fp16 vs fp8 vs int4) to use the right tensor cores. The quant format from Phase 6 dictates which GEMM kernel runs.
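To make "tiled to fit the memory hierarchy" concrete, here is a minimal pure-Python sketch of an output-tiled GEMM that also counts how many input elements it fetches. The names `matmul_tiled` and `reuse` and the traffic model itself are illustrative assumptions, not any library's API: the model charges one load per A or B element fetched for each output tile and ignores K-tiling, caches, and writes to C.

```python
def matmul_naive(A, B):
    """Reference C = A @ B on lists of lists."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def matmul_tiled(A, B, tile_m, tile_n):
    """Compute C = A @ B one (tile_m x tile_n) output tile at a time.

    Returns (C, loads). `loads` counts A and B elements fetched under the
    assumed model: per output tile and per k, one slice of A (tile rows)
    and one slice of B (tile columns) are loaded and then fully reused.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    loads = 0
    for i0 in range(0, M, tile_m):
        rows = range(i0, min(i0 + tile_m, M))
        for j0 in range(0, N, tile_n):
            cols = range(j0, min(j0 + tile_n, N))
            for k in range(K):
                a_col = [A[i][k] for i in rows]   # len(rows) loads
                b_row = [B[k][j] for j in cols]   # len(cols) loads
                loads += len(a_col) + len(b_row)
                # Every loaded value is reused across the whole tile.
                for a, i in zip(a_col, rows):
                    for b, j in zip(b_row, cols):
                        C[i][j] += a * b
    return C, loads


def reuse(M, K, N, loads):
    """FLOPs per element loaded (2 FLOPs per multiply-add)."""
    return 2 * M * K * N / loads


M, K, N = 8, 16, 8
A = [[(i + k) % 5 for k in range(K)] for i in range(M)]
B = [[(k * j) % 7 for j in range(N)] for k in range(K)]

C, loads = matmul_tiled(A, B, tile_m=4, tile_n=4)
assert C == matmul_naive(A, B)
print(reuse(M, K, N, loads))         # 4.0, i.e. 2*4*4 / (4+4)

# Decode-like shape: a single row of activations (M = 1).
_, loads_decode = matmul_tiled(A[:1], B, tile_m=1, tile_n=8)
print(reuse(1, K, N, loads_decode))  # about 1.78 (16/9)
```

Under this model the reuse of a full tile is 2·tile_m·tile_n / (tile_m + tile_n), so bigger tiles do more math per byte fetched. With a single row (tile_m = 1) the ratio is 2·tile_n / (1 + tile_n), which stays below 2 however wide the tile gets.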

Step 2: MoE — sparse compute, dense capacity

A MoE layer replaces the dense MLP with E experts (each its own MLP). A router (a small linear "gate") scores the experts per token; each token goes to its top-k (e.g. top-2). So a model can have huge total parameters (capacity) but activate only a few experts per token (cheap compute). DeepSeek-V3, for example, has 256 routed experts per MoE layer (plus one shared expert that every token uses) and activates 8 routed experts per token.
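The capacity-versus-compute trade can be checked with simple arithmetic. The sketch below counts the weights one MoE layer stores against the weights a single token touches. The function name and the example sizes are hypothetical, and it assumes two weight matrices per expert, as in the plain MLP described earlier (gated MLPs use three).

```python
def moe_mlp_params(hidden, ffn, num_experts, top_k, mats_per_expert=2):
    """Return (total, active) weight counts for one MoE layer.

    total  = weights stored (every expert plus the router)
    active = weights one token actually uses (top_k experts plus the router)
    """
    per_expert = mats_per_expert * hidden * ffn
    router = hidden * num_experts          # the small linear gate
    total = num_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active


# Hypothetical sizes, chosen only for illustration.
total, active = moe_mlp_params(hidden=1024, ffn=4096, num_experts=8, top_k=2)
print(total, active)               # 67117056 16785408
print(round(total / active, 2))    # 4.0
```

With 8 experts and top-2 routing, the layer stores roughly 4× the weights that any one token computes with, which is the "sparse compute, dense capacity" idea in numbers.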

The MoE forward, step by step (you'll build this in lab-01):

1. router:    logits = x @ W_gate        → (tokens, E)
2. top-k:     pick the k highest-scoring experts per token and their weights (here, a softmax over those k logits)
3. permute:   group tokens by their assigned expert (so each expert's tokens are contiguous)
4. grouped GEMM:  run each expert's MLP on its block of tokens
5. un-permute: scatter results back to original token order
6. combine:   weighted sum of each token's k expert outputs (by the gate weights)

Steps 3 & 5 (the permute/un-permute) exist because GPUs want contiguous work per expert — you can't efficiently do "token 1 → expert 3, token 2 → expert 0" as scattered tiny matmuls. Sorting tokens by expert turns it into a few big grouped GEMMs.
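The six steps can be sketched end to end in pure Python. This is a minimal illustration under stated assumptions, not vLLM's implementation: every name is hypothetical, each expert is a toy linear → ReLU → linear MLP, and the "grouped GEMM" is a plain loop over each expert's contiguous block rather than a batched kernel.

```python
import math
import random


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def expert_mlp(x, W1, W2):
    """Toy expert: linear -> ReLU -> linear."""
    return matvec([max(0.0, h) for h in matvec(x, W1)], W2)


def route_top_k(x, W_gate, k):
    """Steps 1-2: gate logits, top-k expert ids, softmax over those k logits."""
    logits = matvec(x, W_gate)
    ids = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in ids)
    exps = [math.exp(logits[e] - peak) for e in ids]
    return ids, [v / sum(exps) for v in exps]


def moe_forward(X, W_gate, experts, k):
    """X: list of token vectors; experts: list of (W1, W2) pairs."""
    routes = [route_top_k(x, W_gate, k) for x in X]
    # Step 3 (permute): flatten to (expert, token, weight) triples and sort
    # by expert so each expert's work is one contiguous block.
    pairs = [(ids[s], t, w[s]) for t, (ids, w) in enumerate(routes) for s in range(k)]
    pairs.sort(key=lambda p: p[0])
    # Step 4 (grouped GEMM stand-in): run experts over the sorted blocks.
    outs = [expert_mlp(X[t], *experts[e]) for e, t, _ in pairs]
    # Steps 5-6 (un-permute + combine): scatter back to token order,
    # accumulating the gate-weighted expert outputs.
    d_out = len(experts[0][1][0])
    Y = [[0.0] * d_out for _ in X]
    for (_, t, w), out in zip(pairs, outs):
        for j in range(d_out):
            Y[t][j] += w * out[j]
    return Y


def moe_reference(X, W_gate, experts, k):
    """Per-token loop with no permutation, used to check moe_forward."""
    Y = []
    for x in X:
        ids, w = route_top_k(x, W_gate, k)
        outs = [expert_mlp(x, *experts[e]) for e in ids]
        Y.append([sum(w[s] * outs[s][j] for s in range(k)) for j in range(len(outs[0]))])
    return Y


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


T, D, H, E, TOP_K = 6, 4, 8, 4, 2   # tokens, model dim, expert hidden dim, experts, k
X = rand_matrix(T, D)
W_gate = rand_matrix(D, E)
experts = [(rand_matrix(D, H), rand_matrix(H, D)) for _ in range(E)]

Y = moe_forward(X, W_gate, experts, TOP_K)
Y_ref = moe_reference(X, W_gate, experts, TOP_K)
assert all(abs(a - b) < 1e-9 for ra, rb in zip(Y, Y_ref) for a, b in zip(ra, rb))
```

The final assertion is the point of the exercise: sorting by expert and scattering back changes the order of the work, not the result.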

Step 3: Why fused MoE kernels matter

Done naively, MoE is a gather + many small GEMMs + a scatter — launch-bound and memory-bound (Phase 5's enemy, at the kernel level). Fused MoE kernels combine routing, the grouped GEMM, and the combine into one (or few) kernels, keeping tensor cores busy and killing launch overhead. This is decisive for MoE throughput and is exactly what vllm/model_executor/layers/fused_moe/ provides (Triton and CUTLASS variants).
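A toy count of kernel launches shows why the unfused path is launch-bound. The accounting here is an assumption made for illustration: one tiny GEMM per (token, expert) pair in the fully scattered case, a gather plus a GEMM plus a scatter per active expert in the grouped-but-unfused case, and a single launch for a fused kernel (real fused paths may use a few). `count_launches` is a hypothetical helper, not part of vLLM.

```python
def count_launches(assignments):
    """assignments[t] lists the experts chosen for token t."""
    pairs = sum(len(experts) for experts in assignments)
    active = {e for experts in assignments for e in experts}
    return {
        "per_pair": pairs,               # one tiny GEMM per (token, expert)
        "per_expert": 3 * len(active),   # gather + GEMM + scatter per active expert
        "fused": 1,                      # assumed single fused kernel
    }


# The two tokens from the diagram at the top of this page.
small = [[3, 7], [0, 7]]
print(count_launches(small))   # {'per_pair': 4, 'per_expert': 9, 'fused': 1}

# A larger batch: 1024 tokens, top-2 routing over 8 experts.
big = [[t % 8, (t + 1) % 8] for t in range(1024)]
print(count_launches(big))     # {'per_pair': 2048, 'per_expert': 24, 'fused': 1}
```

The counts say nothing about memory traffic, which fusion also reduces by keeping intermediate results out of global memory; they only show how quickly launch counts grow when the work is scattered.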

Step 4: Expert parallelism (EP) — experts across GPUs

Experts are independent, so you can place different experts on different GPUs. On each step, tokens are sent to the GPU that hosts their expert (an all-to-all collective), processed there, and sent back. EP scales the number of experts cheaply, at the cost of communication and load balancing (if every token routes to expert 7, the GPU hosting expert 7 becomes the bottleneck). Contrast this with tensor parallelism (Phase 10), which shards each expert's weights across GPUs. Real deployments combine EP for the MoE layers with DP/TP for attention.
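The load-balancing cost is easy to see in a small model where an EP step takes as long as its busiest device. The sketch below assumes experts are placed contiguously and evenly across devices and that time is proportional to tokens processed; communication is ignored. The helper names and token counts are hypothetical.

```python
def device_loads(tokens_per_expert, num_devices):
    """Sum token counts per device, with experts placed in contiguous groups."""
    num_experts = len(tokens_per_expert)
    assert num_experts % num_devices == 0, "sketch assumes an even split"
    per_device = num_experts // num_devices
    return [sum(tokens_per_expert[d * per_device:(d + 1) * per_device])
            for d in range(num_devices)]


def step_time(loads):
    """All devices wait for the slowest one, so the step costs the max load."""
    return max(loads)


# Same total work (1024 token-expert assignments), different routing.
balanced = [128] * 8            # every expert gets an equal share
hot = [64] * 7 + [576]          # expert 7 attracts most of the traffic
assert sum(balanced) == sum(hot)

print(step_time(device_loads(balanced, 4)))   # 256
print(step_time(device_loads(hot, 4)))        # 640
```

In this example the hot expert makes the step 2.5× slower at identical total work, because three of the four devices sit idle while the fourth finishes.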

The invariants to memorize

GEMM = the FLOPs; CUTLASS/TRTLLM-GEN/CuTeDSL are the fast, dtype-specialized kernels.
MoE = router → top-k → permute → grouped GEMM → un-permute → weighted combine.
Permute/un-permute exist to make per-expert work contiguous (big GEMMs, not scattered tiny ones).
Fused MoE kernels remove the gather/scatter launch + memory overhead.
EP spreads experts across GPUs (all-to-all + load balancing); TP shards each expert.
The quant format (Phase 6) selects the GEMM kernel.

What you'll do

Read: 01-deep-dive.md — FusedMoE, the fused kernel + fused_experts, permute/un-permute, and a real MoE model (Mixtral), line-anchored.
Build: 02-mini-build.md — top-k routing + grouped experts + combine.
Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
- lab-01-moe-routing [CPU-OK] — implement the full MoE forward in numpy; prove it equals a reference and that permute/un-permute round-trips.
- lab-02-profile-fused-moe [GPU-OPT] — profile fused MoE's share of step time (captured).
- lab-03-tiled-gemm [CPU-OK] — tiling and the memory-traffic model: reuse = harmonic mean of tile dims; why decode (M=1) caps at reuse 2 and no tile size can save it.
- lab-04-expert-load-balance [CPU-OK] — loads, imbalance, EP step time = max device load; prove a hot expert inflates the step >2.5× at identical total work; capacity-factor drops.
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.