Phase 07 — Interview Questions: GEMM & MoE Kernels

Q1. What is MoE and why is it attractive?

Model answer

A Mixture-of-Experts (MoE) layer replaces the dense MLP with many expert MLPs plus a router that sends each token to its top-k experts (e.g. 2 of 8 in Mixtral). The model therefore has a huge total parameter count (capacity/quality) but activates only a few experts per token (cheap compute). DeepSeek-V3 has 256 routed experts, 8 of them active per token. The cost shifts from FLOPs to routing tokens to the right experts and to holding all those weights in memory.
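The gap between stored and active parameters can be made concrete with a little arithmetic. The sketch below assumes a gated MLP with three weight matrices per expert and uses illustrative sizes that do not correspond to any particular model; `moe_mlp_params` is a hypothetical helper name.

```python
def moe_mlp_params(d_model, d_ff, n_experts, top_k):
    """Count MoE MLP weights for one layer: (total stored, active per token)."""
    # Assumption: each expert is a gated MLP with gate, up, and down projections.
    per_expert = 3 * d_model * d_ff
    # The router is a single small linear layer: one score per expert.
    router = d_model * n_experts
    total = n_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active


# Illustrative sizes only (not taken from a real model config).
total, active = moe_mlp_params(d_model=1024, d_ff=4096, n_experts=8, top_k=2)
print(f"stored: {total:,}  active per token: {active:,}  ratio: {total / active:.2f}")
```

All of the stored weights must sit in GPU memory, but each token only pays the FLOPs of its `top_k` experts, which is the trade described above.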

Q2. Walk through the MoE forward.

Model answer

router (small linear) → top-k expert selection + softmax weights → permute tokens so each expert's tokens are contiguous → grouped GEMM (run each expert's MLP on its block) → un-permute back to token order → weighted combine of each token's k expert outputs. Permute/un-permute exist so the GPU does a few big matmuls instead of many scattered tiny ones. (See fused_moe/fused_moe.py, moe_align_block_size.py, layer.py.)
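The six steps can be sketched in NumPy as a reference for the data flow. This is a minimal illustration, not the code in the files named above: it assumes a plain ReLU MLP per expert, applies softmax over the selected top-k scores only (real models differ on where the softmax goes), and emulates the grouped GEMM with a Python loop over contiguous expert blocks. `moe_forward` and `moe_reference` are hypothetical names.

```python
import numpy as np


def moe_forward(x, w_router, w_up, w_down, top_k):
    """x: (T, D); w_router: (D, E); w_up: (E, D, F); w_down: (E, F, D)."""
    T, D = x.shape
    E = w_router.shape[1]

    # 1. Router: one small linear layer, one score per expert.
    logits = x @ w_router                                    # (T, E)

    # 2. Top-k selection, then softmax over the selected scores (assumed order).
    topk_ids = np.argsort(-logits, axis=1)[:, :top_k]        # (T, k)
    topk_logits = np.take_along_axis(logits, topk_ids, axis=1)
    topk_logits = topk_logits - topk_logits.max(axis=1, keepdims=True)
    weights = np.exp(topk_logits)
    weights /= weights.sum(axis=1, keepdims=True)            # (T, k)

    # 3. Permute: sort the T*k (token, slot) pairs by expert id.
    flat_expert = topk_ids.reshape(-1)                       # (T*k,)
    order = np.argsort(flat_expert, kind="stable")
    x_sorted = x[order // top_k]                             # (T*k, D)

    # 4. Grouped GEMM, emulated as one matmul per contiguous expert block.
    counts = np.bincount(flat_expert, minlength=E)
    y_sorted = np.empty_like(x_sorted)
    start = 0
    for e in range(E):
        end = start + counts[e]
        hidden = np.maximum(x_sorted[start:end] @ w_up[e], 0.0)  # ReLU (assumed)
        y_sorted[start:end] = hidden @ w_down[e]
        start = end

    # 5. Un-permute back to (token, slot) order.
    y = np.empty_like(y_sorted)
    y[order] = y_sorted
    y = y.reshape(T, top_k, D)

    # 6. Weighted combine of each token's k expert outputs.
    return (weights[:, :, None] * y).sum(axis=1)


def moe_reference(x, w_router, w_up, w_down, top_k):
    """Naive per-token loop used only to check the permuted version."""
    out = np.zeros_like(x)
    logits = x @ w_router
    for t in range(x.shape[0]):
        ids = np.argsort(-logits[t])[:top_k]
        p = np.exp(logits[t, ids] - logits[t, ids].max())
        p /= p.sum()
        for w, e in zip(p, ids):
            out[t] += w * (np.maximum(x[t] @ w_up[e], 0.0) @ w_down[e])
    return out


rng = np.random.default_rng(0)
T, D, F, E, K = 16, 8, 32, 4, 2
x = rng.normal(size=(T, D))
w_router = rng.normal(size=(D, E))
w_up = rng.normal(size=(E, D, F))
w_down = rng.normal(size=(E, F, D))
out = moe_forward(x, w_router, w_up, w_down, top_k=K)
print(out.shape, np.allclose(out, moe_reference(x, w_router, w_up, w_down, top_k=K)))
```

The permuted version issues at most one matmul pair per expert regardless of token count, while the reference issues one per (token, expert) pair; that difference is what the permute step buys.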

Q3. Why fused MoE kernels?

Model answer

Naive MoE is a gather, then many small per-expert GEMMs, then a scatter. Each small GEMM is its own kernel launch and moves little data per launch, so the layer ends up launch-bound and memory-bound. A fused kernel does routing/grouped-GEMM/combine in one (or a few) kernels indexed by a sorted token→expert map, keeping tensor cores busy and removing launch overhead. It's decisive for MoE throughput (the profile in lab-02 shows the grouped GEMM dominating).
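One way to picture the sorted token→expert map is as a block-aligned index list: each expert's rows are grouped and padded up to a multiple of the kernel's tile size, so every tile belongs to exactly one expert. The sketch below is a simplified pure-Python illustration of that idea, not the actual implementation; `align_to_blocks` is a hypothetical name, and the use of an out-of-range index as the padding sentinel is an assumption made for this example.

```python
def align_to_blocks(expert_ids, num_experts, block_size):
    """Group flat (token, slot) indices by expert and pad each group to a
    multiple of block_size, so every block maps to exactly one expert."""
    pad = len(expert_ids)  # sentinel index meaning "padding row, skip it"
    groups = [[] for _ in range(num_experts)]
    for idx, e in enumerate(expert_ids):
        groups[e].append(idx)

    sorted_ids, block_expert = [], []
    for e, rows in enumerate(groups):
        n_blocks = -(-len(rows) // block_size)  # ceiling division
        sorted_ids.extend(rows + [pad] * (n_blocks * block_size - len(rows)))
        block_expert.extend([e] * n_blocks)
    return sorted_ids, block_expert


# Six routed rows over three experts, with a tile size of four rows.
ids, blocks = align_to_blocks([2, 0, 2, 1, 2, 0], num_experts=3, block_size=4)
print(ids)     # [1, 5, 6, 6, 3, 6, 6, 6, 0, 2, 4, 6]
print(blocks)  # [0, 1, 2]
```

A single kernel can then walk `blocks`, load the matching expert's weights once per tile, and skip rows marked with the sentinel, instead of launching a separate GEMM per expert.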

Q4. Expert parallelism vs tensor parallelism for MoE?

Model answer

EP places whole experts on different GPUs; tokens are shipped to their expert's GPU via all-to-all and back. It scales expert count cheaply but adds communication and load-balancing risk (a hot expert bottlenecks its GPU). TP shards each expert's weights across GPUs (per-layer all-reduce). Real deployments often use EP for the MoE layers and DP/TP for attention, since the two have different parallelism sweet spots.
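The load-balancing risk is easy to see by counting where routed tokens land. This is a toy sketch under stated assumptions: experts are split evenly and contiguously across GPUs, and `tokens_per_gpu` is a hypothetical helper that stands in for the bookkeeping an all-to-all dispatch would need.

```python
def tokens_per_gpu(expert_ids, num_experts, num_gpus):
    """Count routed tokens per GPU when experts are split evenly across GPUs."""
    experts_per_gpu = num_experts // num_gpus
    load = [0] * num_gpus
    for e in expert_ids:
        load[e // experts_per_gpu] += 1  # GPU that owns expert e
    return load


# Expert 0 is "hot": it receives 5 of the 12 routed tokens.
routed = [0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7]
load = tokens_per_gpu(routed, num_experts=8, num_gpus=4)
imbalance = max(load) / (sum(load) / len(load))
print(load, imbalance)  # [6, 2, 2, 2] 2.0
```

Because the step finishes only when the busiest GPU finishes, an imbalance of 2.0 here means the layer runs at roughly half the throughput of a perfectly balanced routing.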

Q5. What sets which GEMM kernel runs?

Model answer

The dtype/quant format (Phase 6) and the hardware. CUTLASS/TRTLLM-GEN/CuTeDSL provide kernels specialized per precision (fp16/fp8/int4) and tiled to the GPU's memory hierarchy; a quantized weight needs the matching kernel (e.g. Marlin for INT4, scaled-mm for FP8). A mismatched kernel either gives wrong results or runs slowly, which is why the quant format and the kernel are chosen together.
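The "chosen together" point can be sketched as a lookup that refuses to guess. Everything here is hypothetical: the table, the `select_gemm_kernel` function, and the string labels are illustrative only and are not the API of any library named above.

```python
# Hypothetical mapping from weight format to kernel family (labels only).
KERNEL_FOR_FORMAT = {
    "fp16": "dense fp16 GEMM",
    "fp8": "scaled-mm",
    "int4": "Marlin",
}


def select_gemm_kernel(weight_format, formats_supported_by_gpu):
    """Return the kernel family for a weight format, or fail loudly."""
    kernel = KERNEL_FOR_FORMAT.get(weight_format)
    if kernel is None:
        raise ValueError(f"no kernel registered for format {weight_format!r}")
    if weight_format not in formats_supported_by_gpu:
        raise ValueError(f"this GPU has no fast path for {weight_format!r}")
    return kernel


print(select_gemm_kernel("int4", {"fp16", "int4"}))  # Marlin
```

Failing at selection time is the design choice worth noting: silently running an INT4 weight through an fp16 kernel would produce garbage rather than an error.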

Rapid-fire

  • MoE step order? router → top-k → permute → grouped GEMM → un-permute → combine.
  • Why permute? contiguous per-expert work → big matmuls, not scattered tiny ones.
  • EP shards? whole experts (all-to-all). TP shards? each expert's weights.
  • Router cost? tiny; the experts (GEMM) dominate.
  • Famous open MoEs? Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS.