Phase 07 — Interview Questions: GEMM & MoE Kernels

Q1. What is MoE and why is it attractive?

Model answer

A Mixture-of-Experts (MoE) layer replaces the dense MLP with many expert MLPs plus a router that sends each token to its top-k experts (e.g. 2 of 8 in Mixtral). The model therefore has a very large total parameter count (capacity and quality) but activates only a few experts per token, so compute per token stays low. DeepSeek-V3 has 256 routed experts, 8 of them active per token (plus a shared expert). The cost shifts from FLOPs to routing tokens to the right experts and to holding all the expert weights in memory.
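The total-versus-active gap can be made concrete with a little arithmetic. The sketch below is illustrative only: `moe_param_counts` is a hypothetical helper, it assumes each expert is a gated MLP with three projection matrices, it ignores the router and any shared expert, and the sizes are made up rather than taken from a real model config.

```python
def moe_param_counts(hidden, intermediate, num_experts, top_k):
    """Return (total, active-per-token) expert parameter counts for one MoE layer.

    Assumption: each expert is a gated MLP (gate, up, and down projections),
    i.e. three hidden x intermediate matrices. Router and shared experts are ignored.
    """
    per_expert = 3 * hidden * intermediate
    return num_experts * per_expert, top_k * per_expert


# Illustrative sizes only, not any specific model's configuration.
total, active = moe_param_counts(hidden=4096, intermediate=2048, num_experts=256, top_k=8)
ratio = total // active
print(f"total={total:,} active={active:,} ratio={ratio}x")
```

With 256 experts and top-8 routing, the layer stores 32 times more expert weights than it touches for any one token, which is the memory-for-FLOPs trade described above.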

Q2. Walk through the MoE forward.

Model answer

router (small linear) → top-k expert selection + softmax weights → permute tokens so each expert's tokens are contiguous → grouped GEMM (run each expert's MLP on its block) → un-permute back to token order → weighted combine of each token's k expert outputs. Permute/un-permute exist so the GPU does a few big matmuls instead of many scattered tiny ones. (fused_moe/fused_moe.py, moe_align_block_size.py, layer.py.)
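The steps above can be sketched in plain Python. This is a toy under stated assumptions, not vLLM's implementation: the experts are small linear → ReLU → linear MLPs, the softmax is taken over the selected top-k logits (models differ on this detail), and all function names are hypothetical. The permuted version is checked against a naive per-token loop.

```python
import math
import random
from itertools import groupby


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def expert_mlp(weights, x):
    """Toy expert: linear -> ReLU -> linear (real experts are usually gated MLPs)."""
    W1, W2 = weights
    return matvec(W2, [max(0.0, h) for h in matvec(W1, x)])


def route(router_W, x, top_k):
    """Return [(expert_id, weight)], softmax taken over the top-k logits (an assumption)."""
    logits = matvec(router_W, x)
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    exps = [math.exp(logits[e] - logits[top[0]]) for e in top]
    total = sum(exps)
    return [(e, v / total) for e, v in zip(top, exps)]


def moe_forward(tokens, router_W, experts, top_k):
    # Router + top-k: flat list of (expert, token index, weight).
    pairs = [(e, t, w) for t, x in enumerate(tokens) for e, w in route(router_W, x, top_k)]
    # Permute: sort so each expert's tokens are contiguous.
    pairs.sort(key=lambda p: p[0])
    out = [[0.0] * len(tokens[0]) for _ in tokens]
    # Stand-in for the grouped GEMM: one pass per expert over its contiguous block.
    for e, block in groupby(pairs, key=lambda p: p[0]):
        for _, t, w in block:
            y = expert_mlp(experts[e], tokens[t])
            # Un-permute via the saved token index, then weighted combine.
            for i, v in enumerate(y):
                out[t][i] += w * v
    return out


def moe_reference(tokens, router_W, experts, top_k):
    """Naive per-token loop, used only to check the permuted version."""
    out = []
    for x in tokens:
        acc = [0.0] * len(x)
        for e, w in route(router_W, x, top_k):
            acc = [a + w * v for a, v in zip(acc, expert_mlp(experts[e], x))]
        out.append(acc)
    return out


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
H, F, E, K, T = 4, 8, 6, 2, 5  # hidden, intermediate, experts, top-k, tokens
router_W = rand_matrix(E, H)
experts = [(rand_matrix(F, H), rand_matrix(H, F)) for _ in range(E)]
tokens = rand_matrix(T, H)
fast = moe_forward(tokens, router_W, experts, K)
slow = moe_reference(tokens, router_W, experts, K)
```

Both paths produce the same output up to floating-point summation order. On a GPU the inner per-expert loop becomes one matmul over the contiguous block, which is the point of the permutation.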

Q3. Why fused MoE kernels?

Model answer

Naive MoE is a gather, many small per-expert GEMMs, and a scatter, which makes it launch-bound and memory-bound. A fused kernel does routing, grouped GEMM, and combine in one kernel (or a few), indexed by a sorted token→expert map; this keeps the tensor cores busy and removes launch overhead. It is decisive for MoE throughput (the profile in lab-02 shows the grouped GEMM dominating).
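One way to picture the sorted token→expert map is as a list of flat (token, slot) indices grouped by expert, with each expert's run padded to a multiple of the kernel's tile size so that every tile belongs to exactly one expert. The sketch below illustrates that idea only; `align_block_size` is a hypothetical helper, and its signature and padding sentinel are assumptions, not those of vLLM's moe_align_block_size.py.

```python
def align_block_size(topk_ids, num_experts, block_size, pad_id=-1):
    """Group flat (token, slot) indices by expert and pad each group to a block multiple.

    Returns (sorted_ids, block_expert):
      sorted_ids   - flat indices into topk_ids, expert by expert, pad_id for padding
      block_expert - block_expert[b] is the expert that block b of sorted_ids belongs to
    """
    flat = [e for row in topk_ids for e in row]
    buckets = [[] for _ in range(num_experts)]
    for idx, e in enumerate(flat):
        buckets[e].append(idx)

    sorted_ids, block_expert = [], []
    for e, ids in enumerate(buckets):
        n_blocks = -(-len(ids) // block_size)  # ceiling division
        sorted_ids.extend(ids + [pad_id] * (n_blocks * block_size - len(ids)))
        block_expert.extend([e] * n_blocks)
    return sorted_ids, block_expert


# 4 tokens, top-2 routing over 4 experts; the token index is flat_index // top_k.
topk_ids = [[2, 0], [2, 1], [0, 2], [2, 3]]
sorted_ids, block_expert = align_block_size(topk_ids, num_experts=4, block_size=2)
print(sorted_ids)    # [1, 4, 3, -1, 0, 2, 5, 6, 7, -1]
print(block_expert)  # [0, 1, 2, 2, 3]
```

A kernel launched over the blocks can then look up one expert's weights per tile and skip padded slots, instead of launching a separate small GEMM per expert.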

Q4. Expert parallelism vs tensor parallelism for MoE?

Model answer

EP places whole experts on different GPUs; tokens are shipped to their expert's GPU via all-to-all and back. It scales expert count cheaply but adds communication and load-balancing risk (a hot expert bottlenecks its GPU). TP shards each expert's weights across GPUs (per-layer all-reduce). Real deployments often use EP for the MoE layers and DP/TP for attention, since the two have different parallelism sweet spots.
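The hot-expert risk can be shown with a small load count. This is a sketch under assumptions: experts are placed on GPUs in contiguous equal-size groups, each (token, expert) pair counts as one unit of work, and `ep_dispatch` and `imbalance` are hypothetical helpers rather than any library's API.

```python
def ep_dispatch(token_experts, num_experts, num_gpus):
    """Count (token, expert) pairs landing on each GPU under expert parallelism.

    Assumption: expert e lives on GPU e // (num_experts // num_gpus).
    """
    per_gpu = num_experts // num_gpus
    load = [0] * num_gpus
    for experts in token_experts:
        for e in experts:
            load[e // per_gpu] += 1  # this token is shipped to the expert's GPU
    return load


def imbalance(load):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    return max(load) / (sum(load) / len(load))


# Four tokens with top-2 routing over 8 experts on 4 GPUs.
balanced = ep_dispatch([[0, 2], [1, 3], [4, 6], [5, 7]], num_experts=8, num_gpus=4)
skewed = ep_dispatch([[0, 1], [0, 1], [0, 2], [0, 7]], num_experts=8, num_gpus=4)
print(balanced, imbalance(balanced))  # [2, 2, 2, 2] 1.0
print(skewed, imbalance(skewed))      # [6, 1, 0, 1] 3.0
```

In the skewed case every token picks expert 0, so GPU 0 does three times the average work while another GPU sits idle; the step finishes only when the busiest GPU does.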

Q5. What sets which GEMM kernel runs?

Model answer

The dtype/quant format (Phase 6) and the hardware. CUTLASS/TRTLLM-GEN/CuTeDSL provide kernels specialized per precision (fp16/fp8/int4) and tiled to the GPU's memory hierarchy; a quantized weight needs the matching kernel (e.g. Marlin for INT4, scaled-mm for FP8). A mismatched kernel gives wrong results or runs slowly, which is why the quant format and the kernel are chosen together.
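The "chosen together" point can be sketched as a lookup keyed on weight format and gated on hardware. Everything here is hypothetical: `KERNEL_TABLE`, `select_kernel`, the kernel-family strings, and the minimum compute capabilities are illustrative assumptions, not a real library's registry.

```python
# Hypothetical registry: weight format -> (kernel family, minimum compute capability).
# The capability thresholds are illustrative assumptions.
KERNEL_TABLE = {
    "fp16": ("dense_gemm", (7, 0)),
    "fp8": ("scaled_mm", (8, 9)),
    "int4": ("marlin", (8, 0)),
}


def select_kernel(weight_format, compute_capability):
    """Pick a kernel family for a weight format, failing fast on a mismatch."""
    if weight_format not in KERNEL_TABLE:
        raise ValueError(f"no kernel registered for weight format {weight_format!r}")
    name, min_cc = KERNEL_TABLE[weight_format]
    if compute_capability < min_cc:
        raise ValueError(
            f"{name} needs compute capability >= {min_cc}, got {compute_capability}"
        )
    return name


print(select_kernel("int4", (8, 0)))  # marlin
print(select_kernel("fp8", (9, 0)))   # scaled_mm
```

Failing at selection time is the design choice worth noting: a silent fallback to a generic kernel would hide exactly the wrong-or-slow mismatch described above.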

Rapid-fire

MoE step order? router → top-k → permute → grouped GEMM → un-permute → combine.
Why permute? contiguous per-expert work → big matmuls, not scattered tiny ones.
EP shards? whole experts (all-to-all). TP shards? each expert's weights.
Router cost? tiny; the experts (GEMM) dominate.
Famous open MoEs? Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS.
