Phase 07 — Deep Dive: fused MoE in real vLLM
Paths relative to upstream/ at v0.22.1 @ 0decac0.
- vllm/model_executor/layers/fused_moe/layer.py: FusedMoE nn.Module (the layer)
- vllm/model_executor/layers/fused_moe/fused_moe.py: the Triton fused kernel + fused_experts
- vllm/model_executor/layers/fused_moe/moe_permute_unpermute.py: permute/un-permute
- vllm/model_executor/layers/fused_moe/moe_align_block_size.py: align tokens to GEMM tiles
- vllm/model_executor/layers/fused_moe/fused_moe_method_base.py: the method base (quant-aware)
- vllm/model_executor/models/mixtral.py: a real MoE model
- vllm/model_executor/models/deepseek_v2.py: DeepSeek MoE (shared experts + MLA)
Contents
- 1. The layer: FusedMoE
- 2. The fused kernel: fused_moe.py
- 3. Permute / align: making per-expert work contiguous
- 4. Routing (top-k) and combine
- 5. A real MoE model: Mixtral
- 6. Expert parallelism
- Reading checklist
1. The layer: FusedMoE
vllm/model_executor/layers/fused_moe/layer.py:73 — class FusedMoE(PluggableLayer). This is what
a model instantiates (see Mixtral below). It holds all experts' weights as stacked tensors
(shape roughly (E, ...)) and a quant method (Phase 6 — MoE weights are often quantized too). Its
forward (:1306) takes the hidden states and the router logits and returns the combined
output. It hides the routing + grouped GEMM + combine behind one call — the model just does
self.experts(hidden_states, router_logits).
2. The fused kernel: fused_moe.py
vllm/model_executor/layers/fused_moe/fused_moe.py:
- fused_moe_kernel (:295): the Triton kernel doing the grouped expert GEMM (one program per tile, looking up which expert a tile belongs to).
- fused_moe_kernel_gptq_awq (:61) is the quantized variant (Phase 6 formats need their own MoE kernel).
- fused_experts (:1587) / fused_experts_impl (:1664): the host-side orchestration, which aligns tokens to block size, runs the kernel for the up/gate projection, applies the activation, runs the down projection, and combines. Read fused_experts_impl to see the full sequence; it's the guide's 6 steps in code.
The win vs naive: instead of a Python loop of E small matmuls, one kernel processes all tokens
for all experts, indexed by a sorted token→expert mapping. That's the "fused" in fused MoE.
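The contrast between the naive loop and the sorted path can be sketched in NumPy. This is a toy reference, not the Triton kernel: moe_naive, moe_sorted and expert_mlp are hypothetical names, each expert is assumed to be a gated MLP with its gate and up halves packed side by side in one matrix, and the Python loop over contiguous per-expert segments only stands in for what the fused kernel does in a single launch.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def expert_mlp(x, w_gate_up, w_down):
    # x: (n, D); w_gate_up: (D, 2*I); w_down: (I, D)
    # Assumption: first half of the projection is the gate, second half is "up".
    gate, up = np.split(x @ w_gate_up, 2, axis=-1)
    return (silu(gate) * up) @ w_down

def moe_naive(x, w_gate_up, w_down, topk_ids, topk_w):
    # One tiny matmul per (token, selected expert) pair.
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(topk_ids.shape[1]):
            e = topk_ids[t, j]
            out[t] += topk_w[t, j] * expert_mlp(x[t:t + 1], w_gate_up[e], w_down[e])[0]
    return out

def moe_sorted(x, w_gate_up, w_down, topk_ids, topk_w):
    num_tokens, k = topk_ids.shape
    flat_ids = topk_ids.reshape(-1)                # expert id per (token, slot)
    order = np.argsort(flat_ids, kind="stable")    # group assignments by expert
    token_of = order // k                          # source token of each sorted row
    xs = x[token_of]                               # permuted activations
    ys = np.empty_like(xs)
    counts = np.bincount(flat_ids, minlength=w_gate_up.shape[0])
    start = 0
    for e, n in enumerate(counts):                 # one matmul per contiguous segment
        ys[start:start + n] = expert_mlp(xs[start:start + n], w_gate_up[e], w_down[e])
        start += n
    out = np.zeros_like(x)
    # Un-permute and combine: weighted scatter-add back to each token's row.
    np.add.at(out, token_of, ys * topk_w.reshape(-1)[order][:, None])
    return out

# Toy sizes: tokens, hidden dim, intermediate dim, experts, top-k.
rng = np.random.default_rng(0)
T, D, I, E, K = 6, 4, 8, 4, 2
x = rng.standard_normal((T, D))
w_gate_up = rng.standard_normal((E, D, 2 * I))     # stacked expert weights, (E, ...)
w_down = rng.standard_normal((E, I, D))
topk_ids = np.argsort(rng.random((T, E)), axis=1)[:, :K]   # K distinct experts per token
topk_w = rng.random((T, K))
topk_w /= topk_w.sum(axis=1, keepdims=True)

same = np.allclose(moe_naive(x, w_gate_up, w_down, topk_ids, topk_w),
                   moe_sorted(x, w_gate_up, w_down, topk_ids, topk_w))
```

Both functions compute the same result; the sorted version replaces T*K one-row matmuls with at most E larger ones, which is the shape of work a tiled GPU kernel wants.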
3. Permute / align: making per-expert work contiguous
- moe_align_block_size.py: sorts/pads tokens so each expert's tokens form contiguous, tile-aligned blocks the GEMM kernel can chew through efficiently. This is the practical form of the guide's "permute" step.
- moe_permute_unpermute.py: the explicit permute (group by expert) and un-permute (scatter back) used by some paths.
Either way the principle is the same: sort tokens by expert → big grouped GEMM → scatter back.
Your lab-01 does this with argsort, which is exactly the idea minus the tile alignment.
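The tile alignment that argsort alone leaves out can be sketched as follows. This is a minimal illustration of the idea, not the vLLM function: align_block_size is a hypothetical name, and the return layout (a padded list of sorted assignment indices plus one expert id per block, with the total assignment count used as the padding sentinel) is an assumption made for the example.

```python
import numpy as np

def align_block_size(topk_ids, block_size, num_experts):
    flat = topk_ids.reshape(-1)
    pad = flat.size                      # sentinel: "this slot is padding"
    sorted_ids, block_experts = [], []
    for e in range(num_experts):
        idx = np.flatnonzero(flat == e)  # assignments routed to expert e
        n_blocks = -(-idx.size // block_size)          # ceil division
        padded = np.full(n_blocks * block_size, pad, dtype=np.int64)
        padded[:idx.size] = idx
        sorted_ids.extend(padded.tolist())
        block_experts.extend([e] * n_blocks)           # every block has one expert
    return (np.array(sorted_ids, dtype=np.int64),
            np.array(block_experts, dtype=np.int64))

topk_ids = np.array([[0, 2], [1, 2], [2, 0]])          # 3 tokens, top-2
sorted_ids, block_experts = align_block_size(topk_ids, block_size=2, num_experts=3)
print(sorted_ids.tolist())     # [0, 5, 2, 6, 1, 3, 4, 6]
print(block_experts.tolist())  # [0, 1, 2, 2]
```

Because every block of block_size entries belongs to exactly one expert, a kernel program handling one tile needs a single lookup in block_experts to pick its weight matrix; entries equal to the sentinel (6 here) are skipped.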
4. Routing (top-k) and combine
The router is a small linear (gate) producing (tokens, E) logits. Selecting top-k experts and
their normalized weights happens in the layer/kernel path (look for topk / select_experts in
layer.py and fused_moe.py). DeepSeek adds grouped top-k (group experts, pick groups first)
and shared experts (always-on experts added to every token) — see deepseek_v2.py. The
combine is a weighted sum of each token's k expert outputs by the gate weights.
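A minimal NumPy sketch of plain top-k routing and the combine step, under stated assumptions: softmax over all experts first, then top-k, then renormalizing the k selected weights so they sum to 1. The names select_topk and combine are hypothetical and are not the vLLM functions; grouped top-k and shared experts are left out.

```python
import numpy as np

def select_topk(router_logits, k, renormalize=True):
    # router_logits: (tokens, E)
    z = router_logits - router_logits.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    topk_ids = np.argsort(-probs, axis=-1, kind="stable")[:, :k]
    topk_w = np.take_along_axis(probs, topk_ids, axis=-1)
    if renormalize:
        topk_w = topk_w / topk_w.sum(axis=-1, keepdims=True)
    return topk_w, topk_ids

def combine(expert_out, topk_w):
    # expert_out: (tokens, k, hidden), the output of each token's k experts
    # topk_w:     (tokens, k), the gate weights for those experts
    return (expert_out * topk_w[..., None]).sum(axis=1)

logits = np.array([[2.0, 1.0, 0.0, -1.0],
                   [0.0, 0.0, 3.0, 1.0]])
weights, ids = select_topk(logits, k=2)
print(ids.tolist())   # [[0, 1], [2, 3]]
```

With renormalization the weights of each row sum to 1, so combining k identical expert outputs returns that output unchanged, which is a handy sanity check when testing a router.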
5. A real MoE model: Mixtral
vllm/model_executor/models/mixtral.py:
- class MixtralMoE(nn.Module) (:77): builds self.experts = FusedMoE(...) (:132) and a gate linear; its forward computes router_logits = gate(x) then self.experts(x, router_logits) (:153). That's the entire MoE block; the complexity is inside FusedMoE. When you add a model (Phase 14), wiring an MoE layer is this small.
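The shape of that wiring can be mimicked with a toy NumPy stand-in. Everything here is hypothetical: ToyExperts and ToyMoEBlock are illustrative classes, not vLLM code, and each expert is reduced to a single matrix. The point is only that the block computes router logits and hands both tensors to one object that hides routing, expert matmuls, and combine.

```python
import numpy as np

class ToyExperts:
    """Hypothetical stand-in with the (hidden_states, router_logits) call shape."""

    def __init__(self, expert_weights, top_k):
        self.w = expert_weights            # stacked weights, shape (E, D, D)
        self.top_k = top_k

    def __call__(self, hidden_states, router_logits):
        # Routing: softmax, top-k, renormalize (an assumed scheme for this toy).
        z = router_logits - router_logits.max(axis=-1, keepdims=True)
        probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
        ids = np.argsort(-probs, axis=-1, kind="stable")[:, :self.top_k]
        wts = np.take_along_axis(probs, ids, axis=-1)
        wts /= wts.sum(axis=-1, keepdims=True)
        # Expert compute: each token through each of its k selected experts.
        per_expert = np.einsum("td,tkdh->tkh", hidden_states, self.w[ids])
        # Combine: weighted sum over the k expert outputs.
        return (per_expert * wts[..., None]).sum(axis=1)

class ToyMoEBlock:
    """The model-side block: a gate projection plus one call into the experts."""

    def __init__(self, gate_weight, experts):
        self.gate_weight = gate_weight     # (D, E)
        self.experts = experts

    def forward(self, x):
        router_logits = x @ self.gate_weight
        return self.experts(x, router_logits)

rng = np.random.default_rng(0)
D, E = 4, 8
block = ToyMoEBlock(rng.standard_normal((D, E)),
                    ToyExperts(rng.standard_normal((E, D, D)), top_k=2))
y = block.forward(rng.standard_normal((5, D)))     # (5, D) in, (5, D) out
```

The forward method of the block is two lines, which matches the claim above: once the experts object exists, the model code only owns the gate.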
6. Expert parallelism
fused_moe/all2all_utils.py, prepare_finalize/, and expert_map_manager.py implement EP: an
expert-to-GPU map, the all-to-all that ships tokens to their expert's GPU and back, and load
handling. EP is configured alongside TP/DP (Phase 10). The key cost is the all-to-all + imbalance
when routing is skewed.
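The effect of skewed routing on EP traffic can be estimated with a few lines of NumPy. This is a back-of-the-envelope sketch, not the vLLM implementation: the contiguous expert-to-rank split and the helper names make_expert_map, dispatch_counts and imbalance are assumptions made for illustration.

```python
import numpy as np

def make_expert_map(num_experts, ep_size):
    # Assumed placement: experts split contiguously, so expert e lives on
    # rank e // (num_experts // ep_size).
    assert num_experts % ep_size == 0
    return np.arange(num_experts) // (num_experts // ep_size)

def dispatch_counts(topk_ids, expert_map, ep_size):
    # Number of (token, expert) assignments each rank receives in the all-to-all.
    dest = expert_map[topk_ids.reshape(-1)]
    return np.bincount(dest, minlength=ep_size)

def imbalance(counts):
    # 1.0 means perfectly even; the busiest rank sets the step time.
    return float(counts.max() / counts.mean())

expert_map = make_expert_map(num_experts=8, ep_size=4)

balanced = np.array([[0, 2], [4, 6], [1, 3], [5, 7]])
skewed = np.array([[0, 1], [0, 1], [0, 2], [1, 0]])

b = dispatch_counts(balanced, expert_map, 4)
s = dispatch_counts(skewed, expert_map, 4)
print(b.tolist(), imbalance(b))   # [2, 2, 2, 2] 1.0
print(s.tolist(), imbalance(s))   # [7, 1, 0, 0] 3.5
```

In the skewed case rank 0 receives 7 of the 8 assignments while two ranks sit idle, so both the all-to-all payload and the expert compute are bottlenecked on a single GPU.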
Reading checklist
- FusedMoE.forward: what two things does it take, and what does it hide?
- fused_experts_impl: find the up/gate GEMM, activation, down GEMM, and combine.
- Why does moe_align_block_size exist (contiguous, tile-aligned per-expert work)?
- In Mixtral, how few lines is the MoE block once FusedMoE exists?
- EP vs TP for MoE: what does each shard, and what communication does each imply?
Now build it: 02-mini-build.md, then the labs.