Phase 07 — Deep Dive: fused MoE in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0.

vllm/model_executor/layers/fused_moe/layer.py                   FusedMoE nn.Module (the layer)
vllm/model_executor/layers/fused_moe/fused_moe.py               the Triton fused kernel + fused_experts
vllm/model_executor/layers/fused_moe/moe_permute_unpermute.py   permute/un-permute
vllm/model_executor/layers/fused_moe/moe_align_block_size.py    align tokens to GEMM tiles
vllm/model_executor/layers/fused_moe/fused_moe_method_base.py   the method base (quant-aware)
vllm/model_executor/models/mixtral.py                           a real MoE model
vllm/model_executor/models/deepseek_v2.py                       DeepSeek MoE (shared experts + MLA)

1. The layer: FusedMoE

vllm/model_executor/layers/fused_moe/layer.py:73, class FusedMoE(PluggableLayer). This is what a model instantiates (see Mixtral below). It holds all experts' weights as stacked tensors (shape roughly (E, ...)) and a quant method (Phase 6; MoE weights are often quantized too). Its forward (:1306) takes the hidden states and the router logits and returns the combined output. It hides the routing, the grouped GEMM, and the combine behind one call: the model just does self.experts(hidden_states, router_logits).

2. The fused kernel: fused_moe.py

vllm/model_executor/layers/fused_moe/fused_moe.py:

  • fused_moe_kernel (:295) — the Triton kernel doing the grouped expert GEMM (one program per tile, looking up which expert a tile belongs to). fused_moe_kernel_gptq_awq (:61) is the quantized variant (Phase 6 formats need their own MoE kernel).
  • fused_experts (:1587) / fused_experts_impl (:1664) — the host-side orchestration: align tokens to block size, run the kernel for the up/gate projection, apply the activation, run the down projection, and combine. Read fused_experts_impl to see the full sequence — it's the guide's 6 steps in code.

The win vs naive: instead of a Python loop of E small matmuls, one kernel processes all tokens for all experts, indexed by a sorted token→expert mapping. That's the "fused" in fused MoE.
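
To make the contrast concrete, here is a minimal pure-Python sketch, not vLLM code. It assumes a SiLU-and-multiply activation between the gate/up and down projections (the source only says "the activation"), and every function name in it is hypothetical. It runs the same top-k MoE two ways: a loop over experts, and a single sweep over (token, slot) pairs sorted by expert, which is the ordering a grouped kernel relies on.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def silu(v):
    return v / (1.0 + math.exp(-v))

def expert_mlp(x, w_gate, w_up, w_down):
    """One expert: gate/up projections, SiLU-and-multiply (assumed), down projection."""
    g = matvec(w_gate, x)
    u = matvec(w_up, x)
    h = [silu(gi) * ui for gi, ui in zip(g, u)]
    return matvec(w_down, h)

def moe_naive(xs, topk_ids, topk_w, experts):
    """Loop over experts; each iteration touches only that expert's tokens."""
    out = [[0.0] * len(xs[0]) for _ in xs]
    for e, weights in enumerate(experts):
        for t, x in enumerate(xs):
            for slot, eid in enumerate(topk_ids[t]):
                if eid == e:
                    y = expert_mlp(x, *weights)
                    out[t] = [o + topk_w[t][slot] * yi for o, yi in zip(out[t], y)]
    return out

def moe_sorted(xs, topk_ids, topk_w, experts):
    """Flatten (token, slot) pairs, sort by expert, sweep once, scatter back."""
    pairs = [(eid, t, slot)
             for t, ids in enumerate(topk_ids)
             for slot, eid in enumerate(ids)]
    pairs.sort()  # all work for expert 0 first, then expert 1, ...
    out = [[0.0] * len(xs[0]) for _ in xs]
    for eid, t, slot in pairs:
        y = expert_mlp(xs[t], *experts[eid])
        out[t] = [o + topk_w[t][slot] * yi for o, yi in zip(out[t], y)]
    return out

# Demo: 3 tokens, hidden size 2, 2 experts, top-2 routing.
# Each expert is (w_gate, w_up, w_down).
experts = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, -1.0]], [[0.5, 0.0], [0.0, 0.5]]),
    ([[0.0, 1.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]], [[1.0, 1.0], [0.0, 1.0]]),
]
xs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
topk_ids = [[0, 1], [1, 0], [0, 1]]
topk_w = [[0.7, 0.3], [0.6, 0.4], [0.5, 0.5]]

naive_out = moe_naive(xs, topk_ids, topk_w, experts)
sorted_out = moe_sorted(xs, topk_ids, topk_w, experts)
```

Both functions produce the same outputs. The sorted version differs only in how the work is ordered: each expert's work becomes one contiguous run, which is what lets a single kernel launch replace E separate matmuls.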

3. Permute / align: making per-expert work contiguous

  • moe_align_block_size.py — sorts/pads tokens so each expert's tokens form contiguous, tile-aligned blocks the GEMM kernel can chew through efficiently. This is the practical form of the guide's "permute" step.
  • moe_permute_unpermute.py — the explicit permute (group by expert) and un-permute (scatter back) used by some paths.

Either way the principle is the same: sort tokens by expert → big grouped GEMM → scatter back. Your lab-01 does this with argsort, which is exactly the idea minus the tile alignment.
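
The tile alignment that the argsort version lacks can be sketched in a few lines. This is an illustration of the idea only, not the signature or return layout of the real moe_align_block_size; the function name, the return values, and the choice of padding sentinel (one past the last valid flattened index) are assumptions of the sketch.

```python
def align_block_size(topk_ids, num_experts, block_size):
    """Sort flattened (token, slot) indices by expert and pad each expert's
    run to a multiple of block_size.

    Returns (sorted_ids, block_expert_ids): sorted_ids holds flattened indices
    into topk_ids, with padding slots set to a sentinel; block_expert_ids has
    one expert id per block of block_size entries.
    """
    flat = [eid for ids in topk_ids for eid in ids]
    pad = len(flat)  # sentinel: one past the last valid flattened index
    sorted_ids, block_expert_ids = [], []
    for e in range(num_experts):
        run = [i for i, eid in enumerate(flat) if eid == e]
        n_blocks = -(-len(run) // block_size)  # ceil division
        run += [pad] * (n_blocks * block_size - len(run))
        sorted_ids += run
        block_expert_ids += [e] * n_blocks
    return sorted_ids, block_expert_ids

# 4 tokens, top-2 routing over 3 experts, tiles of 2 rows.
topk_ids = [[0, 2], [2, 1], [0, 2], [2, 0]]
sorted_ids, block_expert_ids = align_block_size(topk_ids, num_experts=3, block_size=2)
# sorted_ids       -> [0, 4, 7, 8, 3, 8, 1, 2, 5, 6]
# block_expert_ids -> [0, 0, 1, 2, 2]
```

With top-k routing, a flattened index i maps back to token i // k. Because every block belongs to exactly one expert, a kernel program handling block b needs only block_expert_ids[b] to pick its weight matrix, and it skips rows whose index equals the sentinel.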

4. Routing (top-k) and combine

The router is a small linear layer (gate) producing (tokens, E) logits. Selecting the top-k experts and their normalized weights happens in the layer/kernel path (look for topk / select_experts in layer.py and fused_moe.py). DeepSeek adds grouped top-k (group the experts, pick groups first, then pick experts within them) and shared experts (always-on experts applied to every token); see deepseek_v2.py. The combine is a weighted sum of each token's k expert outputs, weighted by the gate weights.
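
As a rough illustration of plain top-k routing and the combine (not DeepSeek's grouped variant, and not vLLM's select_experts), the sketch below applies softmax to the router logits, keeps the k largest probabilities, and renormalizes them to sum to 1. The function names and the renormalize flag are hypothetical, and whether a given model renormalizes is model-specific.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [v / s for v in exps]

def route_topk(router_logits, k, renormalize=True):
    """Return (expert ids, gate weights) for each token."""
    all_ids, all_w = [], []
    for logits in router_logits:
        probs = softmax(logits)
        ids = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
        w = [probs[e] for e in ids]
        if renormalize:
            total = sum(w)
            w = [v / total for v in w]
        all_ids.append(ids)
        all_w.append(w)
    return all_ids, all_w

def combine(expert_outputs, weights):
    """Weighted sum of one token's k expert outputs."""
    dim = len(expert_outputs[0])
    return [sum(w * y[d] for w, y in zip(weights, expert_outputs)) for d in range(dim)]

# 2 tokens, 4 experts, top-2.
router_logits = [[2.0, 0.5, 1.0, -1.0], [0.0, 3.0, 0.0, 1.0]]
topk_ids, topk_w = route_topk(router_logits, k=2)
# topk_ids -> [[0, 2], [1, 3]]

# Combine two (made-up) expert outputs for one token.
mixed = combine([[1.0, 0.0], [0.0, 1.0]], [0.25, 0.75])
# mixed -> [0.25, 0.75]
```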

5. A real MoE model: Mixtral

vllm/model_executor/models/mixtral.py:

  • class MixtralMoE(nn.Module) (:77) — builds self.experts = FusedMoE(...) (:132) and a gate linear; its forward computes router_logits = gate(x) then self.experts(x, router_logits) (:153). That's the entire MoE block — the complexity is inside FusedMoE. When you add a model (Phase 14), wiring an MoE layer is this small.
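
The shape of that wiring can be mimicked without any framework. In the sketch below, ToyFusedMoE and ToyMoEBlock are hypothetical stand-ins, not vLLM classes. The experts are single matrices and routing is top-1 for brevity; the only point is that the model-side forward is two lines once the experts object hides routing and combine.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class ToyFusedMoE:
    """Hypothetical stand-in: owns all expert weights, hides routing and combine."""

    def __init__(self, expert_weights):
        self.expert_weights = expert_weights  # stacked: one matrix per expert

    def __call__(self, hidden_states, router_logits):
        out = []
        for x, logits in zip(hidden_states, router_logits):
            # Top-1 routing for brevity; a real layer does top-k plus a weighted sum.
            e = max(range(len(logits)), key=lambda i: logits[i])
            out.append(matvec(self.expert_weights[e], x))
        return out

class ToyMoEBlock:
    """Model-side block: a gate plus a single call into the experts object."""

    def __init__(self, gate_weight, expert_weights):
        self.gate_weight = gate_weight  # shape (num_experts, hidden)
        self.experts = ToyFusedMoE(expert_weights)

    def forward(self, hidden_states):
        router_logits = [matvec(self.gate_weight, x) for x in hidden_states]
        return self.experts(hidden_states, router_logits)

block = ToyMoEBlock(
    gate_weight=[[1.0, 0.0], [0.0, 1.0]],
    expert_weights=[[[2.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [1.0, 0.0]]],
)
out = block.forward([[3.0, 1.0], [1.0, 4.0]])
# out -> [[6.0, 2.0], [4.0, 1.0]]
```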

6. Expert parallelism

fused_moe/all2all_utils.py, prepare_finalize/, and expert_map_manager.py implement EP: an expert-to-GPU map, the all-to-all that ships tokens to their expert's GPU and back, and load handling. EP is configured alongside TP/DP (Phase 10). The key costs are the all-to-all communication and the load imbalance that appears when routing is skewed.
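
A small sketch shows why skewed routing hurts. It assumes a contiguous expert-to-rank layout, which is a hypothetical choice and not necessarily what expert_map_manager.py does. It counts how many (token, slot) pairs each rank would receive in the dispatch all-to-all, then reports the ratio of the busiest rank's load to the mean load.

```python
from collections import Counter

def build_expert_map(num_experts, num_ranks):
    """Assign experts to ranks in contiguous chunks (a hypothetical layout)."""
    per_rank = -(-num_experts // num_ranks)  # ceil division
    return [e // per_rank for e in range(num_experts)]

def dispatch_counts(topk_ids, expert_to_rank, num_ranks):
    """Count the (token, slot) pairs each rank receives in the dispatch step."""
    counts = Counter(expert_to_rank[eid] for ids in topk_ids for eid in ids)
    return [counts.get(r, 0) for r in range(num_ranks)]

def imbalance(counts):
    """Max load over mean load; 1.0 means perfectly balanced."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean

# 8 experts on 4 ranks; the router heavily favors experts 0 and 1.
expert_to_rank = build_expert_map(num_experts=8, num_ranks=4)
topk_ids = [[0, 1], [0, 5], [1, 0], [0, 7], [1, 2], [0, 1]]
counts = dispatch_counts(topk_ids, expert_to_rank, num_ranks=4)
ratio = imbalance(counts)
# expert_to_rank -> [0, 0, 1, 1, 2, 2, 3, 3]
# counts         -> [9, 1, 1, 1]
# ratio          -> 3.0
```

Here rank 0 does three times the average work while the other ranks mostly wait, so step time is set by the busiest rank rather than by the mean.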

Reading checklist

  • FusedMoE.forward — what two things does it take, and what does it hide?
  • fused_experts_impl — find the up/gate GEMM, activation, down GEMM, and combine.
  • Why does moe_align_block_size exist (contiguous, tile-aligned per-expert work)?
  • In Mixtral, how few lines is the MoE block once FusedMoE exists?
  • EP vs TP for MoE — what does each shard, and what communication does each imply?

Now build it: 02-mini-build.md, then the labs.