Lab 07-02 — Profile the Fused MoE Kernel [GPU-OPT]
You've built the MoE forward (lab-01), the tiling that makes its GEMMs fast (lab-03),
and the balance diagnostics (lab-04). This lab closes the loop with the instrument that
tells you which of those matters on your model, on your hardware, right now: the
profiler. You'll capture a few decode steps of a real MoE model under
torch.profiler and read the kernel-level time breakdown — discovering that the
grouped expert GEMM eats ~40% of the step, the router costs about one percent, and the
permute machinery is visible but minor. That breakdown is the empirical ground truth
that every MoE optimization argument has to answer to.
No GPU? Don't panic. The captured profile below is annotated line by line against labs 01/03/04 — the reading skill transfers intact.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, small MoE, L4, vLLM 0.22.1, trimmed)
- Reading the profile
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Profiling is the difference between optimizing and gesturing. Every phase so far has handed you models of where time goes (roofline, launch counts, traffic formulas, imbalance factors); the profiler is how you check a model against a machine — and the discipline of "predict the breakdown, then look" is what makes profiles informative instead of just colorful. Before running this lab, write down your guesses: what share for the experts? for attention? for the router? The gaps between your guesses and the table below are precisely your remaining misconceptions about MoE — that's the lab.
The kernel-table-reading skill is also the universal entry point to Phase 18 (where
profiling becomes systematic, with nsys/ncu and timeline views). A key_averages()
table sorted by CUDA time is the 80/20 of GPU performance work: ten seconds of looking
tells you which subsystem owns the milliseconds, which is the only question that decides
where engineering effort goes.
Requirements
uv pip install -e ".[vllm]"
# a small MoE checkpoint, e.g. Qwen1.5-MoE-A2.7B or any 0.5–3B-activated MoE on the Hub
Steps
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="<a small MoE model>", gpu_memory_utilization=0.6, max_model_len=1024)
llm.generate(["warmup"] * 4, SamplingParams(max_tokens=8))  # warm up: graph capture, caches, autotune

# Record GPU kernels only; the steady-state decode steps run inside the context manager.
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
    llm.generate(["Explain MoE in one line:"] * 8,
                 SamplingParams(max_tokens=32, temperature=0))

# Kernel-level table, largest total GPU time first.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
Warm up first — profiling a cold engine records CUDA-graph capture, compilation, and autotuning, drowning the steady state you actually care about (the mistake that invalidates more first profiles than any other). Then find: the fused MoE / grouped-GEMM kernels, the align/permute ops, attention, and the router's gate matmul. Compute each one's share.
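The "compute each one's share" step can be sketched as a small bucketing helper. This is a minimal sketch, not part of the lab's code: the bucket names, the substring patterns, and the example timings are all assumptions made for illustration, and the patterns will need adjusting to whatever kernel names your own table prints.

```python
from collections import defaultdict

# Hypothetical substring patterns (assumptions, not vLLM's naming contract).
# Kernel names drift across versions, so edit these to match your table.
BUCKETS = {
    "experts": ("fused_moe",),
    "align/permute": ("moe_align", "permute"),
    "attention": ("flash_attn", "attention"),
    "router": ("gate",),
}


def bucket_for(name):
    """Map a kernel name to the first bucket whose pattern it contains."""
    lowered = name.lower()
    for bucket, patterns in BUCKETS.items():
        if any(p in lowered for p in patterns):
            return bucket
    return "other"


def bucket_shares(rows):
    """rows: iterable of (kernel_name, gpu_time_us). Returns {bucket: percent of total}."""
    totals = defaultdict(float)
    for name, time_us in rows:
        totals[bucket_for(name)] += time_us
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {bucket: 100.0 * t / grand_total for bucket, t in totals.items()}


# Made-up timings for illustration only (not taken from the captured run).
example_rows = [
    ("fused_moe_kernel", 400.0),
    ("moe_align_block_size_kernel", 70.0),
    ("flash_attn_fwd", 200.0),
    ("gate_gemm", 10.0),
    ("rms_norm_kernel", 320.0),
]
shares = bucket_shares(example_rows)
for bucket, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{bucket:14s} {pct:5.1f}%")
```

To feed it a real profile, build the rows from prof.key_averages(), pairing each event's key with its self GPU time. The attribute holding that time is self_device_time_total on recent PyTorch and self_cuda_time_total on older releases, so check which one your version exposes.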
Captured output (real run, small MoE, L4, vLLM 0.22.1, trimmed)
Name                                      CUDA time %
fused_moe_kernel (grouped expert GEMM)    41.2%    <- the experts dominate
moe_align_block_size / permute             6.8%    <- the sort/permute (your argsort)
flash_attn (attention)                    18.5%
rms_norm / residual / misc                 9.0%
gate (router linear)                       1.3%    <- routing is cheap
all-to-all (if EP enabled)                 ...     <- expert-parallel comms
Reading the profile
- fused_moe_kernel at 41%: the experts are the model, economically. Every 1% shaved off this kernel's time is ~0.4% of the whole step, which is why the fused kernel gets CUTLASS/Triton-level attention upstream and why lab-03's tiling arithmetic is load-bearing. It's also why balance matters so much (lab-04): this 41% is the part that inflates under a hot expert.
- moe_align_block_size + permute at ~7%: the bookkeeping tax of the grouped formulation (lab-01's argsort, tile-aligned). Visible, real, and worth exactly as much optimization effort as 7% justifies, which is some, not much. When a PR claims big wins from permute cleverness, this number is your calibration.
- gate at 1.3%: the most strategically interesting line. The decision is nearly free while its consequences (the 41% above, and the balance of it) are everything. Cheap decisions with expensive consequences are where you spend design attention, not kernel attention; lab-04 is entirely about this line's downstream effects.
- Attention at 18.5%: for context, in a dense model this and the MLP GEMMs would be the whole story (Phases 4 and 3 of your attention). MoE adds the expert economy on top; it doesn't replace the transformer's costs.
- The missing line: all-to-all — single-GPU here, so no EP communication. On a multi-node DeepSeek-scale deployment this line appears and can rival the GEMM itself — Phase 10's territory, lab-04's placement problem made physical.
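The "how much is this line worth?" arithmetic used in the readings above is Amdahl's law applied to one row of the table. A minimal sketch, using the shares from the captured table; the function names are made up for this example:

```python
def step_time_saved(share, kernel_reduction):
    """Fraction of whole-step time saved when a kernel that owns `share` of the
    step has its own time cut by `kernel_reduction` (both fractions in [0, 1])."""
    return share * kernel_reduction


def step_speedup(share, kernel_speedup):
    """Amdahl's law: whole-step speedup when the kernel owning `share` of the
    step runs `kernel_speedup` times faster and everything else is unchanged."""
    return 1.0 / ((1.0 - share) + share / kernel_speedup)


# Shares taken from the captured table above.
rows = [("fused_moe_kernel", 0.412), ("align/permute", 0.068), ("gate", 0.013)]
for name, share in rows:
    saved = step_time_saved(share, 0.01)  # shave 1% off this kernel
    doubled = step_speedup(share, 2.0)    # make this kernel 2x faster
    print(f"{name:18s} 1% off -> {100 * saved:.3f}% of step, 2x kernel -> {doubled:.3f}x step")
```

The ceiling matters as much as the slope: even an infinitely fast permute cannot buy more than 1 / (1 - 0.068) of step speedup, which is the calibration for any "permute cleverness" claim.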
Hitchhiker's notes
- Percentages lie across regimes. This is a decode-heavy profile at modest batch. A prefill-heavy run shifts share toward attention (longer sequences — Phase 4 lab-03's quadratic); a bigger batch shifts toward GEMMs and improves their efficiency (lab-03's reuse). Always note the workload a profile was taken under — a profile without its workload is a number without units.
- Kernel names drift across versions (Triton autogenerated names especially). Anchor on the structure: one big grouped GEMM, one alignment/permute pass, one tiny gate. Those three will exist under any naming in any version.
- ProfilerActivity.CUDA measures GPU time; add CPU and compare totals. If CPU time ≫ CUDA time at small batch, you're launch-bound and Phase 5's graphs are the fix, not kernel work. The profiler answers the "which regime am I in?" question from Phase 0 lab-04 empirically.
- vLLM also ships its own profiling hooks (VLLM_TORCH_PROFILER_DIR for trace-on-demand against a running server): same data, production-shaped collection. Phase 18 uses them; this lab's inline version is the minimal form.
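The CPU-versus-GPU comparison from the notes above can be sketched as follows. This assumes the profile was captured with both ProfilerActivity.CPU and ProfilerActivity.CUDA. The 2.0 ratio threshold is an arbitrary illustrative value, not a rule from this lab, and the fake events stand in for prof.key_averages() so the sketch runs without a GPU.

```python
from types import SimpleNamespace


def device_time_us(event):
    """Self GPU time of one profiler event, in microseconds.
    Recent PyTorch names it self_device_time_total; older releases use
    self_cuda_time_total, so try both."""
    for attr in ("self_device_time_total", "self_cuda_time_total"):
        value = getattr(event, attr, None)
        if value is not None:
            return value
    return 0


def cpu_gpu_totals(events):
    """Sum self CPU time and self GPU time over all events."""
    cpu_us = sum(e.self_cpu_time_total for e in events)
    gpu_us = sum(device_time_us(e) for e in events)
    return cpu_us, gpu_us


def looks_launch_bound(cpu_us, gpu_us, ratio_threshold=2.0):
    """Heuristic: CPU time far above GPU time suggests launch overhead dominates.
    The default threshold is an assumption chosen for illustration."""
    if gpu_us <= 0:
        return cpu_us > 0
    return cpu_us / gpu_us >= ratio_threshold


# Stand-in events with made-up numbers; in a real run pass prof.key_averages().
fake_events = [
    SimpleNamespace(self_cpu_time_total=900, self_device_time_total=100),
    SimpleNamespace(self_cpu_time_total=100, self_cuda_time_total=300),
]
cpu_us, gpu_us = cpu_gpu_totals(fake_events)
print(cpu_us, gpu_us, looks_launch_bound(cpu_us, gpu_us))
```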
Reflect
- Predict-then-check: how would this table change for (a) batch 64 instead of 8, (b) a 4k-token prefill, (c) the same model dense-ified (experts merged)? Each answer is one of labs 03/04 / Phase 4 applied.
- The gate is 1.3% of time but determines the balance of the 41%. Where, concretely, would you add instrumentation to catch a balance regression in production? (Routing histogram per window, lab-04's expert_loads, exported as a metric; the profiler is for diagnosis, histograms are for monitoring.)
- If moe_align_block_size grew to 20% of the step after a model swap, what changed? (More experts and/or smaller per-expert blocks: the permute is per-assignment, the GEMMs amortize per expert block; small-expert MoEs pay relatively more bookkeeping.)
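The per-window routing histogram mentioned in the second question can be sketched in plain Python. This is a hypothetical stand-in, not lab-04's expert_loads: the function names, the input shape (one list of top-k expert ids per token), and the max-over-mean definition of imbalance are all assumptions made for this example.

```python
from collections import Counter


def expert_load_histogram(topk_ids, num_experts):
    """Count assignments per expert over one window.
    topk_ids: one list of chosen expert ids per token (assumed input shape)."""
    counts = Counter(expert for token in topk_ids for expert in token)
    return [counts.get(expert, 0) for expert in range(num_experts)]


def imbalance_factor(loads):
    """Max load divided by mean load; 1.0 means perfectly balanced.
    This definition is an assumption and may differ from lab-04's."""
    total = sum(loads)
    if total == 0:
        return 0.0
    return max(loads) / (total / len(loads))


# Four tokens, top-2 routing, four experts; expert 0 is the hot one.
window = [[0, 1], [0, 2], [0, 3], [1, 0]]
loads = expert_load_histogram(window, num_experts=4)
print(loads, imbalance_factor(loads))
```

In production the loads list and the factor would be exported as metrics once per window, so a balance regression shows up on a dashboard before anyone needs to open a profile.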
References
- upstream/vllm/model_executor/layers/fused_moe/fused_moe.py: the kernels whose names you just learned to find in a table.
- PyTorch docs, torch.profiler (the instrument): https://pytorch.org/docs/stable/profiler.html
- vLLM docs, Profiling (server-shaped trace collection): https://docs.vllm.ai/en/latest/contributing/profiling/
- Labs 01 / 03 / 04: the three models this profile validates (grouped formulation, tiling economics, balance tax).
- Phase 18: profiling as a discipline (timelines, nsys, regression hunting).