Lab 07-02 — Profile the Fused MoE Kernel `[GPU-OPT]`

You've built the MoE forward (lab-01), the tiling that makes its GEMMs fast (lab-03), and the balance diagnostics (lab-04). This lab closes the loop with the instrument that tells you which of those matters on your model, on your hardware, right now: the profiler. You'll capture a few decode steps of a real MoE model under torch.profiler and read the kernel-level time breakdown. In the captured run below, the grouped expert GEMM takes about 41% of CUDA time, the router's gate about 1%, and the permute machinery is visible but minor at about 7%. That breakdown is the empirical ground truth that every MoE optimization argument has to answer to.

No GPU? Don't panic. The captured profile below is annotated line by line against labs 01/03/04 — the reading skill transfers intact.

Why this lab exists
Requirements
Steps
Captured output (real run, small MoE, L4, vLLM 0.22.1, trimmed)
Reading the profile
Hitchhiker's notes
Reflect
References

Why this lab exists

Profiling is the difference between optimizing and gesturing. Every phase so far has handed you models of where time goes (roofline, launch counts, traffic formulas, imbalance factors); the profiler is how you check a model against a machine — and the discipline of "predict the breakdown, then look" is what makes profiles informative instead of just colorful. Before running this lab, write down your guesses: what share for the experts? for attention? for the router? The gaps between your guesses and the table below are precisely your remaining misconceptions about MoE — that's the lab.

The kernel-table-reading skill is also the universal entry point to Phase 18 (where profiling becomes systematic, with nsys/ncu and timeline views). A key_averages() table sorted by CUDA time is the 80/20 of GPU performance work: ten seconds of looking tells you which subsystem owns the milliseconds, which is the only question that decides where engineering effort goes.

Requirements

uv pip install -e ".[vllm]"
# a small MoE checkpoint, e.g. Qwen1.5-MoE-A2.7B or any 0.5–3B-activated MoE on the Hub

Steps

import torch
from vllm import LLM, SamplingParams

llm = LLM(model="<a small MoE model>", gpu_memory_utilization=0.6, max_model_len=1024)
llm.generate(["warmup"] * 4, SamplingParams(max_tokens=8))   # warm up: capture, caches, autotune

# Record GPU kernels only, over a short greedy decode of 8 identical prompts.
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
    llm.generate(["Explain MoE in one line:"] * 8,
                 SamplingParams(max_tokens=32, temperature=0))
# Kernel table, largest total CUDA time first.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))

Warm up first. Profiling a cold engine records CUDA-graph capture, compilation, and autotuning, which drowns the steady state you actually care about; it is a common way to invalidate a first profile. Then find the fused MoE / grouped-GEMM kernels, the align/permute ops, attention, and the router's gate matmul, and compute each one's share: its CUDA time divided by the total CUDA time of the profiled region.
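One way to compute the shares is to bucket kernel names by substring and divide each bucket's time by the total. The sketch here is a minimal version under stated assumptions: the BUCKETS patterns and the kernel_shares helper are hypothetical names chosen for illustration, the input is a plain list of (kernel name, CUDA time) pairs that you would pull out of prof.key_averages() yourself, and the sample rows simply reuse the percentages from the captured table as stand-in times, with the trimmed remainder added so they sum to 100.

```python
from collections import defaultdict

# Hypothetical substring -> bucket mapping. Kernel names drift across versions,
# so adjust the patterns to whatever your table actually prints. Short patterns
# such as "gate" can over-match; tighten them for real use.
BUCKETS = [
    ("moe_align", "align/permute"),
    ("permute", "align/permute"),
    ("fused_moe", "expert GEMM"),
    ("flash", "attention"),
    ("gate", "router"),
]


def kernel_shares(rows):
    """Return {bucket: share of total CUDA time} for (name, cuda_time) rows."""
    totals = defaultdict(float)
    for name, cuda_time in rows:
        lowered = name.lower()
        # First matching pattern wins; anything unmatched lands in "other".
        bucket = next((b for pat, b in BUCKETS if pat in lowered), "other")
        totals[bucket] += cuda_time
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {bucket: t / grand_total for bucket, t in totals.items()}


# Stand-in rows: the captured table's percentages reused as times.
sample = [
    ("fused_moe_kernel", 41.2),
    ("moe_align_block_size", 6.8),
    ("flash_attn", 18.5),
    ("rms_norm", 9.0),
    ("gate", 1.3),
    ("trimmed_rows_remainder", 23.2),  # 100 minus the rows listed in the table
]
shares = kernel_shares(sample)
for bucket, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{bucket:15s} {share:6.1%}")
```

When feeding this from a real profile, note that the attribute holding each event's GPU time has been renamed across PyTorch releases, so check the event fields in your installed version rather than trusting a name from memory.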

Captured output (real run, small MoE, L4, vLLM 0.22.1, trimmed)

Name                                   CUDA time %
fused_moe_kernel (grouped expert GEMM)    41.2%      <- the experts dominate
moe_align_block_size / permute             6.8%      <- the sort/permute (your argsort)
flash_attn (attention)                    18.5%
rms_norm / residual / misc                 9.0%
gate (router linear)                       1.3%      <- routing is cheap
all-to-all (if EP enabled)                 ...       <- expert-parallel comms

Reading the profile

fused_moe_kernel at 41%: the experts are the model, economically. Each 1% cut from this kernel's time saves about 0.4% of the whole step, which is why the fused kernel gets CUTLASS/Triton-level attention upstream and why lab-03's tiling arithmetic is load-bearing. It's also why balance matters so much (lab-04): this 41% is the part that inflates under a hot expert.
moe_align_block_size + permute at ~7%: the bookkeeping tax of the grouped formulation (lab-01's argsort, tile-aligned). Visible, real, and worth exactly as much optimization effort as 7% justifies, which is some, not much. When a PR claims big wins from permute cleverness, this number is your calibration.
gate at 1.3%: the most strategically interesting line. The decision is nearly free while its consequences (the 41% above, and the balance of it) are everything. Cheap decisions with expensive consequences are where you spend design attention, not kernel attention; lab-04 is entirely about this line's downstream effects.
Attention at 18.5%: for context, in a dense model this and the MLP GEMMs would be the whole story (Phases 4 and 3, respectively). MoE adds the expert economy on top; it doesn't replace the transformer's costs.
The missing line, all-to-all: single-GPU here, so no EP communication. On a multi-node DeepSeek-scale deployment this line appears and can rival the GEMM itself. That is Phase 10's territory: lab-04's placement problem made physical.
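The "how much is this line worth" reasoning in the readings is Amdahl's-law arithmetic, and it is short enough to write down. The sketch here uses the shares from the captured table; the function names are hypothetical, and the bounds assume every other kernel's time stays fixed, which ignores overlap and second-order effects.

```python
def step_time_saved(share, kernel_time_cut):
    """Fraction of the whole step saved when a kernel holding `share` of the
    step has its own time cut by `kernel_time_cut` (both given as fractions)."""
    return share * kernel_time_cut


def max_step_speedup(share):
    """Upper bound on whole-step speedup if the kernel's time dropped to zero."""
    return 1.0 / (1.0 - share)


# Shares taken from the captured table.
EXPERT_GEMM, ALIGN_PERMUTE, GATE = 0.412, 0.068, 0.013

print(f"1% off the expert GEMM saves {step_time_saved(EXPERT_GEMM, 0.01):.2%} of the step")
print(f"free experts -> at most {max_step_speedup(EXPERT_GEMM):.3f}x")
print(f"free permute -> at most {max_step_speedup(ALIGN_PERMUTE):.3f}x")
print(f"free gate    -> at most {max_step_speedup(GATE):.3f}x")
```

The permute bound comes out near 1.07x, which is the calibration to hold against any PR that claims more from permute work alone on a workload like this one.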

Hitchhiker's notes

Percentages lie across regimes. This is a decode-heavy profile at modest batch. A prefill-heavy run shifts share toward attention (longer sequences — Phase 4 lab-03's quadratic); a bigger batch shifts toward GEMMs and improves their efficiency (lab-03's reuse). Always note the workload a profile was taken under — a profile without its workload is a number without units.
Kernel names drift across versions (Triton autogenerated names especially). Anchor on the structure: one big grouped GEMM, one alignment/permute pass, one tiny gate. Those three will exist under any naming in any version.
ProfilerActivity.CUDA measures GPU time only. Add ProfilerActivity.CPU to the activities list and compare totals: if CPU time ≫ CUDA time at small batch, you're likely launch-bound, and Phase 5's graphs are the fix, not kernel work. The profiler answers the "which regime am I in?" question from Phase 0 lab-04 empirically.
vLLM also ships its own profiling hooks (VLLM_TORCH_PROFILER_DIR for trace-on-demand against a running server) — same data, production-shaped collection. Phase 18 uses them; this lab's inline version is the minimal form.
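Two of these notes can be folded into one small habit: store every profile's totals together with the workload it was taken under, and derive the regime from the CPU-to-CUDA ratio. This is a sketch, not a calibrated tool. ProfileSummary, its fields, and the launch_bound_ratio threshold of 2.0 are all assumptions made for illustration, and the numbers in the usage lines are invented placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileSummary:
    """Totals from one profile plus the workload it was taken under."""
    workload: str          # e.g. "decode, batch 8, 32 new tokens"
    cpu_total_us: float    # summed CPU-side time reported by the profiler
    cuda_total_us: float   # summed GPU kernel time reported by the profiler

    def cpu_to_cuda_ratio(self):
        if self.cuda_total_us <= 0:
            raise ValueError("no CUDA time recorded; was the GPU activity enabled?")
        return self.cpu_total_us / self.cuda_total_us

    def regime(self, launch_bound_ratio=2.0):
        # launch_bound_ratio is an arbitrary illustrative threshold:
        # "much greater than" needs a number before it can be code.
        if self.cpu_to_cuda_ratio() >= launch_bound_ratio:
            return "likely launch-bound"
        return "likely GPU-bound"


# Placeholder totals for illustration only.
small_batch = ProfileSummary("decode, batch 1", cpu_total_us=9000.0, cuda_total_us=3000.0)
large_batch = ProfileSummary("decode, batch 64", cpu_total_us=9000.0, cuda_total_us=12000.0)
for summary in (small_batch, large_batch):
    print(summary.workload, "->", summary.regime())
```

Keeping the workload string on the record is the point: a ratio without its batch size and sequence length cannot be compared with another run.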

Reflect

Predict-then-check: how would this table change for (a) batch 64 instead of 8, (b) a 4k-token prefill, (c) the same model dense-ified (experts merged)? Each answer applies one of lab-03, lab-04, or Phase 4.
The gate is 1.3% of time but determines the balance of the 41%. Where, concretely, would you add instrumentation to catch a balance regression in production? (Routing histogram per window — lab-04's expert_loads — exported as a metric; the profiler is for diagnosis, histograms are for monitoring.)
If moe_align_block_size grew to 20% of the step after a model swap, what changed? (More experts and/or smaller per-expert blocks — the permute is per-assignment, the GEMMs amortize per expert block; small-expert MoEs pay relatively more bookkeeping.)
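The routing histogram mentioned in the second question can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical and are not lab-04's expert_loads implementation, the imbalance factor is taken here to be max load over mean load, and the window of routing decisions is a toy example.

```python
from collections import Counter


def expert_load_histogram(topk_ids, num_experts):
    """Count token-to-expert assignments in one monitoring window.

    topk_ids: one list of chosen expert ids per token (top-k routing).
    """
    counts = Counter(e for token in topk_ids for e in token)
    return [counts.get(e, 0) for e in range(num_experts)]


def imbalance_factor(loads):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    total = sum(loads)
    if total == 0:
        return 1.0
    return max(loads) / (total / len(loads))


# Toy window: 4 tokens, top-2 routing, 4 experts, expert 0 is hot.
window = [[0, 1], [0, 2], [0, 3], [1, 0]]
loads = expert_load_histogram(window, num_experts=4)
print("loads:", loads, "imbalance:", imbalance_factor(loads))
```

Exported per window as a metric, a number like this can alert on a balance regression long before anyone opens a profiler, which then serves for diagnosis.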

References

upstream/vllm/model_executor/layers/fused_moe/fused_moe.py — the kernels whose names you just learned to find in a table.
PyTorch docs, torch.profiler — the instrument: https://pytorch.org/docs/stable/profiler.html
vLLM docs, Profiling — server-shaped trace collection: https://docs.vllm.ai/en/latest/contributing/profiling/
Labs 01 / 03 / 04 — the three models this profile validates (grouped formulation, tiling economics, balance tax).
Phase 18 — profiling as a discipline: timelines, nsys, regression hunting.

vLLM Mastery — From Zero to Maintainer