Lab 07-04 — Expert Load Balance: the MoE Serving Problem [CPU-OK]
Lab-01 built the MoE forward and treated the router's decisions as given. This lab asks the operator's question: what do those decisions cost? A mixture-of-experts model's economics rest on a promise — E experts' worth of capacity for k experts' worth of compute per token — and that promise has fine print: it holds only if tokens spread evenly. Real routers don't spread evenly. You'll build the three numbers that quantify the damage: per-expert loads, the imbalance factor, and — the one that costs money — EP step time, which with experts sharded across devices equals the busiest device's load, not the average. Same total work, one hot expert, >2.5× the step time: you'll prove it in an assert.
Contents
- Why this lab exists
- Background: why imbalance is a tax on parallelism
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
MoE models (Mixtral, DeepSeek-V3, Qwen-MoE, the frontier generally) are taking over
serving fleets, and their performance pathologies are distributional, not
computational: nothing crashes, no kernel is slow — the work just lands unevenly and
silicon idles in the gaps. An engineer staring at "MoE deployment at 40% of expected
throughput" needs exactly the diagnostics you're building: dump the routing histogram,
compute imbalance, map experts to devices, find the hot device. The lab's deliberately
crafted "hot expert" router (60% of assignments to one expert) is not a strawman — it's
the documented failure shape of undertrained gates, domain-shifted traffic
(code-heavy prompts lighting up code-ish experts), and repetitive workloads.
The second reason: capacity factors. Training-era MoE systems used fixed per-expert
buffers and dropped overflow tokens — fine for training (a dropped token is a slightly
noisier gradient), catastrophic for inference (a dropped token is a corrupted
generation). Understanding dropped_tokens is understanding why inference MoE never
drops — and what it pays instead (dynamic buffers, the full imbalance tax landing on
latency). That design fork explains a lot of otherwise-puzzling differences between
training and serving MoE stacks.
Background: why imbalance is a tax on parallelism
With expert parallelism (EP), experts live on different devices and every step runs
all-to-all: tokens ship to their experts' devices, compute happens, results ship
back. The step completes when the last device finishes — a barrier. So step time is
max(device loads), while useful work is sum(loads). Parallel efficiency is their
ratio scaled by device count:
efficiency = sum(loads) / (num_devices × max(device_load))
Perfectly balanced: efficiency 1. One expert carrying 60% of assignments on an 8-device
layout: the hot device defines the step while seven others idle most of it — your
test_imbalance_burns_parallel_efficiency measures >2.5× step inflation at identical
total work. Note what this is, structurally: a straggler problem, the same shape as
Phase 3 lab-05's prefill spike (one slow element holds the barrier) and the tail-at-scale
phenomenon generally. Distribution problems all rhyme.
The mitigation toolbox (each one is a knob in real systems): auxiliary load-balancing
losses at training time (bake balance into the router), expert placement (don't put
two historically-hot experts on one device — your e % num_devices round-robin is the
naive baseline placement), redundant replicas of hot experts (vLLM's EPLB — expert
parallel load balancer — does exactly this), and shared experts (DeepSeek's
always-active expert absorbs the common patterns so the routed ones stay balanced).
Files
starter.py—expert_loads,imbalance,ep_step_time,dropped_tokens. Your work.solution.py— reference.test_lab.py— counting, the uniform baseline, the hot-expert blowup, max-device step time, the efficiency burn, and capacity-overflow accounting.
Run
LAB_IMPL=starter pytest phase-07-gemm-and-moe-kernels/labs/lab-04-expert-load-balance -q
pytest phase-07-gemm-and-moe-kernels/labs/lab-04-expert-load-balance -q # reference
What the tests prove
| Test | What it pins |
|---|---|
test_loads_count_assignments | The histogram itself — note it counts (token, expert) assignments, so a token routed to expert 1 twice in top-k counts twice (it really does cost two expert-rows of compute) |
test_uniform_routing_is_nearly_balanced | The healthy baseline: random routing lands imbalance < 1.3 — your reference point for "what good looks like" on a histogram |
test_hot_expert_blows_up_imbalance | The pathology: 60% to one expert → imbalance > 3 (~5× its fair share) |
test_ep_step_time_is_the_max_device | The barrier semantics, including the subtle case: with fewer devices than experts, co-located experts' loads add — placement matters, not just routing |
test_imbalance_burns_parallel_efficiency | The money assert: identical sum(loads), >2.5× the step time. Throughput lost to distribution, with zero slow code anywhere |
test_capacity_drops_only_the_overflow / _cf1 | Capacity-factor mechanics: cf=1.0 with a hot expert drops 68 of 128 assignments; cf high enough drops none. The training-era trade inference refuses |
Hitchhiker's notes
- Why does top-k routing make this worse than it looks? Each token takes k experts, so a hot expert co-occurs with others — you can't fix a hot expert by moving its tokens without also touching their second choices. Balance is a joint property of the whole routing matrix, which is why post-hoc fixes (placement, replication) are often easier than router surgery.
- EPLB in one sentence: measure loads over a window, then replicate hot experts
across devices and route their traffic round-robin among replicas — trading memory
(extra expert copies) for balance. Find it upstream (
vllm/distributed/eplb/); yourep_step_timeis the objective function it's minimizing. - The all-to-all itself (not modeled here) adds a second imbalance-sensitive cost: hot experts concentrate network traffic too. On NVLink-rich nodes it's tolerable; across nodes it's why DeepSeek-V3's deployment papers obsess over expert placement. Phase 10 picks this up.
- Decode vs prefill see different distributions. A prefill batch routes thousands of
tokens (law of large numbers smooths loads); a decode batch routes
batch × kassignments — at batch 8, k 2, that's 16 samples over maybe 64 experts: structurally lumpy even with a perfect router. Small-batch MoE decode is imbalanced by arithmetic, not pathology — one more reason MoE wants large serving batches (and whytest_uniform_routing_is_nearly_balanceduses 1600 assignments, not 16).
Going further
- Implement
best_placement(loads, num_devices)— greedy bin-packing (sort descending, assign to lightest device) — and measure how much it recovers vs round-robin on the skewed router. Then add replication: one extra copy of the hottest expert, traffic split — compare. You've now built EPLB's two levers. - Sweep batch size from 4 to 4096 with the uniform router and plot imbalance: watch the small-batch lumpiness decay as 1/√batch. This curve is why MoE throughput benchmarks at tiny batch are misleading.
- Add a shared-expert column (every token also visits expert S, DeepSeek-style) and check what it does to imbalance among the routed experts when the router is hot. (Hint: it doesn't fix routing — it shrinks the stakes per routed token.)
References
- Fedus et al., Switch Transformers (2021) — capacity factor, token dropping, aux losses: the training-era trade space: https://arxiv.org/abs/2101.03961
- DeepSeek-AI, DeepSeek-V3 Technical Report (2024) — shared experts, auxiliary-loss-free balancing, and deployment-grade EP placement: https://arxiv.org/abs/2412.19437
upstream/vllm/distributed/eplb/— vLLM's expert-parallel load balancer; yourep_step_timeis its loss function.upstream/vllm/model_executor/layers/fused_moe/— where loads meet kernels (moe_align_block_sizeand friends, lab-01's mapping).- Dean & Barroso, The Tail at Scale — the straggler pattern this is an instance of: https://research.google/pubs/the-tail-at-scale/