Lab 07-04 — Expert Load Balance: the MoE Serving Problem [CPU-OK]

Lab-01 built the MoE forward and treated the router's decisions as given. This lab asks the operator's question: what do those decisions cost? A mixture-of-experts model's economics rest on a promise — E experts' worth of capacity for k experts' worth of compute per token — and that promise has fine print: it holds only if tokens spread evenly. Real routers don't spread evenly. You'll build the three numbers that quantify the damage: per-expert loads, the imbalance factor, and — the one that costs money — EP step time, which with experts sharded across devices equals the busiest device's load, not the average. Same total work, one hot expert, >2.5× the step time: you'll prove it in an assert.

Why this lab exists

MoE models (Mixtral, DeepSeek-V3, Qwen-MoE, the frontier generally) are taking over serving fleets, and their performance pathologies are distributional, not computational: nothing crashes, no kernel is slow — the work just lands unevenly and silicon idles in the gaps. An engineer staring at "MoE deployment at 40% of expected throughput" needs exactly the diagnostics you're building: dump the routing histogram, compute imbalance, map experts to devices, find the hot device. The lab's deliberately crafted "hot expert" router (60% of assignments to one expert) is not a strawman — it's the documented failure shape of undertrained gates, domain-shifted traffic (code-heavy prompts lighting up code-ish experts), and repetitive workloads.

The second reason: capacity factors. Training-era MoE systems used fixed per-expert buffers and dropped overflow tokens — fine for training (a dropped token is a slightly noisier gradient), catastrophic for inference (a dropped token is a corrupted generation). Understanding dropped_tokens is understanding why inference MoE never drops — and what it pays instead (dynamic buffers, the full imbalance tax landing on latency). That design fork explains a lot of otherwise-puzzling differences between training and serving MoE stacks.
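To make the capacity-factor trade concrete, here is a minimal sketch of overflow accounting. It assumes a fixed per-expert capacity of ceil(cf × assignments / num_experts), which is one common convention and not necessarily the one starter.py uses. The function name and the toy numbers are hypothetical.

```python
import math
from collections import Counter


def dropped_tokens_sketch(assignments, num_experts, capacity_factor):
    """Count assignments that overflow a fixed per-expert buffer.

    Assumption: capacity = ceil(capacity_factor * len(assignments) / num_experts).
    `assignments` is a flat list of expert ids, one per (token, expert) pair.
    """
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    loads = Counter(assignments)
    # Only the part of each expert's load above capacity is dropped.
    return sum(max(0, loads[e] - capacity) for e in range(num_experts))


# Toy data: 40 assignments over 4 experts, expert 0 is hot with 25 of them.
assignments = [0] * 25 + [1] * 5 + [2] * 5 + [3] * 5

drops_cf1 = dropped_tokens_sketch(assignments, num_experts=4, capacity_factor=1.0)
drops_cf3 = dropped_tokens_sketch(assignments, num_experts=4, capacity_factor=3.0)
print(drops_cf1, drops_cf3)  # 15 0
```

At cf=1.0 every expert gets a buffer of 10, so the hot expert sheds 15 assignments while the cold experts leave half their buffers empty. Raising cf until nothing drops is the sizing a never-drop inference stack is effectively forced into.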

Background: why imbalance is a tax on parallelism

With expert parallelism (EP), experts live on different devices and every step runs all-to-all: tokens ship to their experts' devices, compute happens, results ship back. The step completes when the last device finishes — a barrier. So step time is max(device loads), while useful work is sum(loads). Parallel efficiency is their ratio scaled by device count:

efficiency = sum(loads) / (num_devices × max(device_load))

Perfectly balanced: efficiency 1. One expert carrying 60% of assignments on an 8-device layout: the hot device defines the step while seven others idle most of it — your test_imbalance_burns_parallel_efficiency measures >2.5× step inflation at identical total work. Note what this is, structurally: a straggler problem, the same shape as Phase 3 lab-05's prefill spike (one slow element holds the barrier) and the tail-at-scale phenomenon generally. Distribution problems all rhyme.
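The barrier arithmetic above can be sketched in a few lines. This is an illustration under stated assumptions, not the reference solution: the helper names are hypothetical, placement is the round-robin e % num_devices mapping, and the load vectors are made-up numbers chosen so both sum to 800.

```python
def device_loads(loads, num_devices):
    """Sum expert loads per device under round-robin placement (expert e -> device e % num_devices)."""
    per_device = [0] * num_devices
    for e, load in enumerate(loads):
        per_device[e % num_devices] += load
    return per_device


def step_time_sketch(loads, num_devices):
    """The step ends when the busiest device finishes, so step time is the max device load."""
    return max(device_loads(loads, num_devices))


def efficiency(loads, num_devices):
    """Useful work divided by the device-time actually reserved for the step."""
    return sum(loads) / (num_devices * step_time_sketch(loads, num_devices))


balanced = [100] * 8                          # 800 assignments, evenly spread
hot = [480, 46, 46, 46, 46, 46, 45, 45]       # same 800, 60% on expert 0

print(step_time_sketch(balanced, 8), efficiency(balanced, 8))  # 100 1.0
print(step_time_sketch(hot, 8))                                # 480
print(round(efficiency(hot, 8), 3))                            # 0.208
print(step_time_sketch(hot, 4))                                # 526
```

The last line is the co-location case: with 4 devices, experts 0 and 4 share device 0, so their loads add (480 + 46) and the step gets longer still.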

The mitigation toolbox (each one is a knob in real systems): auxiliary load-balancing losses at training time (bake balance into the router), expert placement (don't put two historically-hot experts on one device — your e % num_devices round-robin is the naive baseline placement), redundant replicas of hot experts (vLLM's EPLB — expert parallel load balancer — does exactly this), and shared experts (DeepSeek's always-active expert absorbs the common patterns so the routed ones stay balanced).

Files

  • starter.py — expert_loads, imbalance, ep_step_time, dropped_tokens. Your work.
  • solution.py — reference.
  • test_lab.py — counting, the uniform baseline, the hot-expert blowup, max-device step time, the efficiency burn, and capacity-overflow accounting.

Run

LAB_IMPL=starter pytest phase-07-gemm-and-moe-kernels/labs/lab-04-expert-load-balance -q
pytest phase-07-gemm-and-moe-kernels/labs/lab-04-expert-load-balance -q   # reference

What the tests prove

| Test | What it pins |
|---|---|
| test_loads_count_assignments | The histogram itself — note it counts (token, expert) assignments, so a token routed to expert 1 twice in top-k counts twice (it really does cost two expert-rows of compute) |
| test_uniform_routing_is_nearly_balanced | The healthy baseline: random routing lands imbalance < 1.3 — your reference point for "what good looks like" on a histogram |
| test_hot_expert_blows_up_imbalance | The pathology: 60% to one expert → imbalance > 3 (~5× its fair share) |
| test_ep_step_time_is_the_max_device | The barrier semantics, including the subtle case: with fewer devices than experts, co-located experts' loads add — placement matters, not just routing |
| test_imbalance_burns_parallel_efficiency | The money assert: identical sum(loads), >2.5× the step time. Throughput lost to distribution, with zero slow code anywhere |
| test_capacity_drops_only_the_overflow / _cf1 | Capacity-factor mechanics: cf=1.0 with a hot expert drops 68 of 128 assignments; cf high enough drops none. The training-era trade inference refuses |
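The first three rows come down to a histogram and one ratio. A minimal sketch follows, assuming imbalance is defined as max load over mean load (the "fair share" reading used above). The helper names and the tiny routing tables are hypothetical, and the real starter.py signatures may differ.

```python
def loads_from_topk(topk_ids, num_experts):
    """Count (token, expert) assignments: one increment per entry in each token's top-k list."""
    loads = [0] * num_experts
    for token_experts in topk_ids:
        for e in token_experts:
            loads[e] += 1
    return loads


def imbalance_ratio(loads):
    """Max load over mean load; 1.0 means perfectly balanced.

    Assumes at least one assignment, so the mean is nonzero.
    """
    mean = sum(loads) / len(loads)
    return max(loads) / mean


# Four tokens, k = 2, four experts. Every token picks expert 0 first.
skewed_topk = [[0, 1], [0, 2], [0, 3], [0, 1]]
even_topk = [[0, 1], [2, 3], [0, 1], [2, 3]]

skewed_loads = loads_from_topk(skewed_topk, num_experts=4)
even_loads = loads_from_topk(even_topk, num_experts=4)
print(skewed_loads, imbalance_ratio(skewed_loads))  # [4, 2, 1, 1] 2.0
print(even_loads, imbalance_ratio(even_loads))      # [2, 2, 2, 2] 1.0
```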

Hitchhiker's notes

  • Why does top-k routing make this worse than it looks? Each token takes k experts, so a hot expert co-occurs with others — you can't fix a hot expert by moving its tokens without also touching their second choices. Balance is a joint property of the whole routing matrix, which is why post-hoc fixes (placement, replication) are often easier than router surgery.
  • EPLB in one sentence: measure loads over a window, then replicate hot experts across devices and route their traffic round-robin among replicas — trading memory (extra expert copies) for balance. Find it upstream (vllm/distributed/eplb/); your ep_step_time is the objective function it's minimizing.
  • The all-to-all itself (not modeled here) adds a second imbalance-sensitive cost: hot experts concentrate network traffic too. On NVLink-rich nodes it's tolerable; across nodes it's why DeepSeek-V3's deployment papers obsess over expert placement. Phase 10 picks this up.
  • Decode vs prefill see different distributions. A prefill batch routes thousands of tokens (law of large numbers smooths loads); a decode batch routes batch × k assignments — at batch 8, k 2, that's 16 samples over maybe 64 experts: structurally lumpy even with a perfect router. Small-batch MoE decode is imbalanced by arithmetic, not pathology — one more reason MoE wants large serving batches (and why test_uniform_routing_is_nearly_balanced uses 1600 assignments, not 16).
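The decode-versus-prefill point in the last note is easy to check by simulation. The sketch below is a simplification: it draws each assignment independently and uniformly, whereas a real top-k router picks k distinct experts per token. The function name and the seed are arbitrary choices for illustration.

```python
import random


def uniform_imbalance(num_assignments, num_experts, rng):
    """Route assignments uniformly at random and return max load over mean load."""
    loads = [0] * num_experts
    for _ in range(num_assignments):
        loads[rng.randrange(num_experts)] += 1
    return max(loads) / (sum(loads) / num_experts)


rng = random.Random(0)  # fixed seed so reruns give the same numbers

# Decode-like: batch 8, k 2 -> 16 assignments over 64 experts.
decode_like = uniform_imbalance(16, 64, rng)
# Prefill-like: thousands of assignments over the same 64 experts.
prefill_like = uniform_imbalance(16000, 64, rng)

print(f"decode-like imbalance:  {decode_like:.2f}")
print(f"prefill-like imbalance: {prefill_like:.2f}")
```

The decode-like figure cannot come out below 4: with 16 assignments over 64 experts the mean load is 0.25 and the busiest expert holds at least one assignment. That floor is pure arithmetic and holds for any router, however well trained.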

Going further

  • Implement best_placement(loads, num_devices) — greedy bin-packing (sort descending, assign to lightest device) — and measure how much it recovers vs round-robin on the skewed router. Then add replication: one extra copy of the hottest expert, traffic split — compare. You've now built EPLB's two levers.
  • Sweep batch size from 4 to 4096 with the uniform router and plot imbalance: watch the small-batch lumpiness (the excess of imbalance over 1) decay roughly as 1/√batch once each expert sees many assignments. This curve is why MoE throughput benchmarks at tiny batch are misleading.
  • Add a shared-expert column (every token also visits expert S, DeepSeek-style) and check what it does to imbalance among the routed experts when the router is hot. (Hint: it doesn't fix routing — it shrinks the stakes per routed token.)
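As a starting point for the first exercise, here is one possible shape for the two levers. It is a sketch under assumptions and not vLLM's EPLB algorithm: the helper names are hypothetical, greedy placement is the classic sort-descending, lightest-device-first heuristic, and replication is modelled by splitting the hottest expert's load evenly into two placeable halves.

```python
import heapq


def round_robin_device_loads(loads, num_devices):
    """Baseline placement: expert e lives on device e % num_devices."""
    per_device = [0] * num_devices
    for e, load in enumerate(loads):
        per_device[e % num_devices] += load
    return per_device


def greedy_device_loads(loads, num_devices):
    """Place the heaviest remaining expert on the currently lightest device."""
    heap = [(0, d) for d in range(num_devices)]  # (device load so far, device id)
    heapq.heapify(heap)
    for load in sorted(loads, reverse=True):
        total, d = heapq.heappop(heap)
        heapq.heappush(heap, (total + load, d))
    per_device = [0] * num_devices
    for total, d in heap:
        per_device[d] = total
    return per_device


def split_hottest(loads):
    """Model one extra replica: the hottest expert's traffic is split into two halves."""
    hot = max(range(len(loads)), key=lambda e: loads[e])
    half = loads[hot] // 2
    return loads[:hot] + [half, loads[hot] - half] + loads[hot + 1:]


loads = [480, 46, 46, 46, 46, 46, 45, 45]  # made-up skewed loads, 8 experts
rr = max(round_robin_device_loads(loads, 4))
greedy = max(greedy_device_loads(loads, 4))
replicated = max(greedy_device_loads(split_hottest(loads), 4))
print(rr, greedy, replicated)  # 526 480 240
```

Placement alone stalls at 480, because no layout can make a device finish faster than its single hottest expert. Only the replica breaks that floor. One caveat of this toy: nothing stops the greedy pass from putting both halves on the same device in other load patterns, so a fuller version would forbid that.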

References

  • Fedus et al., Switch Transformers (2021) — capacity factor, token dropping, aux losses: the training-era trade space: https://arxiv.org/abs/2101.03961
  • DeepSeek-AI, DeepSeek-V3 Technical Report (2024) — shared experts, auxiliary-loss-free balancing, and deployment-grade EP placement: https://arxiv.org/abs/2412.19437
  • upstream/vllm/distributed/eplb/ — vLLM's expert-parallel load balancer; your ep_step_time is its loss function.
  • upstream/vllm/model_executor/layers/fused_moe/ — where loads meet kernels (moe_align_block_size and friends, lab-01's mapping).
  • Dean & Barroso, The Tail at Scale (2013) — the straggler pattern this is an instance of: https://research.google/pubs/the-tail-at-scale/