Lab 06-04 — Activation Outliers & the SmoothQuant Migration [CPU-OK]
Labs 01 and 03 quantized weights — tensors you can study offline at your leisure, with any granularity of scales you fancy. This lab quantizes activations, and activations fight back. They exist only at runtime (scales must be cheap — per-tensor or per-token, not per-group), and in real LLMs they carry a famous pathology: a handful of channels run 10–100× louder than the rest, consistently, across all inputs — a trained-in fact of transformer feature geometry, not noise. One per-tensor scale set by the loudest channel crushes everyone else's resolution, and W8A8 accuracy falls off a cliff. You'll reproduce the cliff, then implement the elegant fix from SmoothQuant: you can't delete the outliers, but you can relocate them — migrate magnitude from the hard-to-quantize activations into the easy-to-quantize weights, via a reparametrization that is mathematically a no-op.
Contents
- Why this lab exists
- Background: the outlier problem and the migration
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This lab answers the question lab-02's GPU numbers raise but don't explain: why is
quantization="fp8" (W8A8) a different kind of thing than loading an AWQ checkpoint
(W4A16)? Weight-only quant shrinks bytes and leaves all computation in high precision —
its only failure mode is weight rounding error, which labs 01/03 showed is tame. W8A8
additionally runs the matmul itself in 8-bit (unlocking FP8/INT8 tensor cores — the
throughput jump in lab-02's capture), which means activations must survive quantization
too — and they're the hostile party. Every production decision between "fp8 for speed"
and "AWQ for memory" is downstream of the asymmetry you'll measure here.
The deeper lesson is the shape of SmoothQuant's fix, because you'll reuse it forever:
when a hard constraint can't be removed, look for a reparametrization that moves the
difficulty to where you have better tools. Activations only afford one cheap scale;
weights afford per-channel scales (lab-01) that eat outliers for breakfast. So divide
each activation channel by s_j, multiply the matching weight column by s_j, and the
product is mathematically the same function (identical up to floating-point rounding), but the loudness now lives in the weights,
where per-channel scales neutralize it. No retraining, no approximation in the transform
itself. The only approximation remains the quantization, now applied to friendlier
tensors.
Background: the outlier problem and the migration
Symmetric per-tensor int8: scale = max|X| / 127. With a channel 80× louder than the
rest, the quiet channels — which carry most of the information — get
127 / 80 ≈ 1.6 effective levels. Their contribution to the matmul turns to gravel.
That's the cliff (test_outliers_wreck_naive_w8a8: >5% relative matmul error from one
setup; real perplexity explodes the same way — the LLM.int8() paper documents the
phenomenon at scale).
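A minimal NumPy sketch of the cliff, under stated assumptions: the helper names (`fake_quant_per_tensor`, `rel_error`, `naive_w8a8_error`), the shapes, the seed, and the choice of channels 7 and 100 are illustrative and are not the lab's `starter.py` signatures or test fixtures. Both operands get one symmetric per-tensor int8 scale, and the same matmul is measured with and without two 80× channels.

```python
import numpy as np

def fake_quant_per_tensor(t, n_bits=8):
    """Quantize-dequantize with a single symmetric scale for the whole tensor."""
    qmax = 2 ** (n_bits - 1) - 1              # 127 for int8
    scale = np.abs(t).max() / qmax            # set by the loudest element
    q = np.clip(np.round(t / scale), -qmax, qmax)
    return q * scale

def rel_error(approx, exact):
    """Relative Frobenius-norm error of a matmul result."""
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

def naive_w8a8_error(X, W):
    """Error of X @ W.T when both operands are fake-quantized per tensor."""
    approx = fake_quant_per_tensor(X) @ fake_quant_per_tensor(W).T
    return rel_error(approx, X @ W.T)

rng = np.random.default_rng(0)                # assumed seed, for repeatability
X_tame = rng.standard_normal((64, 256))       # (tokens, in_channels)
W = rng.standard_normal((128, 256))           # (out_channels, in_channels)

X_loud = X_tame.copy()
X_loud[:, [7, 100]] *= 80.0                   # two channels run 80x louder

err_tame = naive_w8a8_error(X_tame, W)
err_loud = naive_w8a8_error(X_loud, W)
print(f"tame: {err_tame:.4f}   with outliers: {err_loud:.4f}")
```

The only difference between the two runs is the pair of loud channels, so any gap in the printed errors is attributable to the shared activation scale.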
The migration, per input channel j (SmoothQuant eq. 4):
s_j = max|X[:, j]|^α / max|W[:, j]|^(1−α)
X̂[:, j] = X[:, j] / s_j
Ŵ[:, j] = W[:, j] · s_j
X̂ Ŵᵀ ≡ X Wᵀ
α splits the difficulty: α = 1 dumps all activation loudness into the weights
(overloading their quantizer); α = 0 ignores the activation statistics entirely
(s_j = 1 / max|W[:, j]|), so the outliers stay in X; α ≈ 0.5 balances, leaving channel j
of X̂ and of Ŵ with the same maximum, √(max|X[:, j]| · max|W[:, j]|). In practice s is
computed once offline from
calibration activations and folded into the previous layer's weights (LayerNorm gain
or prior linear), so runtime sees zero extra ops. The smoothing is free at inference;
that's why it shipped everywhere.
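The migration formula can be sketched in a few lines of NumPy. This is one possible reading of eq. 4, not the reference `solution.py`: the signature of `smooth`, the `eps` guard against dead channels, and the toy data are assumptions. It checks the two properties claimed above: the product is unchanged up to rounding, and at α = 0.5 each channel of X̂ and Ŵ shares the same maximum.

```python
import numpy as np

def smooth(X, W, alpha=0.5, eps=1e-8):
    """Migrate per-channel magnitude from X (tokens, in) to W (out, in).

    Returns (X_hat, W_hat, s) with X_hat @ W_hat.T equal to X @ W.T
    up to floating-point rounding.
    """
    act_max = np.maximum(np.abs(X).max(axis=0), eps)   # max|X[:, j]|
    w_max = np.maximum(np.abs(W).max(axis=0), eps)     # max|W[:, j]|
    s = act_max ** alpha / w_max ** (1.0 - alpha)      # one factor per input channel
    return X / s, W * s, s

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 256))
X[:, [7, 100]] *= 80.0                                 # toy outlier channels
W = rng.standard_normal((128, 256))

X_hat, W_hat, s = smooth(X, W, alpha=0.5)

# Exactness of the reparametrization (no quantization involved yet).
max_abs_diff = np.abs(X_hat @ W_hat.T - X @ W.T).max()

# At alpha = 0.5 both tensors end up with identical per-channel maxima.
x_ch = np.abs(X_hat).max(axis=0)
w_ch = np.abs(W_hat).max(axis=0)
print(max_abs_diff, np.allclose(x_ch, w_ch))
```

Broadcasting does the per-channel work: `s` has shape `(in,)`, so `X / s` scales columns of X and `W * s` scales columns of W.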
Files
- `starter.py`: `quantize_per_tensor`, `fake_quant`, `w8a8_matmul`, `smooth`. Your work.
- `solution.py`: reference.
- `test_lab.py`: exactness of the reparametrization, the cliff, the rescue, the no-outlier control arm, and proof the magnitude actually moved.
Run
LAB_IMPL=starter pytest phase-06-quantization/labs/lab-04-activation-outliers-smoothquant -q
pytest phase-06-quantization/labs/lab-04-activation-outliers-smoothquant -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| `test_smoothing_is_mathematically_exact` | X̂ Ŵᵀ = X Wᵀ to 1e-10: the migration is a reparametrization, not an approximation. Establish this before measuring anything else (the experimental hygiene point: separate the exact transform from the lossy quantization, or you can't attribute the error) |
| `test_outliers_wreck_naive_w8a8` | The cliff: two loud channels out of 256 push matmul error past 5% |
| `test_smoothing_rescues_w8a8` | The headline: same inputs, error drops > 3× (typically ~10×) after migration. The SmoothQuant result, reproduced in 30 lines |
| `test_no_outliers_means_little_to_gain` | The control arm: tame activations quantize fine raw (< 2% error), and smoothing changes ~nothing. The fix targets a specific pathology; on healthy tensors it's inert, which is exactly what you want from an always-on transform |
| `test_migration_actually_moved_the_magnitude` | Mechanism check, not just outcome: X's loudest-to-median channel ratio collapses > 5×, W's max grows. The "where it went" of the migration |
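The rescue, the mechanism check, and the control arm from the table can be approximated in one small harness. Everything here is an assumption for illustration: the helper names, the data, and the use of per-tensor scales for the weights as well as the activations. Because of that last choice the improvement may be smaller than what the lab's own setup reports, so the thresholds in `test_lab.py` should not be read off this sketch.

```python
import numpy as np

def fake_quant(t):
    """Symmetric per-tensor int8 quantize-dequantize."""
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127) * scale

def w8a8_error(X, W):
    """Relative error of the fake-quantized matmul against the exact one."""
    exact = X @ W.T
    approx = fake_quant(X) @ fake_quant(W).T
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

def smooth(X, W, alpha=0.5):
    """Eq. 4 migration; no zero guard because the toy data has no dead channels."""
    s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=0) ** (1.0 - alpha)
    return X / s, W * s

def loudest_to_median(X):
    """Ratio of the loudest channel's max |value| to the median channel's."""
    ch = np.abs(X).max(axis=0)
    return ch.max() / np.median(ch)

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 256))
X_tame = rng.standard_normal((64, 256))
X_loud = X_tame.copy()
X_loud[:, [7, 100]] *= 80.0

# Rescue: same inputs, with and without the migration.
Xs, Ws = smooth(X_loud, W)
err_naive, err_smooth = w8a8_error(X_loud, W), w8a8_error(Xs, Ws)

# Mechanism: the loudness left X and arrived in W.
ratio_before, ratio_after = loudest_to_median(X_loud), loudest_to_median(Xs)

# Control arm: on tame activations smoothing should change very little.
Xt, Wt = smooth(X_tame, W)
err_tame, err_tame_smooth = w8a8_error(X_tame, W), w8a8_error(Xt, Wt)

print(f"outliers: {err_naive:.4f} -> {err_smooth:.4f}")
print(f"X loudest/median: {ratio_before:.1f} -> {ratio_after:.1f}")
print(f"control:  {err_tame:.4f} -> {err_tame_smooth:.4f}")
```

Measuring the channel ratio alongside the error is what separates "the error went down" from "the error went down for the reason we think".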
Hitchhiker's notes
- Why are the outliers there at all? They emerge during training in large transformers (documented from ~6.7B up, LLM.int8() §3) and appear to function as attention/no-op signaling channels; removing them lobotomizes the model. They're also stable: the same channels are loud across inputs, which is precisely what makes offline calibration of `s` possible. A pathology you can calibrate against is an engineering problem; one that moves per-input would have been fatal to W8A8.
- Per-token activation scales (one scale per row of X, computed on the fly) are the other standard mitigation, and what vLLM's fp8 "dynamic" mode does. They handle token-loudness but not channel-loudness (the scale is still shared across the row's channels), which is why smoothing and per-token scales compose rather than compete. Check `upstream/vllm/model_executor/layers/quantization/fp8.py`: the per-tensor vs per-token vs static-scale plumbing in `Fp8LinearMethod` is this exact taxonomy in code.
- FP8 (e4m3) changes the constants, not the story. Floating-point 8-bit has more dynamic range than int8 (exponent bits), so the cliff is shallower: outliers cost precision rather than annihilating it. Hopper's FP8 tensor cores made W8A8 the default "fast mode"; the outlier discipline is why it usually just works now. The analysis you did here is why it sometimes doesn't (extreme models, exotic layers), and what to reach for then.
- Folding `s` into the previous layer is the production detail worth savoring: the division by `s` becomes part of the LayerNorm weights, the multiplication lives in the quantized checkpoint. The runtime graph is identical to the unsmoothed model's. When you diff a SmoothQuant checkpoint against its base, all you see is slightly different numbers; the entire technique hides in plain sight.
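The folding described in the last note can be sketched as follows. This is a toy, assuming a plain LayerNorm with gain `gamma` and bias `beta` feeding a bias-free linear layer; the function and variable names are hypothetical, and the loud `gamma` entries are only a stand-in for trained-in outliers. Note that the bias has to be divided by `s` along with the gain, since both are applied per channel after normalization.

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis with per-channel gain and bias."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
d_in, d_out = 256, 128
h = rng.standard_normal((64, d_in))              # hidden states entering the LayerNorm
gamma = rng.uniform(0.5, 1.5, size=d_in)
gamma[[7, 100]] *= 80.0                          # toy source of loud channels
beta = 0.1 * rng.standard_normal(d_in)
W = rng.standard_normal((d_out, d_in))           # the linear layer that consumes X

# Offline calibration: observe X once, compute s with alpha = 0.5.
X = layer_norm(h, gamma, beta)
s = np.sqrt(np.abs(X).max(axis=0) / np.abs(W).max(axis=0))

# Fold: the division goes into the LayerNorm parameters, the product into W.
gamma_folded, beta_folded = gamma / s, beta / s
W_folded = W * s

# Runtime: same graph, different numbers, no explicit division by s anywhere.
X_hat = layer_norm(h, gamma_folded, beta_folded)
act_diff = np.abs(X_hat - X / s).max()
out_diff = np.abs(X_hat @ W_folded.T - X @ W.T).max()
print(act_diff, out_diff)
```

After folding, the only place `s` ever existed is the calibration script; the checkpoint just holds `gamma_folded`, `beta_folded`, and `W_folded`.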
Going further
- Sweep `α ∈ {0, 0.25, 0.5, 0.75, 1.0}` on the outlier setup and plot W8A8 error. You'll see the U: α too low leaves X hard, too high makes W hard. The paper's 0.5 default is the bottom for typical magnitude ratios; find a setup where 0.75 wins (hint: make the weights unusually tame).
- Implement per-token activation scales (`scale_i = max|X[i]| / 127` per row) and compare: per-token alone vs smoothing alone vs both, on the outlier setup. Reproduces the design space the fp8 backends actually navigate.
- Add a "quantize the smoothed weights with lab-01's per-channel int8" step and verify end-to-end W8A8 error lands near the fp baseline. You've now composed three labs into the actual SmoothQuant pipeline.
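As a starting point for the per-token exercise, here is a hedged sketch of the row-wise quantizer next to the per-tensor one. The function names and the "quiet-channel error" metric are assumptions made for illustration; the full three-way comparison with smoothing is left as stated in the exercise.

```python
import numpy as np

def fake_quant_per_tensor(X):
    """One symmetric int8 scale for the whole activation tensor."""
    scale = np.abs(X).max() / 127.0
    return np.clip(np.round(X / scale), -127, 127) * scale

def fake_quant_per_token(X):
    """One symmetric int8 scale per row: scale_i = max|X[i]| / 127."""
    scale = np.abs(X).max(axis=1, keepdims=True) / 127.0   # shape (tokens, 1)
    return np.clip(np.round(X / scale), -127, 127) * scale

def quiet_rel_error(Xq, X, quiet):
    """Relative quantization error measured on the quiet channels only."""
    diff = (Xq - X)[:, quiet]
    return np.linalg.norm(diff) / np.linalg.norm(X[:, quiet])

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 256))
loud = [7, 100]
X[:, loud] *= 80.0
quiet = np.setdiff1d(np.arange(X.shape[1]), loud)

err_tensor = quiet_rel_error(fake_quant_per_tensor(X), X, quiet)
err_token = quiet_rel_error(fake_quant_per_token(X), X, quiet)
print(f"quiet-channel error  per-tensor: {err_tensor:.3f}  per-token: {err_token:.3f}")
```

Per-token scales help on rows where the loud channels happen to be small, but within any row the loud channel still sets the scale for its quiet neighbours, which is the gap smoothing is meant to close.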
References
- Xiao et al., SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (2022). The migration; eq. 4 is your `smooth`: https://arxiv.org/abs/2211.10438
- Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022). The outlier phenomenon, documented and dissected: https://arxiv.org/abs/2208.07339
- NVIDIA, FP8 Formats for Deep Learning (2022). Why e4m3's range softens the cliff: https://arxiv.org/abs/2209.05433
- `upstream/vllm/model_executor/layers/quantization/fp8.py`, `Fp8LinearMethod`: the scale-mode taxonomy (static/dynamic, per-tensor/per-token) in production form.
- Lab-02: the GPU measurements this lab explains; lab-01: the per-channel weight quantizer the migration relies on.