Lab 06-04 — Activation Outliers & the SmoothQuant Migration `[CPU-OK]`

Labs 01 and 03 quantized weights — tensors you can study offline at your leisure, with any granularity of scales you fancy. This lab quantizes activations, and activations fight back. They exist only at runtime (scales must be cheap — per-tensor or per-token, not per-group), and in real LLMs they carry a famous pathology: a handful of channels run 10–100× louder than the rest, consistently, across all inputs — a trained-in fact of transformer feature geometry, not noise. One per-tensor scale set by the loudest channel crushes everyone else's resolution, and W8A8 accuracy falls off a cliff. You'll reproduce the cliff, then implement the elegant fix from SmoothQuant: you can't delete the outliers, but you can relocate them — migrate magnitude from the hard-to-quantize activations into the easy-to-quantize weights, via a reparametrization that is mathematically a no-op.

Why this lab exists
Background: the outlier problem and the migration
Files
Run
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

This lab answers the question lab-02's GPU numbers raise but don't explain: why is quantization="fp8" (W8A8) a different kind of thing than loading an AWQ checkpoint (W4A16)? Weight-only quant shrinks bytes and leaves all computation in high precision — its only failure mode is weight rounding error, which labs 01/03 showed is tame. W8A8 additionally runs the matmul itself in 8-bit (unlocking FP8/INT8 tensor cores — the throughput jump in lab-02's capture), which means activations must survive quantization too — and they're the hostile party. Every production decision between "fp8 for speed" and "AWQ for memory" is downstream of the asymmetry you'll measure here.

The deeper lesson is the shape of SmoothQuant's fix, because you'll reuse it forever: when a hard constraint can't be removed, look for a reparametrization that moves the difficulty to where you have better tools. Activations only afford one cheap scale; weights afford per-channel scales (lab-01) that eat outliers for breakfast. So divide each activation channel by s_j, multiply the matching weight column by s_j, and the product is mathematically the same function (identical up to floating-point rounding, which is why the test checks it to 1e-10 rather than bit-for-bit). The loudness now lives in the weights, where per-channel scales neutralize it. No retraining, no approximation in the transform itself. The only approximation remains the quantization, now applied to friendlier tensors.

Background: the outlier problem and the migration

Symmetric per-tensor int8: scale = max|X| / 127. With a channel 80× louder than the rest, the quiet channels (which carry most of the information) reach only 127 / 80 ≈ 1.6 quantization steps from zero, so their values land on just a few integer levels around zero. Their contribution to the matmul turns to gravel. That's the cliff (test_outliers_wreck_naive_w8a8: >5% relative matmul error in a single test setup; real perplexity explodes the same way, and the LLM.int8() paper documents the phenomenon at scale).
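The resolution loss described above can be seen directly on random data. The following is a minimal sketch, not the lab's reference solution: the function names mirror those listed for starter.py, but their signatures, the tensor shapes, the seed, and the choice of channel 7 as the loud one are all assumptions made for illustration.

```python
import numpy as np

QMAX = 127  # largest magnitude of a symmetric int8 code

def quantize_per_tensor(x):
    """Symmetric per-tensor int8: one scale shared by every element of x."""
    scale = np.abs(x).max() / QMAX
    q = np.clip(np.round(x / scale), -QMAX, QMAX).astype(np.int8)
    return q, scale

def fake_quant(x):
    """Quantize then dequantize, so the rounding error shows up in float."""
    q, scale = quantize_per_tensor(x)
    return q.astype(np.float64) * scale

def rel_error(approx, exact):
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

rng = np.random.default_rng(0)
X_tame = rng.standard_normal((64, 256))   # (tokens, channels), assumed shape
X_loud = X_tame.copy()
X_loud[:, 7] *= 80.0                      # one channel 80x louder (assumed setup)

# Measure the damage on the quiet channels only.
quiet = np.arange(256) != 7
err_tame = rel_error(fake_quant(X_tame)[:, quiet], X_tame[:, quiet])
err_loud = rel_error(fake_quant(X_loud)[:, quiet], X_loud[:, quiet])

# How many distinct int8 codes do the quiet channels actually use?
q_loud, _ = quantize_per_tensor(X_loud)
levels_used = np.unique(q_loud[:, quiet]).size

print(f"quiet-channel error, no outlier:   {err_tame:.4f}")
print(f"quiet-channel error, with outlier: {err_loud:.4f}")
print(f"distinct int8 codes used by quiet channels: {levels_used} of 255")
```

Restricting the error to the quiet channels isolates the effect the paragraph describes: the loud channel itself quantizes fine, and it is everyone else who pays for its scale.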

The migration, per input channel j (SmoothQuant eq. 4):

s_j = max|X[:, j]|^α / max|W[:, j]|^(1−α)
X̂[:, j] = X[:, j] / s_j        Ŵ[:, j] = W[:, j] · s_j        X̂ Ŵᵀ ≡ X Wᵀ
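Writing out a single output entry may make the identity less mysterious. The index names i (token) and k (output channel) are introduced here for illustration, and every s_j is assumed to be strictly positive:

```latex
\[
(\hat{X}\hat{W}^{\top})_{ik}
  = \sum_j \hat{X}_{ij}\,\hat{W}_{kj}
  = \sum_j \frac{X_{ij}}{s_j}\,\bigl(W_{kj}\,s_j\bigr)
  = \sum_j X_{ij}\,W_{kj}
  = (X W^{\top})_{ik}
\]
\[
\max_i \lvert \hat{X}_{ij} \rvert
  = \bigl(\max_i \lvert X_{ij} \rvert\bigr)^{1-\alpha}
    \bigl(\max_k \lvert W_{kj} \rvert\bigr)^{1-\alpha},
\qquad
\max_k \lvert \hat{W}_{kj} \rvert
  = \bigl(\max_i \lvert X_{ij} \rvert\bigr)^{\alpha}
    \bigl(\max_k \lvert W_{kj} \rvert\bigr)^{\alpha}
\]
```

The second line follows from substituting the definition of s_j into the per-channel maxima; at α = 1/2 the two right-hand sides coincide.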

α splits the difficulty: α = 1 gives s_j = max|X[:, j]| and dumps all activation loudness into the weights (overloading their quantizer); α = 0 gives s_j = 1 / max|W[:, j]|, which migrates in the opposite direction, flattening the weight columns while leaving the activation outliers in place; α ≈ 0.5 balances, giving X̂ and Ŵ the same per-channel maximum, √(max|X[:, j]| · max|W[:, j]|). In practice s is computed once offline from calibration activations and folded into the previous layer's weights (LayerNorm gain or prior linear), so runtime sees zero extra ops. The smoothing is free at inference; that's why it shipped everywhere.
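Put together, the migration and its effect on a per-tensor W8A8 matmul might look like the sketch below. It is a hedged illustration rather than the lab's solution.py: the signatures of smooth and w8a8_matmul, the return of s alongside the smoothed tensors, and the synthetic setup (two loud channels at indices 7 and 100) are assumptions, and the calibration statistics are taken from the same X that is then quantized.

```python
import numpy as np

QMAX = 127

def fake_quant(x):
    """Symmetric per-tensor int8 quantize/dequantize."""
    scale = np.abs(x).max() / QMAX
    return np.clip(np.round(x / scale), -QMAX, QMAX) * scale

def w8a8_matmul(X, W):
    """Simulated W8A8: both operands pass through int8 before the matmul."""
    return fake_quant(X) @ fake_quant(W).T

def smooth(X, W, alpha=0.5):
    """Per-input-channel migration. X is (tokens, in), W is (out, in)."""
    x_max = np.abs(X).max(axis=0)              # loudness of each activation channel
    w_max = np.abs(W).max(axis=0)              # loudness of each weight column
    s = x_max ** alpha / w_max ** (1.0 - alpha)
    return X / s, W * s, s

def rel_error(approx, exact):
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 256))
X[:, [7, 100]] *= 80.0                         # two loud channels (assumed setup)
W = rng.standard_normal((128, 256))

Y = X @ W.T                                    # high-precision reference
X_s, W_s, s = smooth(X, W, alpha=0.5)

exactness = rel_error(X_s @ W_s.T, Y)          # transform alone: rounding noise only
err_naive = rel_error(w8a8_matmul(X, W), Y)
err_smooth = rel_error(w8a8_matmul(X_s, W_s), Y)

print(f"reparametrization error: {exactness:.2e}")
print(f"naive W8A8 error:        {err_naive:.4f}")
print(f"smoothed W8A8 error:     {err_smooth:.4f}")
```

Measuring exactness before either quantized error keeps the two effects separate: if the first number were not at rounding-noise level, the other two could not be attributed to quantization.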

Files

starter.py — quantize_per_tensor, fake_quant, w8a8_matmul, smooth. Your work.
solution.py — reference.
test_lab.py — exactness of the reparametrization, the cliff, the rescue, the no-outlier control arm, and proof the magnitude actually moved.

Run

LAB_IMPL=starter pytest phase-06-quantization/labs/lab-04-activation-outliers-smoothquant -q
pytest phase-06-quantization/labs/lab-04-activation-outliers-smoothquant -q   # reference

What the tests prove

Test	What it pins
`test_smoothing_is_mathematically_exact`	`X̂ Ŵᵀ = X Wᵀ` to 1e-10 — the migration is a reparametrization, not an approximation. Establish this before measuring anything else (the experimental hygiene point: separate the exact transform from the lossy quantization, or you can't attribute the error)
`test_outliers_wreck_naive_w8a8`	The cliff: two loud channels out of 256 push matmul error past 5%
`test_smoothing_rescues_w8a8`	The headline: same inputs, error drops > 3× (typically ~10×) after migration — the SmoothQuant result, reproduced in 30 lines
`test_no_outliers_means_little_to_gain`	The control arm: tame activations quantize fine raw (< 2% error), and smoothing changes ~nothing. The fix targets a specific pathology; on healthy tensors it's inert — which is exactly what you want from an always-on transform
`test_migration_actually_moved_the_magnitude`	Mechanism check, not just outcome: X's loudest-to-median channel ratio collapses > 5×, W's max grows. This is the "where it went" of the migration
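The mechanism check in the last row can be approximated with a small metric. This is a sketch under assumptions: the helper name loud_to_median_ratio is hypothetical (the real test may compute the ratio differently), and the data is the same kind of synthetic two-outlier setup used for illustration, not the test's fixture.

```python
import numpy as np

def loud_to_median_ratio(X):
    """Loudest channel's max magnitude divided by the median channel's."""
    per_channel_max = np.abs(X).max(axis=0)
    return per_channel_max.max() / np.median(per_channel_max)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 256))
X[:, [7, 100]] *= 80.0                     # assumed outlier channels
W = rng.standard_normal((128, 256))

# alpha = 0.5 smoothing scales, one per input channel
s = np.sqrt(np.abs(X).max(axis=0) / np.abs(W).max(axis=0))
X_s, W_s = X / s, W * s

ratio_before = loud_to_median_ratio(X)
ratio_after = loud_to_median_ratio(X_s)
w_max_before = np.abs(W).max()
w_max_after = np.abs(W_s).max()

print(f"X loud/median ratio: {ratio_before:.1f} -> {ratio_after:.1f}")
print(f"W max magnitude:     {w_max_before:.2f} -> {w_max_after:.2f}")
```

Checking both sides matters: a falling ratio in X with no growth in W would mean the magnitude was lost rather than moved.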

Hitchhiker's notes

Why are the outliers there at all? They emerge during training in large transformers (documented from ~6.7B up, LLM.int8() §4) and appear to function as attention/no-op signaling channels; removing them lobotomizes the model. They're also stable: the same channels are loud across inputs, which is precisely what makes offline calibration of s possible. A pathology you can calibrate against is an engineering problem; one that moves per-input would have been fatal to W8A8.
Per-token activation scales (one scale per row of X, computed on the fly) are the other standard mitigation, and what vLLM's fp8 "dynamic" mode does — they handle token-loudness but not channel-loudness (the scale is still shared across the row's channels), which is why smoothing and per-token scales compose rather than compete. Check upstream/vllm/model_executor/layers/quantization/fp8.py — the per-tensor vs per-token vs static-scale plumbing in Fp8LinearMethod is this exact taxonomy in code.
FP8 (e4m3) changes the constants, not the story. Floating-point 8-bit has more dynamic range than int8 (exponent bits), so the cliff is shallower — outliers cost precision rather than annihilating it. Hopper's FP8 tensor cores made W8A8 the default "fast mode"; the outlier discipline is why it usually just works now. The analysis you did here is why it sometimes doesn't (extreme models, exotic layers), and what to reach for then.
Folding s into the previous layer is the production detail worth savoring: the division by s becomes part of the LayerNorm weights, the multiplication lives in the quantized checkpoint. The runtime graph is identical to the unsmoothed model's. When you diff a SmoothQuant checkpoint against its base, all you see is slightly different numbers — the entire technique hides in plain sight.
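The folding described in the last note can be sketched for a LayerNorm followed by a linear layer. Everything concrete here is an assumption for illustration: the layer_norm helper, the shapes, and the artificially large gain on channel 3 that creates the loud channel. Note that the LayerNorm bias has to be divided by s along with the gain.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Plain LayerNorm over the last axis with per-channel gain and bias."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
d_in, d_out = 64, 32
h = rng.standard_normal((16, d_in))        # hidden states entering the block
gamma = rng.uniform(0.5, 2.0, d_in)
gamma[3] = 50.0                            # a gain that produces one loud channel
beta = 0.1 * rng.standard_normal(d_in)
W = rng.standard_normal((d_out, d_in))     # the linear layer after the LayerNorm

# Offline calibration: observe the linear layer's input, compute s (alpha = 0.5).
X = layer_norm(h, gamma, beta)
s = np.sqrt(np.abs(X).max(axis=0) / np.abs(W).max(axis=0))

# Fold: the division by s goes into the LayerNorm, the multiplication into W.
gamma_folded = gamma / s
beta_folded = beta / s
W_folded = W * s

# Same graph as before, only the stored parameters differ.
X_folded = layer_norm(h, gamma_folded, beta_folded)
y_ref = X @ W.T
y_folded = X_folded @ W_folded.T

print("outputs match:", np.allclose(y_ref, y_folded))
print(f"max |activation|: {np.abs(X).max():.1f} -> {np.abs(X_folded).max():.1f}")
```

No call to a smoothing function appears in the forward path, which is the point: once folded, s exists only as different numbers in gamma, beta, and W.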

Going further

Sweep α ∈ {0, 0.25, 0.5, 0.75, 1.0} on the outlier setup and plot W8A8 error. You'll see the U: α too low leaves X hard, too high makes W hard. The paper's 0.5 default is the bottom for typical magnitude ratios — find a setup where 0.75 wins (hint: make the weights unusually tame).
Implement per-token activation scales (scale_i = max|X[i]| / 127 per row) and compare: per-token alone vs smoothing alone vs both, on the outlier setup. Reproduces the design space the fp8 backends actually navigate.
Add a "quantize the smoothed weights with lab-01's per-channel int8" step and verify end-to-end W8A8 error lands near the fp baseline — you've now composed three labs into the actual SmoothQuant pipeline.
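As a possible starting point for the second exercise, here is a minimal per-token fake-quantizer and a check of the claim that it handles token-loudness but not channel-loudness. The function names, shapes, and the two synthetic setups (loud row 5, loud column 7) are assumptions; the comparison against smoothing is left to the exercise.

```python
import numpy as np

QMAX = 127

def fake_quant_per_tensor(x):
    """One scale for the whole tensor."""
    scale = np.abs(x).max() / QMAX
    return np.clip(np.round(x / scale), -QMAX, QMAX) * scale

def fake_quant_per_token(x):
    """One scale per row (token), computed on the fly from that row's max."""
    scale = np.abs(x).max(axis=1, keepdims=True) / QMAX   # shape (tokens, 1)
    return np.clip(np.round(x / scale), -QMAX, QMAX) * scale

def rel_error(approx, exact):
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

rng = np.random.default_rng(0)
base = rng.standard_normal((64, 256))

# Case 1: one loud token. Error is measured on the other rows.
X_tok = base.copy()
X_tok[5, :] *= 80.0
rows = np.arange(64) != 5
tok_tensor = rel_error(fake_quant_per_tensor(X_tok)[rows], X_tok[rows])
tok_token = rel_error(fake_quant_per_token(X_tok)[rows], X_tok[rows])

# Case 2: one loud channel. Error is measured on the other columns.
X_ch = base.copy()
X_ch[:, 7] *= 80.0
cols = np.arange(256) != 7
ch_tensor = rel_error(fake_quant_per_tensor(X_ch)[:, cols], X_ch[:, cols])
ch_token = rel_error(fake_quant_per_token(X_ch)[:, cols], X_ch[:, cols])

print(f"loud token:   per-tensor {tok_tensor:.4f}, per-token {tok_token:.4f}")
print(f"loud channel: per-tensor {ch_tensor:.4f}, per-token {ch_token:.4f}")
```

In the loud-channel case every row contains the outlier, so every row's scale is still set by it; that is the gap smoothing is meant to close.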

References

Xiao et al., SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (2022) — the migration, eq. 4 is your smooth: https://arxiv.org/abs/2211.10438
Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022) — the outlier phenomenon, documented and dissected: https://arxiv.org/abs/2208.07339
NVIDIA, FP8 Formats for Deep Learning (2022) — why e4m3's range softens the cliff: https://arxiv.org/abs/2209.05433
upstream/vllm/model_executor/layers/quantization/fp8.py — Fp8LinearMethod: the scale-mode taxonomy (static/dynamic, per-tensor/per-token) in production form.
Lab-02 — the GPU measurements this lab explains; lab-01 — the per-channel weight quantizer the migration relies on.

vLLM Mastery — From Zero to Maintainer