Lab 06-04 — Activation Outliers & the SmoothQuant Migration [CPU-OK]

Labs 01 and 03 quantized weights — tensors you can study offline at your leisure, with any granularity of scales you fancy. This lab quantizes activations, and activations fight back. They exist only at runtime (scales must be cheap — per-tensor or per-token, not per-group), and in real LLMs they carry a famous pathology: a handful of channels run 10–100× louder than the rest, consistently, across all inputs — a trained-in fact of transformer feature geometry, not noise. One per-tensor scale set by the loudest channel crushes everyone else's resolution, and W8A8 accuracy falls off a cliff. You'll reproduce the cliff, then implement the elegant fix from SmoothQuant: you can't delete the outliers, but you can relocate them — migrate magnitude from the hard-to-quantize activations into the easy-to-quantize weights, via a reparametrization that is mathematically a no-op.

Why this lab exists

This lab answers the question lab-02's GPU numbers raise but don't explain: why is quantization="fp8" (W8A8) a different kind of thing than loading an AWQ checkpoint (W4A16)? Weight-only quant shrinks bytes and leaves all computation in high precision — its only failure mode is weight rounding error, which labs 01/03 showed is tame. W8A8 additionally runs the matmul itself in 8-bit (unlocking FP8/INT8 tensor cores — the throughput jump in lab-02's capture), which means activations must survive quantization too — and they're the hostile party. Every production decision between "fp8 for speed" and "AWQ for memory" is downstream of the asymmetry you'll measure here.

The deeper lesson is the shape of SmoothQuant's fix, because you'll reuse it forever: when a hard constraint can't be removed, look for a reparametrization that moves the difficulty to where you have better tools. Activations only afford one cheap scale; weights afford per-channel scales (lab-01) that eat outliers for breakfast. So divide each activation channel by s_j and multiply the matching weight column by s_j. The product is mathematically the same function (identical up to floating-point rounding, not literally bit-for-bit), but the loudness now lives in the weights, where per-channel scales neutralize it. No retraining, no approximation in the transform itself. The only approximation remains the quantization, now applied to friendlier tensors.

Background: the outlier problem and the migration

Symmetric per-tensor int8: scale = max|X| / 127. With a channel 80× louder than the rest, the quiet channels — which carry most of the information — get 127 / 80 ≈ 1.6 effective levels. Their contribution to the matmul turns to gravel. That's the cliff (test_outliers_wreck_naive_w8a8: >5% relative matmul error from one setup; real perplexity explodes the same way — the LLM.int8() paper documents the phenomenon at scale).
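To make the cliff concrete, here is a minimal NumPy sketch of it. This is not the lab's reference code: the helper names, shapes, seed, and outlier channel indices are all assumptions chosen for illustration, and both tensors are fake-quantized (quantize, then dequantize) with a single per-tensor scale each.

```python
import numpy as np

def fake_quant(t, bits=8):
    """Symmetric per-tensor quantize-then-dequantize with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for int8
    scale = np.abs(t).max() / qmax              # set by the loudest element
    return np.clip(np.round(t / scale), -qmax, qmax) * scale

def w8a8_error(X, W):
    """Relative Frobenius error of the fake-quantized matmul vs. float64."""
    exact = X @ W.T
    approx = fake_quant(X) @ fake_quant(W).T
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

rng = np.random.default_rng(0)                  # assumed seed
X_tame = rng.standard_normal((64, 256))         # tokens x input channels
W = rng.standard_normal((128, 256))             # output x input channels

X_loud = X_tame.copy()
X_loud[:, [7, 100]] *= 80.0                     # two channels 80x louder (indices are arbitrary)

err_tame = w8a8_error(X_tame, W)
err_loud = w8a8_error(X_loud, W)

# Quiet channels span roughly 127 / 80 quantization levels once the scale
# is dictated by a channel 80x louder than they are.
print(f"levels left for quiet channels: {127 / 80:.1f}")
print(f"tame: {err_tame:.2%}   with outliers: {err_loud:.2%}")
```

The exact percentages depend on the random draw, but the gap between the two arms should be large: the same quantizer that is harmless on tame activations loses most of the quiet channels once the outliers set the scale.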

The migration, per input channel j (SmoothQuant eq. 4):

s_j = max|X[:, j]|^α / max|W[:, j]|^(1−α)
X̂[:, j] = X[:, j] / s_j        Ŵ[:, j] = W[:, j] · s_j        X̂ Ŵᵀ ≡ X Wᵀ

α splits the difficulty. α = 1 dumps all activation loudness into the weights (overloading their quantizer). α = 0 does the opposite: s_j = 1 / max|W[:, j]| flattens the weight channels and leaves the activation outliers in place, so it does nothing for the problem at hand. α ≈ 0.5 balances: after the migration, channel j of both tensors has the same maximum, √(max|X[:, j]| · max|W[:, j]|). In practice s is computed once offline from calibration activations and folded into the previous layer's weights (LayerNorm gain or prior linear), so runtime sees zero extra ops. The smoothing is free at inference; that's why it shipped everywhere.
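The migration above can be sketched in a few lines of NumPy. Treat this as one possible reading of eq. 4, not the lab's solution.py: the function signatures, the shapes (W stored as output × input, so column j is input channel j), the seed, and the synthetic outlier channels are assumptions. The sketch checks exactness first and only then measures quantization error, with per-tensor fake quantization on both tensors.

```python
import numpy as np

def fake_quant(t, bits=8):
    """Symmetric per-tensor quantize-then-dequantize with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max() / qmax
    return np.clip(np.round(t / scale), -qmax, qmax) * scale

def smooth(X, W, alpha=0.5):
    """Per-input-channel migration; returns (X / s, W * s, s).

    No guard against all-zero channels: fine for this random data,
    something a real implementation would need to handle.
    """
    x_max = np.abs(X).max(axis=0)               # shape (in,)
    w_max = np.abs(W).max(axis=0)               # shape (in,), W is (out, in)
    s = x_max ** alpha / w_max ** (1.0 - alpha)
    return X / s, W * s, s

def rel_err(approx, exact):
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

rng = np.random.default_rng(0)                  # assumed seed
X = rng.standard_normal((64, 256))
X[:, [7, 100]] *= 80.0                          # synthetic loud channels
W = rng.standard_normal((128, 256))
Y = X @ W.T

X_s, W_s, s = smooth(X, W, alpha=0.5)

# Step 1: the transform alone is exact up to float rounding.
exactness = rel_err(X_s @ W_s.T, Y)

# Step 2: only now compare the lossy part, before and after migration.
err_naive = rel_err(fake_quant(X) @ fake_quant(W).T, Y)
err_smooth = rel_err(fake_quant(X_s) @ fake_quant(W_s).T, Y)

print(f"reparametrization error: {exactness:.1e}")
print(f"W8A8 naive: {err_naive:.2%}   smoothed: {err_smooth:.2%}")
```

Keeping the exactness check separate from the quantization measurement is the point of the ordering: if the first number is not at rounding level, the second comparison means nothing.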

Files

  • starter.py — quantize_per_tensor, fake_quant, w8a8_matmul, smooth. Your work.
  • solution.py — reference.
  • test_lab.py — exactness of the reparametrization, the cliff, the rescue, the no-outlier control arm, and proof the magnitude actually moved.

Run

LAB_IMPL=starter pytest phase-06-quantization/labs/lab-04-activation-outliers-smoothquant -q
pytest phase-06-quantization/labs/lab-04-activation-outliers-smoothquant -q   # reference

What the tests prove

| Test | What it pins |
|---|---|
| test_smoothing_is_mathematically_exact | X̂ Ŵᵀ = X Wᵀ to 1e-10 — the migration is a reparametrization, not an approximation. Establish this before measuring anything else (the experimental hygiene point: separate the exact transform from the lossy quantization, or you can't attribute the error) |
| test_outliers_wreck_naive_w8a8 | The cliff: two loud channels out of 256 push matmul error past 5% |
| test_smoothing_rescues_w8a8 | The headline: same inputs, error drops > 3× (typically ~10×) after migration — the SmoothQuant result, reproduced in 30 lines |
| test_no_outliers_means_little_to_gain | The control arm: tame activations quantize fine raw (< 2% error), and smoothing changes ~nothing. The fix targets a specific pathology; on healthy tensors it's inert — which is exactly what you want from an always-on transform |
| test_migration_actually_moved_the_magnitude | Mechanism check, not just outcome: X's loudest-to-median channel ratio collapses > 5×, W's max grows. The "where it went" of the migration |
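The mechanism check in the last row can be approximated with a short sketch. The loudest-to-median metric here is one plausible reading of the test description, not a copy of test_lab.py; the helper name, the seed, and the outlier channel indices are assumptions.

```python
import numpy as np

def loudest_to_median(T):
    """Loudest channel's max |value| divided by the median channel's."""
    ch_max = np.abs(T).max(axis=0)              # one max per input channel
    return ch_max.max() / np.median(ch_max)

rng = np.random.default_rng(0)                  # assumed seed
X = rng.standard_normal((64, 256))
X[:, [7, 100]] *= 80.0                          # two loud channels out of 256
W = rng.standard_normal((128, 256))

# alpha = 0.5: s_j = sqrt(max|X[:, j]| / max|W[:, j]|)
s = np.sqrt(np.abs(X).max(axis=0) / np.abs(W).max(axis=0))
X_s, W_s = X / s, W * s

ratio_before = loudest_to_median(X)
ratio_after = loudest_to_median(X_s)
w_max_before = np.abs(W).max()
w_max_after = np.abs(W_s).max()

print(f"X loudest/median: {ratio_before:.1f} -> {ratio_after:.1f}")
print(f"W max:            {w_max_before:.1f} -> {w_max_after:.1f}")
```

Both halves matter: the activation ratio shrinking shows the outliers left X, and the weight maximum growing shows where they landed.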

Hitchhiker's notes

  • Why are the outliers there at all? They emerge during training in large transformers (documented from ~6.7B up, LLM.int8() §3) and appear to function as attention/no-op signaling channels — removing them lobotomizes the model. They're also stable: the same channels are loud across inputs, which is precisely what makes offline calibration of s possible. A pathology you can calibrate against is an engineering problem; one that moves per-input would have been fatal to W8A8.
  • Per-token activation scales (one scale per row of X, computed on the fly) are the other standard mitigation, and what vLLM's fp8 "dynamic" mode does — they handle token-loudness but not channel-loudness (the scale is still shared across the row's channels), which is why smoothing and per-token scales compose rather than compete. Check upstream/vllm/model_executor/layers/quantization/fp8.py — the per-tensor vs per-token vs static-scale plumbing in Fp8LinearMethod is this exact taxonomy in code.
  • FP8 (e4m3) changes the constants, not the story. Floating-point 8-bit has more dynamic range than int8 (exponent bits), so the cliff is shallower — outliers cost precision rather than annihilating it. Hopper's FP8 tensor cores made W8A8 the default "fast mode"; the outlier discipline is why it usually just works now. The analysis you did here is why it sometimes doesn't (extreme models, exotic layers), and what to reach for then.
  • Folding s into the previous layer is the production detail worth savoring: the division by s becomes part of the LayerNorm weights, the multiplication lives in the quantized checkpoint. The runtime graph is identical to the unsmoothed model's. When you diff a SmoothQuant checkpoint against its base, all you see is slightly different numbers — the entire technique hides in plain sight.
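The folding described in the last note can be checked numerically. This is a sketch with a hand-written LayerNorm, not vLLM or SmoothQuant code; the shapes, the seed, and the trick of planting a large gain to create a loud channel are assumptions. One detail follows directly from the algebra: the LayerNorm bias has to be divided by s along with the gain, otherwise the folded layer no longer equals the original activation divided by s.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Plain LayerNorm over the last axis with elementwise gain and bias."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)                  # assumed seed
d_in, d_out = 256, 128
H = rng.standard_normal((64, d_in))             # hidden states entering LayerNorm
gamma = np.ones(d_in)
gamma[7] = 80.0                                 # a large gain yields a loud channel
beta = 0.1 * rng.standard_normal(d_in)
W = rng.standard_normal((d_out, d_in))          # the linear layer after LayerNorm

# Offline calibration: s from the LayerNorm output and the weights (alpha = 0.5).
A = layer_norm(H, gamma, beta)
s = np.sqrt(np.abs(A).max(axis=0) / np.abs(W).max(axis=0))

# Fold: the division goes into gain AND bias, the multiplication into W.
gamma_f, beta_f, W_f = gamma / s, beta / s, W * s

Y_base = layer_norm(H, gamma, beta) @ W.T
Y_fold = layer_norm(H, gamma_f, beta_f) @ W_f.T  # same graph, different numbers
A_fold = layer_norm(H, gamma_f, beta_f)

print(f"max |Y_fold - Y_base|: {np.abs(Y_fold - Y_base).max():.1e}")
print(f"activation max: {np.abs(A).max():.1f} -> {np.abs(A_fold).max():.1f}")
```

The two forward passes call the same functions in the same order; only the stored parameters differ, which is the "hides in plain sight" property.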

Going further

  • Sweep α ∈ {0, 0.25, 0.5, 0.75, 1.0} on the outlier setup and plot W8A8 error. You'll see the U: α too low leaves X hard, too high makes W hard. The paper's 0.5 default is the bottom for typical magnitude ratios — find a setup where 0.75 wins (hint: make the weights unusually tame).
  • Implement per-token activation scales (scale_i = max|X[i]| / 127 per row) and compare: per-token alone vs smoothing alone vs both, on the outlier setup. Reproduces the design space the fp8 backends actually navigate.
  • Add a "quantize the smoothed weights with lab-01's per-channel int8" step and verify end-to-end W8A8 error lands near the fp baseline — you've now composed three labs into the actual SmoothQuant pipeline.
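As a starting point for the α sweep, here is a hedged sketch that prints the errors instead of plotting them. It assumes per-tensor fake quantization for both tensors and the same synthetic outlier setup used for illustration elsewhere (shapes, seed, and channel indices are not taken from the lab files). Finding a setup where 0.75 wins is left as the exercise intends.

```python
import numpy as np

def fake_quant(t, bits=8):
    """Symmetric per-tensor quantize-then-dequantize with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max() / qmax
    return np.clip(np.round(t / scale), -qmax, qmax) * scale

def w8a8_error_at(X, W, alpha):
    """W8A8 relative error after migrating with the given alpha."""
    s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=0) ** (1.0 - alpha)
    X_s, W_s = X / s, W * s
    exact = X @ W.T                             # the transform leaves this unchanged
    approx = fake_quant(X_s) @ fake_quant(W_s).T
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

rng = np.random.default_rng(0)                  # assumed seed
X = rng.standard_normal((64, 256))
X[:, [7, 100]] *= 80.0                          # synthetic loud channels
W = rng.standard_normal((128, 256))

alphas = (0.0, 0.25, 0.5, 0.75, 1.0)
errors = {a: w8a8_error_at(X, W, a) for a in alphas}

for a, e in errors.items():
    print(f"alpha = {a:<4}  W8A8 error = {e:.2%}")
```

With weights and quiet activations of similar magnitude, the middle of the range should come out well below both ends, since either extreme leaves one tensor carrying the full outlier ratio.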

References

  • Xiao et al., SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (2022) — the migration, eq. 4 is your smooth: https://arxiv.org/abs/2211.10438
  • Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022) — the outlier phenomenon, documented and dissected: https://arxiv.org/abs/2208.07339
  • NVIDIA, FP8 Formats for Deep Learning (2022) — why e4m3's range softens the cliff: https://arxiv.org/abs/2209.05433
  • upstream/vllm/model_executor/layers/quantization/fp8.py — Fp8LinearMethod: the scale-mode taxonomy (static/dynamic, per-tensor/per-token) in production form.
  • Lab-02 — the GPU measurements this lab explains; lab-01 — the per-channel weight quantizer the migration relies on.