Lab 06-01 — Int8 Per-Channel Fake-Quant Linear [CPU-OK]
Strip away the format zoo — FP8, AWQ, GPTQ, GGUF, NVFP4, compressed-tensors — and every quantization scheme in vLLM reduces to the same three-step dance you'll build here: pick a scale, round to a grid, multiply back when you compute. This lab implements the smallest version with real teeth (int8, symmetric, per-channel) and measures the only two numbers anyone actually cares about: bytes saved (~4×) and accuracy lost (<1% — if you choose scales wisely, which is the lab's central drama). The per-channel-vs-per-tensor showdown you'll run on an outlier matrix is, in miniature, the design argument behind half the quantization literature.
Contents
- Why this lab exists
- Background: quantization is a grid and a scale
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Quantization has the worst signal-to-jargon ratio in inference engineering. Engineers who can deploy AWQ models often can't answer "what is a scale?", and that gap becomes expensive the day quality regresses after a quantization change and nobody can reason about why. The cure is to implement the primitive once, small enough to hold in your head: ~20 lines of numpy, every choice explicit. After this lab, every format in the zoo parses as "the same dance with different answers to three questions" — what grid (int8/int4/fp8), what granularity of scale (tensor/channel/group/token), what gets quantized (weights only, or activations too). Labs 03 and 04 then vary exactly those answers.
"Fake quant" (quantize, then dequantize back to float for the matmul) is the standard study technique, and worth understanding as such: it isolates the rounding error (the accuracy question) from the kernel speedup (the performance question, which needs real int8 hardware paths; lab-02 measures that side). For weight-only quantization, fake quant and a real quantized kernel compute the same result up to floating-point accumulation order; one of them just tells you the truth on a laptop.
Background: quantization is a grid and a scale
Symmetric int8 quantization of a tensor region: scale = max|w| / 127, then
q = round(w / scale) — every value snapped to the nearest of 255 grid points spanning
[−max, +max]. The error per value is at most scale/2, so everything reduces to
making scale small, and scale is set by the loudest value the scale must cover.
Hence granularity:
- Per-tensor: one scale. The loudest weight in the matrix sets the resolution for every weight. One outlier row at 100× the typical magnitude coarsens everyone else's grid by 100×.
- Per-channel (one scale per output row): an outlier row only ruins itself, and it doesn't even do that, since its own scale fits it. Cost: one fp32 scale per output row, amortized to nothing against the weights. This is why per-channel is the baseline for weights, and the comparison test makes the argument with data.
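The granularity argument can be checked in a few lines. The following is a minimal sketch, not the lab's reference solution: `fake_quant` and `rel_err` are hypothetical helper names, and the matrix shape, seed, and 100× outlier factor are assumptions chosen for illustration.

```python
import numpy as np

def fake_quant(w, scale):
    """Snap w to the symmetric int8 grid defined by scale, then map back to float."""
    q = np.clip(np.rint(w / scale), -127, 127)
    return q * scale

def rel_err(a, b):
    """Relative Frobenius-norm error of b as an approximation of a."""
    return np.linalg.norm(a - b) / np.linalg.norm(a)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128))
w[0] *= 100.0  # one outlier row, 100x louder than the rest

# Per-tensor: one scalar scale, set by the loudest weight in the matrix.
s_tensor = np.abs(w).max() / 127.0
# Per-channel: one scale per output row, shape (64, 1) so it broadcasts over columns.
s_channel = np.abs(w).max(axis=1, keepdims=True) / 127.0

# Measure the error on the 63 ordinary rows only.
ordinary = slice(1, None)
err_tensor = rel_err(w[ordinary], fake_quant(w, s_tensor)[ordinary])
err_channel = rel_err(w[ordinary], fake_quant(w, s_channel)[ordinary])

print(f"per-tensor  error on ordinary rows: {err_tensor:.4f}")
print(f"per-channel error on ordinary rows: {err_channel:.4f}")
```

With a per-tensor scale, most ordinary weights round to 0 or ±1 grid steps, so their error is on the order of the weights themselves; with per-channel scales each row keeps its own full 255-point grid.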
The memory ledger: int8 weight = 1 byte (vs 4 for fp32), plus out_features fp32
scales — for a 100×100 matrix, 10,000 + 400 bytes vs 40,000: the ~4× in
test_memory_saving_about_4x, and (per Phase 0 lab-04, since decode is
bandwidth-bound) the rough ceiling on weight-only's decode speedup too.
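The ledger arithmetic, as a small sketch. The helper names `int8_bytes` and `fp32_bytes` are hypothetical and are not necessarily the names of the memory helpers in starter.py.

```python
def int8_bytes(out_features, in_features):
    # 1 byte per int8 weight, plus one fp32 scale (4 bytes) per output row.
    return out_features * in_features + 4 * out_features

def fp32_bytes(out_features, in_features):
    # 4 bytes per fp32 weight, no scales.
    return 4 * out_features * in_features

ratio_small = fp32_bytes(100, 100) / int8_bytes(100, 100)
print(f"{ratio_small:.2f}x")  # 40000 / 10400 -> 3.85x

# The scale overhead shrinks as rows get longer, so the ratio approaches 4x.
ratio_large = fp32_bytes(4096, 4096) / int8_bytes(4096, 4096)
```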
Files
- starter.py: quantize_per_channel, quantize_per_tensor, dequantize, quant_linear, memory helpers. Your work.
- solution.py: reference.
- test_lab.py: round-trip error, the 4×, the outlier showdown, matmul accuracy.
Run
LAB_IMPL=starter pytest phase-06-quantization/labs/lab-01-fake-quant-linear -q
pytest phase-06-quantization/labs/lab-01-fake-quant-linear -q # reference
What to implement
Per the formulas in 02-mini-build.md: quantize_per_channel
(scale per output row, max|row|/127, round, clip), quantize_per_tensor (one scalar,
for the showdown), dequantize (scales broadcast back), quant_linear
(x @ dequantize(q, s).T), and the byte accounting. Two details that separate working
from almost-working: guard zero scales (an all-zero row has max|row| = 0, so its scale is 0 and the
division fails; the convention is scale=1 for such rows), and clip after rounding (round(127.4) = 127
but round(127.6) = 128, which is outside the int8 range [−128, 127] and can wrap to −128 on the cast: the classic one-value-corrupted bug).
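Both guards fit in a few lines. This is a sketch under stated assumptions rather than the reference solution: `safe_row_scales` and `to_int8` are hypothetical names, and the second call deliberately uses a scale that does not cover its value so that the clip has something to catch.

```python
import numpy as np

def safe_row_scales(w):
    """Per-row scale max|row| / 127, with scale = 1 for all-zero rows."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    return np.where(scale == 0.0, 1.0, scale)

def to_int8(w, scale):
    """Round to the grid, clip to [-127, 127], and only then cast to int8."""
    q = np.rint(w / scale)
    q = np.clip(q, -127, 127)
    return q.astype(np.int8)

w = np.array([[0.0, 0.0, 0.0],      # all-zero row: scale would be 0 without the guard
              [1.0, -0.4, 0.25]])
s = safe_row_scales(w)
q = to_int8(w, s)

# A scale that does not cover the value: 1.276 / 0.01 is about 127.6, which rounds to 128.
q_clipped = to_int8(np.array([[1.276]]), np.array([[0.01]]))

# What the clip prevents: an integer 128 cast to int8 wraps around.
wrapped = np.array([128], dtype=np.int32).astype(np.int8)
```

Here the clip range is the symmetric [-127, 127], matching the 255-point grid; the one remaining int8 code, -128, is simply unused.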
What the tests prove
| Test | What it pins |
|---|---|
| test_roundtrip_error_small | < 1% relative error for Gaussian weights: int8 per-channel is almost free, which is why "int8 weights hurt quality" is usually a myth and a misconfiguration |
| test_memory_saving_about_4x | The ledger: weights dominate, scales are noise |
| test_per_channel_beats_per_tensor_on_outlier | One row scaled 100×: per-tensor error blows up (the outlier sets everyone's grid), per-channel shrugs. The single most important design fact in the phase; labs 03 and 04 are both elaborations of it |
| test_quant_linear_matches_fp_matmul | The error survives the matmul proportionally: rounding noise stays noise, it doesn't amplify (for well-conditioned inputs; the pathological cases are lab-04's subject) |
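The last row of the table can be reproduced by hand. The sketch below is an illustration with assumed shapes and seed, not a copy of test_lab.py: it compares the relative error of the weights with the relative error of the matmul output.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 64))   # (out_features, in_features)
x = rng.standard_normal((8, 64))    # a batch of 8 Gaussian inputs

# Per-channel int8 quantize, then dequantize back to float.
scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float64) * scale

y_ref = x @ w.T        # full-precision linear layer
y_q = x @ w_hat.T      # fake-quant linear layer

weight_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
output_err = np.linalg.norm(y_ref - y_q) / np.linalg.norm(y_ref)
print(f"weight error {weight_err:.4f}, output error {output_err:.4f}")
```

For Gaussian inputs like these, the two numbers land in the same range, which is the "noise stays noise" claim in concrete form.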
Hitchhiker's notes
- Why scales are per output channel: each output row's weights form one dot product; scaling that row by s scales its output by s, so the dequant multiply can be applied to the result, after the integer matmul, one multiply per output. Scales per input channel wouldn't factor out this way (they'd need to multiply inside the accumulation). Granularity choices in every real format are constrained by "can the scale be applied outside the hot loop?", a kernel-shaped constraint on a math-shaped choice. (Group-wise scales, lab-03, deliberately pay the inside-the-loop cost for resolution.)
- Map to upstream: Fp8LinearMethod.create_weights (fp8.py:316) allocates what your quantize_* produces (weight tensor + scale tensors); apply (fp8.py:437) is your quant_linear with the dequant fused into the GEMM epilogue. Every QuantizationConfig subclass in upstream/vllm/model_executor/layers/quantization/ is this same pair of responsibilities with different formats.
- Symmetric vs asymmetric: you built symmetric (grid centered on 0, no zero-point). Weights are roughly zero-centered so it costs little. Activations post-ReLU/GELU are not zero-centered; asymmetric (scale + zero-point) earns its complexity there. File under "why the zoo exists."
- round() is banker's rounding in numpy (ties to even). Real quantizers vary (round-half-away, stochastic rounding in training contexts); for ties the difference is one grid step on a measure-zero set, but when comparing your output to a reference quantizer bit-for-bit, rounding mode is the first suspect. Conventions, again.
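Two of the notes above are easy to verify numerically: per-output-channel scales factor out of the matmul, and numpy rounds ties to even. This is a minimal sketch; the shapes, the random grid values, and the variable names (`s` for output-row scales, `t` for input-column scales) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
q = rng.integers(-127, 128, size=(8, 16)).astype(np.int8)  # int8 grid values
s = rng.uniform(0.01, 0.1, size=8)    # one scale per output row
t = rng.uniform(0.01, 0.1, size=16)   # one scale per input column

qf = q.astype(np.float64)

# Output-channel scales: dequantize-then-matmul equals matmul-then-scale.
y_inside = x @ (qf * s[:, None]).T
y_outside = (x @ qf.T) * s            # one multiply per output, after the reduction
factors_out = np.allclose(y_inside, y_outside)

# Input-channel scales: t multiplies every term of the reduction, so it can only
# move onto x (inside the accumulation), not onto the finished output.
y_in_scaled = x @ (qf * t[None, :]).T
moves_onto_x = np.allclose(y_in_scaled, (x * t) @ qf.T)

# Rounding modes differ only on exact ties.
ties = np.array([0.5, 1.5, 2.5, -0.5])
ties_to_even = np.rint(ties)                                # 0, 2, 2, -0
half_away = np.sign(ties) * np.floor(np.abs(ties) + 0.5)    # 1, 2, 3, -1
```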
Going further
- Plot relative error vs bit-width by generalizing to levels = 2^b − 1 for b ∈ {8, 6, 4, 3, 2}: the hockey stick at 4 bits is why lab-03 needs groups, and the cliff at 2 is why binary/ternary methods need retraining rather than post-hoc rounding.
- Quantize an actual layer: pull a weight matrix out of a small HF checkpoint (or use mini_vllm's toy model with a fixed seed), quantize per-channel, and measure the output drift on real activations rather than Gaussians. The distributional change is usually invisible; knowing how to check is the skill.
- Implement the integer-arithmetic version: (x_q @ q.T) * (s_x * s_w) with int32 accumulation, and verify it matches your fake-quant within rounding. That's what the tensor cores actually compute, and the moment you see why accumulators must be wider than operands.
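As a starting point for the integer-arithmetic exercise, here is one possible sketch. It assumes per-row (per-token) activation scales for s_x, which the text above does not specify, and `quantize_rows` is a hypothetical helper name.

```python
import numpy as np

def quantize_rows(a):
    """Symmetric int8 quantization with one scale per row (zero rows get scale 1)."""
    scale = np.abs(a).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.clip(np.rint(a / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 64))
x = rng.standard_normal((8, 64))

w_q, s_w = quantize_rows(w)   # s_w: (32, 1), one scale per output channel
x_q, s_x = quantize_rows(x)   # s_x: (8, 1), one scale per token (an assumption)

# Integer path: widen operands to int32, accumulate, then apply both scales once.
acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T   # (8, 32), exact integers
y_int = acc * (s_x * s_w.T)                           # (8, 1) * (1, 32) broadcasts

# Fake-quant path: dequantize both operands, then do a float matmul.
y_fake = (x_q * s_x) @ (w_q * s_w).T
paths_agree = np.allclose(y_int, y_fake)

# Why the accumulator must be wider: a 64-term dot product of int8 values can reach
# 64 * 127 * 127 = 1,032,256, so an int8 @ int8 matmul kept in int8 wraps around.
narrow = x_q @ w_q.T                                  # result dtype stays int8
narrow_is_wrong = not np.array_equal(narrow.astype(np.int32), acc)
```

The two paths differ only by floating-point rounding, because the integer accumulation itself is exact.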
References
- upstream/vllm/model_executor/layers/quantization/fp8.py:316,437 (create_weights / apply): your two halves, in production.
- Jacob et al., Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (2017), the foundational scale/zero-point formulation: https://arxiv.org/abs/1712.05877
- Gholami et al., A Survey of Quantization Methods for Efficient Neural Network Inference (2021) — the map of the zoo: https://arxiv.org/abs/2103.13630
- Phase 0 lab-04 — why fewer weight bytes ≈ proportional decode speedup.
- Labs 03 (int4 + groups) and 04 (activations + smoothing) — the two hard directions from this baseline.