Phase 06 — Mini-Build: a per-channel int8 fake-quant linear
You'll build the smallest real quantization: store a weight matrix in int8 with per-channel
scales, dequantize in the matmul, and measure the two things that matter — memory saved and
round-trip error. This is exactly what create_weights + apply do for a real method, minus
the GPU kernel.
Contents
- The task (lab-01)
- Why per-channel beats per-tensor (the key insight)
- Definition of done
- Map to the real engine
The task (lab-01)
Implement, in numpy:
- quantize_per_channel(W) → (q_int8, scales), where W is (out, in); one scale per output channel (row). scale[o] = max(abs(W[o])) / 127; q_int8[o] = round(W[o] / scale[o]), clipped to [-127, 127].
- dequantize(q_int8, scales) → W_approx (scales[:, None] * q_int8).
- quant_linear(x, q_int8, scales) → x @ dequantize(...).T (the "apply" path).
- memory_bytes(W) vs memory_bytes_quant(q_int8, scales) to show the saving.
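The four functions above can be sketched as follows. This is one possible shape for the lab, not a reference solution: the float32 casts, the handling of all-zero rows (scale set to 1.0), and the demo shapes at the bottom are assumptions not stated in the task.

```python
import numpy as np

def quantize_per_channel(W):
    """Symmetric int8 quantization with one scale per output row of W (out, in)."""
    W = np.asarray(W, dtype=np.float32)
    max_abs = np.max(np.abs(W), axis=1)
    # Assumption: an all-zero row gets scale 1.0 so we never divide by zero.
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.round(W / scales[:, None]), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q_int8, scales):
    """Rebuild an fp32 approximation: each row is its int8 codes times its scale."""
    return scales[:, None] * q_int8.astype(np.float32)

def quant_linear(x, q_int8, scales):
    """The "apply" path: dequantize, then an ordinary matmul."""
    return x @ dequantize(q_int8, scales).T

def memory_bytes(W):
    return np.asarray(W).nbytes

def memory_bytes_quant(q_int8, scales):
    # int8 codes plus one fp32 scale per output channel
    return q_int8.nbytes + scales.nbytes

# Small demo with arbitrary shapes.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float32)
x = rng.standard_normal((4, 128)).astype(np.float32)
q, scales = quantize_per_channel(W)

print(memory_bytes(W))                # 64 * 128 * 4 = 32768
print(memory_bytes_quant(q, scales))  # 64 * 128 + 64 * 4 = 8448
print(np.abs(quant_linear(x, q, scales) - x @ W.T).max())
```

For this 64 × 128 matrix the saving is about 3.9× rather than a full 4×, because each row also stores its fp32 scale.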
Then in tests:
- round-trip error ||W - dequant(quant(W))|| is small relative to ||W||,
- per-channel beats per-tensor on a matrix with one large-magnitude row (outlier channel),
- int8 storage is ~4× smaller than fp32 (1 byte vs 4, plus a few scale floats),
- quant_linear(x, ...) ≈ x @ W.T within tolerance.
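To choose tolerances for these checks, it helps to know how large the error can get. Round-to-nearest leaves each weight within half a quantization step (scale[o] / 2) of its original value, which in turn bounds the error of each output of the linear layer. The sketch below inlines the quantization so it runs on its own; the shapes, the seed, and the assumption that no row is all zeros are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64)).astype(np.float32)
x = rng.standard_normal((8, 64)).astype(np.float32)

# Per-channel symmetric int8 quantization, inlined (rows assumed non-zero).
scales = np.max(np.abs(W), axis=1) / 127.0
q = np.clip(np.round(W / scales[:, None]), -127, 127).astype(np.int8)
W_approx = scales[:, None] * q.astype(np.float32)

# Each weight lands within half a quantization step of its original value.
elem_err = np.abs(W - W_approx)
elem_bound = scales[:, None] / 2

# Relative Frobenius-norm error: the "small relative to ||W||" check.
rel_err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)

# Output error for channel o is at most sum_i |x_i| * scale[o] / 2.
y_ref = x @ W.T
y_q = x @ W_approx.T
out_bound = np.abs(x).sum(axis=1)[:, None] * scales[None, :] / 2

print("max element error:", elem_err.max(), "bound:", elem_bound.max())
print("relative round-trip error:", rel_err)
print("max output error:", np.abs(y_q - y_ref).max(), "bound:", out_bound.max())
```

The output bound is a worst case; on random data the observed error is usually far below it, so a test can use a tighter empirical tolerance as long as it is chosen with some margin.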
Why per-channel beats per-tensor (the key insight)
One channel with large weights forces a huge per-tensor scale, crushing the resolution of all the small channels. A per-channel scale gives each row its own dynamic range. You'll measure this — it's the reason real methods are at least per-channel, and 4-bit methods go per-group.
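The effect is easy to reproduce. The sketch below builds a matrix with one outlier row, round-trips it through int8 with a single per-tensor scale and with per-row scales, and compares the error on the small rows only. The helper roundtrip, the 100× outlier factor, and the shapes are hypothetical choices for illustration.

```python
import numpy as np

def roundtrip(W, scales):
    """Quantize to int8 with broadcastable scales, then dequantize back to fp32."""
    q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)
    return scales * q.astype(np.float32)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64)).astype(np.float32)
W[0] *= 100.0  # row 0 becomes the outlier channel

# One scale for the whole matrix vs. one scale per row, shape (16, 1).
per_tensor_scale = np.float32(np.max(np.abs(W)) / 127.0)
per_channel_scales = np.max(np.abs(W), axis=1, keepdims=True) / 127.0

err_tensor = np.abs(W - roundtrip(W, per_tensor_scale))
err_channel = np.abs(W - roundtrip(W, per_channel_scales))

# Compare on the 15 small rows, where the per-tensor scale is far too coarse.
small_tensor = err_tensor[1:].mean()
small_channel = err_channel[1:].mean()
print("per-tensor  mean error on small rows:", small_tensor)
print("per-channel mean error on small rows:", small_channel)
```

With the per-tensor scale set by the outlier row, the small rows collapse onto only a handful of int8 levels, while the per-channel version still spreads each row across the full [-127, 127] range.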
Definition of done
pytest phase-06-quantization/labs -q
Map to the real engine
| your numpy | real vLLM |
|---|---|
| quantize_per_channel (offline) | how a checkpoint was quantized (GPTQ/AWQ/ModelOpt) |
| create_weights (store q + scales) | Fp8LinearMethod.create_weights (fp8.py:316) |
| quant_linear (dequant + matmul) | LinearMethodBase.apply (fp8.py:437) → a GEMM kernel (Phase 7) |
| per-channel vs per-tensor | per-tensor/channel/group scale choices in real configs |