Phase 06 — Mini-Build: a per-channel int8 fake-quant linear

You'll build the smallest real quantization scheme: store a weight matrix in int8 with per-channel scales, dequantize in the matmul, and measure the two things that matter: memory saved and round-trip error. This mirrors what create_weights and apply do for a real method, minus the GPU kernel.

The task (lab-01)
Why per-channel beats per-tensor (the key insight)
Definition of done
Map to the real engine

The task (lab-01)

Implement, in numpy:

quantize_per_channel(W) → (q_int8, scales) where W is (out, in); one scale per output channel (row). scale[o] = max(abs(W[o])) / 127; q_int8[o] = round(W[o] / scale[o]) clipped to [-127, 127].
dequantize(q_int8, scales) → W_approx (scales[:,None] * q_int8).
quant_linear(x, q_int8, scales) → x @ dequantize(...).T (the "apply" path).
memory_bytes(W) vs memory_bytes_quant(q_int8, scales) to show the saving.
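
The four functions above can be sketched as follows. This is a minimal version, not the lab's reference solution. Two choices are assumptions that the task text does not specify: an all-zero row gets a scale of 1.0 to avoid dividing by zero, and rounding uses numpy's round-half-to-even.

```python
import numpy as np

def quantize_per_channel(W):
    """Symmetric int8 quantization with one scale per output channel (row)."""
    W = np.asarray(W, dtype=np.float32)
    max_abs = np.max(np.abs(W), axis=1)                      # shape (out,)
    # Assumption: an all-zero row gets scale 1.0 so we never divide by zero.
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.rint(W / scales[:, None])                         # round half to even
    q_int8 = np.clip(q, -127, 127).astype(np.int8)
    return q_int8, scales

def dequantize(q_int8, scales):
    """Rebuild an fp32 approximation of W from int8 values and row scales."""
    return scales[:, None] * q_int8.astype(np.float32)

def quant_linear(x, q_int8, scales):
    """The "apply" path: dequantize the weight, then do a plain matmul."""
    return x @ dequantize(q_int8, scales).T

def memory_bytes(W):
    return W.nbytes

def memory_bytes_quant(q_int8, scales):
    # The scales are part of the stored weight, so they count too.
    return q_int8.nbytes + scales.nbytes

# Usage on a small random layer: (out=8, in=16), batch of 4 inputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal((4, 16)).astype(np.float32)
q, s = quantize_per_channel(W)
y = quant_linear(x, q, s)                                    # shape (4, 8)
```

Because scale[o] is max(abs(W[o])) / 127, the largest entry of every non-zero row maps to exactly ±127, so the clip only guards against floating-point edge cases.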

Then in tests:

round-trip error ||W - dequant(quant(W))|| is small relative to ||W|| (with rounding to the nearest integer, each element is off by at most scale[o] / 2),
per-channel beats per-tensor on a matrix with one large-magnitude row (outlier channel),
int8 storage is ~4× smaller than fp32 (1 byte vs 4, plus a few scale floats),
quant_linear(x, ...) ≈ x @ W.T within tolerance.

Why per-channel beats per-tensor (the key insight)

One channel with large weights forces a huge per-tensor scale, crushing the resolution of all the small channels. A per-channel scale gives each row its own dynamic range. You'll measure this — it's the reason real methods are at least per-channel, and 4-bit methods go per-group.
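
A small experiment makes the effect visible. This is a sketch under assumed numbers: the matrix shape, the 0.01 weight magnitude, and the 1000× outlier factor are arbitrary choices for illustration. It measures the round-trip error on the small rows only, once with a single shared scale and once with one scale per row.

```python
import numpy as np

def fake_quant(W, scales):
    """Quantize to the int8 range with the given scales, then dequantize."""
    q = np.clip(np.rint(W / scales), -127, 127)
    return (q * scales).astype(np.float32)

rng = np.random.default_rng(0)
W = (rng.standard_normal((8, 64)) * 0.01).astype(np.float32)
W[0] *= 1000.0                      # row 0 is the outlier channel

# Per-tensor: one scalar scale, dominated by the outlier row.
per_tensor_scale = np.abs(W).max() / 127.0
# Per-channel: one scale per row, shape (out, 1) so it broadcasts over columns.
per_channel_scale = np.abs(W).max(axis=1, keepdims=True) / 127.0

def small_rows_rel_error(W_hat):
    """Relative error restricted to the non-outlier rows (rows 1 and up)."""
    diff = W[1:] - W_hat[1:]
    return np.linalg.norm(diff) / np.linalg.norm(W[1:])

err_tensor = small_rows_rel_error(fake_quant(W, per_tensor_scale))
err_channel = small_rows_rel_error(fake_quant(W, per_channel_scale))
print(f"per-tensor  error on small rows: {err_tensor:.4f}")
print(f"per-channel error on small rows: {err_channel:.4f}")
```

With the shared scale, one quantization step is sized for the outlier row, so most of the small weights fall below half a step and round to zero. The per-channel version keeps the full 127 levels for every row.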

Definition of done

pytest phase-06-quantization/labs -q

Map to the real engine

| Your numpy | Real vLLM |
| --- | --- |
| `quantize_per_channel` (offline) | how a checkpoint was quantized (GPTQ/AWQ/ModelOpt) |
| `create_weights` (store q + scales) | `Fp8LinearMethod.create_weights` (`fp8.py:316`) |
| `quant_linear` (dequant + matmul) | `LinearMethodBase.apply` (`fp8.py:437`) → a GEMM kernel (Phase 7) |
| per-channel vs per-tensor | per-tensor/channel/group scale choices in real configs |
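
To make the create_weights / apply split concrete, here is a toy numpy class with the same two-phase shape. Everything in it is hypothetical: the class name `ToyInt8LinearMethod`, the `load` helper, the dict used as a "layer", and all signatures are invented for illustration and are not vLLM's API.

```python
import numpy as np

class ToyInt8LinearMethod:
    """Hypothetical numpy stand-in; NOT vLLM's class or signatures."""

    def create_weights(self, layer, out_features, in_features):
        # Phase 1: allocate storage for the int8 weight and its per-row scales.
        layer["q_weight"] = np.zeros((out_features, in_features), dtype=np.int8)
        layer["scales"] = np.ones(out_features, dtype=np.float32)

    def load(self, layer, q_int8, scales):
        # Stand-in for checkpoint loading: copy offline-quantized tensors in.
        layer["q_weight"][...] = q_int8
        layer["scales"][...] = scales

    def apply(self, layer, x):
        # Phase 2: dequantize, then matmul. The table above maps this step
        # to a GEMM kernel in the real engine.
        W_approx = layer["scales"][:, None] * layer["q_weight"].astype(np.float32)
        return x @ W_approx.T

# Usage with a hand-written 2x3 quantized weight.
method = ToyInt8LinearMethod()
layer = {}
method.create_weights(layer, out_features=2, in_features=3)
method.load(
    layer,
    q_int8=np.array([[127, 0, -127], [64, -64, 0]], dtype=np.int8),
    scales=np.array([0.01, 0.02], dtype=np.float32),
)
y = method.apply(layer, np.array([[1.0, 2.0, 3.0]], dtype=np.float32))
```

The point of the split is that storage layout is decided once, at layer construction, while apply runs on every forward pass and only reads what create_weights set up.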
