Phase 06 Labs — Quantization
Four labs that turn the format zoo into one mental model: a grid, a scale, and three questions (what grid? what scale granularity? weights only, or activations too?). The arc: build the primitive — int8, per-channel (lab-01); descend to int4, where groups and packing become survival gear (lab-03); cross to activations, where outliers break naive W8A8 and SmoothQuant's migration fixes it (lab-04); then measure what the families actually buy on real hardware (lab-02).
Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04: the
primitive, its two hard directions, then the measurement.) CPU labs follow the standard
contract — starter.py (your work), solution.py (reference), test_lab.py (the spec);
default runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-06-quantization/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-06-quantization/labs/lab-01-fake-quant-linear -q
Contents
- lab-01-fake-quant-linear
[CPU-OK] - lab-02-quantize-and-eval
[GPU-OPT] - lab-03-int4-groups-and-packing
[CPU-OK] - lab-04-activation-outliers-smoothquant
[CPU-OK] - What you can do after this phase
Labs
lab-01-fake-quant-linear [CPU-OK]
The primitive: symmetric int8 quantize/dequantize/matmul in ~20 lines, with the
measurement that matters (<1% error, ~4× memory) and the design argument that drives the
whole field — per-channel scales shrugging off the outlier row that wrecks a per-tensor
scale. Maps function-for-function onto Fp8LinearMethod.create_weights/apply.
Skills: scale/grid/granularity as the three questions; guard zero scales, clip after
rounding; fake-quant as the error-isolation tool.
lab-02-quantize-and-eval [GPU-OPT]
fp16 vs FP8 (W8A8) vs AWQ-4bit (W4A16) on real vLLM, three meters per run: throughput,
# GPU blocks, output sanity. The punchline: FP8 wins throughput (tensor cores + fewer
bytes), AWQ wins KV capacity (smallest weights), neither dominates — they attack
different terms. Captured, annotated numbers included. Skills: predicting the meter
ordering from the cost model; weight-only vs weight+activation as a decision; honest
quality verification vs eyeballing.
lab-03-int4-groups-and-packing [CPU-OK]
Int4's 15 levels force two mechanisms you'll build both of: group-wise scales (the
group_size=128 on every GPTQ/AWQ model card — coverage windows that track local
magnitude) and nibble packing (two int4 per byte, the literal checkpoint layout). The
tests measure the fine-groups-vs-scale-overhead trade and pin the ~8× memory ratio.
Skills: reading quantized checkpoint shapes; why ±7 not ±8; packing conventions as a
bug class; where dequant really happens (in registers, fused).
lab-04-activation-outliers-smoothquant [CPU-OK]
Reproduce the famous cliff: a few 80×-loud activation channels (the documented LLM pathology) wreck per-tensor W8A8 — then implement SmoothQuant's fix, an exact reparametrization that migrates magnitude into the weights where per-channel scales neutralize it. Error drops >3× on the outlier setup; a control arm proves the transform is inert on healthy tensors. Skills: the outlier phenomenon; reparametrize-the- difficulty as a design move; why W8A8 needed FP8 + smoothing to become the default.
What you can do after this phase
Read any quantization= config or model-card string (W4A16, group_size=128, sym) as
arithmetic you can verify; choose between weight-only and W8A8 from your deployment's
binding constraint rather than fashion; predict the memory, throughput, and concurrency
effects of a format before loading it; and recognize, in
vllm/model_executor/layers/quantization/, every scheme as the lab-01 dance with
different answers. Phase 7 goes below: the GEMM kernels that consume these formats.