Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 06 Labs — Quantization

Four labs that turn the format zoo into one mental model: a grid, a scale, and three questions (what grid? what scale granularity? weights only, or activations too?). The arc: build the primitive — int8, per-channel (lab-01); descend to int4, where groups and packing become survival gear (lab-03); cross to activations, where outliers break naive W8A8 and SmoothQuant's migration fixes it (lab-04); then measure what the families actually buy on real hardware (lab-02).

Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04: the primitive, its two hard directions, then the measurement.) CPU labs follow the standard contract — starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-06-quantization/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-06-quantization/labs/lab-01-fake-quant-linear -q

Contents


Labs

lab-01-fake-quant-linear [CPU-OK]

The primitive: symmetric int8 quantize/dequantize/matmul in ~20 lines, with the measurement that matters (<1% error, ~4× memory) and the design argument that drives the whole field — per-channel scales shrugging off the outlier row that wrecks a per-tensor scale. Maps function-for-function onto Fp8LinearMethod.create_weights/apply. Skills: scale/grid/granularity as the three questions; guard zero scales, clip after rounding; fake-quant as the error-isolation tool.

lab-02-quantize-and-eval [GPU-OPT]

fp16 vs FP8 (W8A8) vs AWQ-4bit (W4A16) on real vLLM, three meters per run: throughput, # GPU blocks, output sanity. The punchline: FP8 wins throughput (tensor cores + fewer bytes), AWQ wins KV capacity (smallest weights), neither dominates — they attack different terms. Captured, annotated numbers included. Skills: predicting the meter ordering from the cost model; weight-only vs weight+activation as a decision; honest quality verification vs eyeballing.

lab-03-int4-groups-and-packing [CPU-OK]

Int4's 15 levels force two mechanisms you'll build both of: group-wise scales (the group_size=128 on every GPTQ/AWQ model card — coverage windows that track local magnitude) and nibble packing (two int4 per byte, the literal checkpoint layout). The tests measure the fine-groups-vs-scale-overhead trade and pin the ~8× memory ratio. Skills: reading quantized checkpoint shapes; why ±7 not ±8; packing conventions as a bug class; where dequant really happens (in registers, fused).

lab-04-activation-outliers-smoothquant [CPU-OK]

Reproduce the famous cliff: a few 80×-loud activation channels (the documented LLM pathology) wreck per-tensor W8A8 — then implement SmoothQuant's fix, an exact reparametrization that migrates magnitude into the weights where per-channel scales neutralize it. Error drops >3× on the outlier setup; a control arm proves the transform is inert on healthy tensors. Skills: the outlier phenomenon; reparametrize-the- difficulty as a design move; why W8A8 needed FP8 + smoothing to become the default.

What you can do after this phase

Read any quantization= config or model-card string (W4A16, group_size=128, sym) as arithmetic you can verify; choose between weight-only and W8A8 from your deployment's binding constraint rather than fashion; predict the memory, throughput, and concurrency effects of a format before loading it; and recognize, in vllm/model_executor/layers/quantization/, every scheme as the lab-01 dance with different answers. Phase 7 goes below: the GEMM kernels that consume these formats.