Phase 06 — Exercises: Quantization

Warm-up (explain)

Why does weight quantization speed up decode specifically? Tie it to Phase 0's physics.
Quantize a value to int8: write the scale + round + dequant steps.
per-tensor vs per-channel vs per-group scales — accuracy vs storage tradeoff.

What two methods does every quant format implement (base_config.py:28/:37)?
How does a checkpoint's format become a per-layer method (get_quant_method, :151)?
In Fp8LinearMethod.apply (fp8.py:437), where are scales used and the GEMM called?
Why is FP8 (W8A8) able to speed the matmul while AWQ (weight-only) mainly speeds bandwidth?

In lab-01, construct a matrix where per-tensor int8 loses a whole channel. Quantify the error gap vs per-channel.
Extend to int4 per-group (group size 32). Compare error and storage to int8 per-channel.
Add fake activation quantization (W8A8) and show the matmul can run in int8 then rescale.

A customer needs to fit a 70B model on 1×80GB GPU with decent quality. Walk your quant choice (weights? KV? which format?) and the accuracy validation you'd run first.
Throughput improved with FP8 but a downstream eval regressed 2 points. Diagnose: which layers are most sensitive, and what mitigations exist (keep some layers fp16, per-group scales)?
You want to add a new format (say a vendor's INT3). What exactly must you implement in vLLM, and what kernel work does it imply (Phase 7)?

4–7 and 11–13 are interview-grade. Could you draw the config→method→kernel dispatch and name the files? If not, re-read 01-deep-dive.md.