Phase 06 — Exercises: Quantization
Contents
Warm-up (explain)
- Why does weight quantization speed up decode specifically? Tie it to Phase 0's physics.
- Quantize a value to int8: write the scale + round + dequant steps.
- per-tensor vs per-channel vs per-group scales — accuracy vs storage tradeoff.
Core (trace the code)
- What two methods does every quant format implement (
base_config.py:28/:37)? - How does a checkpoint's format become a per-layer method (
get_quant_method,:151)? - In
Fp8LinearMethod.apply(fp8.py:437), where are scales used and the GEMM called? - Why is FP8 (W8A8) able to speed the matmul while AWQ (weight-only) mainly speeds bandwidth?
Build (your lab)
- In lab-01, construct a matrix where per-tensor int8 loses a whole channel. Quantify the error gap vs per-channel.
- Extend to int4 per-group (group size 32). Compare error and storage to int8 per-channel.
- Add fake activation quantization (W8A8) and show the matmul can run in int8 then rescale.
Design (staff-level)
- A customer needs to fit a 70B model on 1×80GB GPU with decent quality. Walk your quant choice (weights? KV? which format?) and the accuracy validation you'd run first.
- Throughput improved with FP8 but a downstream eval regressed 2 points. Diagnose: which layers are most sensitive, and what mitigations exist (keep some layers fp16, per-group scales)?
- You want to add a new format (say a vendor's INT3). What exactly must you implement in vLLM, and what kernel work does it imply (Phase 7)?
Self-grading
4–7 and 11–13 are interview-grade. Could you draw the config→method→kernel dispatch and name the files? If not, re-read 01-deep-dive.md.