Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 06 — Exercises: Quantization

Contents


Warm-up (explain)

  1. Why does weight quantization speed up decode specifically? Tie it to Phase 0's physics.
  2. Quantize a value to int8: write the scale + round + dequant steps.
  3. per-tensor vs per-channel vs per-group scales — accuracy vs storage tradeoff.

Core (trace the code)

  1. What two methods does every quant format implement (base_config.py:28/:37)?
  2. How does a checkpoint's format become a per-layer method (get_quant_method, :151)?
  3. In Fp8LinearMethod.apply (fp8.py:437), where are scales used and the GEMM called?
  4. Why is FP8 (W8A8) able to speed the matmul while AWQ (weight-only) mainly speeds bandwidth?

Build (your lab)

  1. In lab-01, construct a matrix where per-tensor int8 loses a whole channel. Quantify the error gap vs per-channel.
  2. Extend to int4 per-group (group size 32). Compare error and storage to int8 per-channel.
  3. Add fake activation quantization (W8A8) and show the matmul can run in int8 then rescale.

Design (staff-level)

  1. A customer needs to fit a 70B model on 1×80GB GPU with decent quality. Walk your quant choice (weights? KV? which format?) and the accuracy validation you'd run first.
  2. Throughput improved with FP8 but a downstream eval regressed 2 points. Diagnose: which layers are most sensitive, and what mitigations exist (keep some layers fp16, per-group scales)?
  3. You want to add a new format (say a vendor's INT3). What exactly must you implement in vLLM, and what kernel work does it imply (Phase 7)?

Self-grading

4–7 and 11–13 are interview-grade. Could you draw the config→method→kernel dispatch and name the files? If not, re-read 01-deep-dive.md.