Phase 06 — Cheatsheet: Quantization
Contents
- The one-liner
- Two axes
- Scale granularity
- Dispatch (model is format-agnostic)
- GPTQ vs AWQ
- FP8 KV cache
- Key upstream
The one-liner
Fewer weight bits → fewer HBM bytes read per decode step → faster decode + more room for KV cache. Store q = round(W/s) +
scale s; finer scale granularity → better accuracy. The checkpoint format must match what the GEMM kernel expects.
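The round(W/s) + scale recipe can be sketched in a few lines of plain Python. This is a minimal illustration assuming symmetric per-tensor quantization; the helper names (`quantize`, `dequantize`, `max_error`) are made up for this sketch and are not vLLM APIs.

```python
import random

def quantize(weights, bits):
    """Symmetric per-tensor quantization: q = round(w / s), one scale s."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    s = max(abs(w) for w in weights) / qmax
    q = [max(-qmax, min(qmax, round(w / s))) for w in weights]
    return q, s

def dequantize(q, s):
    """Recover approximate weights: w_hat = q * s."""
    return [qi * s for qi in q]

def max_error(weights, bits):
    """Largest absolute round-trip error, plus the scale that was used."""
    q, s = quantize(weights, bits)
    err = max(abs(w - d) for w, d in zip(weights, dequantize(q, s)))
    return err, s

random.seed(0)
W = [random.gauss(0.0, 1.0) for _ in range(4096)]   # toy weight tensor
err8, s8 = max_error(W, bits=8)
err4, s4 = max_error(W, bits=4)
print(f"INT8: scale={s8:.5f} max_err={err8:.5f}")
print(f"INT4: scale={s4:.5f} max_err={err4:.5f}")
```

The round-trip error per weight is bounded by s/2, so fewer bits (larger s) means larger error. That is why the scale, and how many weights share it, matters.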
Two axes
- What: weight-only (GPTQ/AWQ, 4-bit) = bandwidth/memory; weight+activation (FP8/INT8 W8A8) = also faster matmul (low-precision tensor cores).
- Format: FP8 (E4M3/E5M2), INT8/INT4, MXFP4/NVFP4, GPTQ, AWQ, GGUF, compressed-tensors, ModelOpt, TorchAO.
Scale granularity
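Accuracy: per-tensor (1 scale for the whole tensor, worst) < per-channel (1 scale per row) < per-group (1 scale per 128 weights, best for 4-bit). Finer granularity costs more scale storage.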
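To see the ordering concretely, here is a toy comparison of the three granularities on a synthetic 4-bit case. It is a sketch under assumptions: the weight matrix is random, with row and group magnitudes chosen to differ on purpose, and all function names are hypothetical.

```python
import random

QMAX = 7  # symmetric 4-bit integers in [-7, 7]

def fake_quant(values):
    """Quantize then dequantize a list of floats that share one scale."""
    s = max(abs(v) for v in values) / QMAX
    if s == 0.0:
        return list(values)
    return [max(-QMAX, min(QMAX, round(v / s))) * s for v in values]

def per_tensor(W):
    """One scale for the whole matrix."""
    flat = fake_quant([v for row in W for v in row])
    n = len(W[0])
    return [flat[i:i + n] for i in range(0, len(flat), n)]

def per_channel(W):
    """One scale per row."""
    return [fake_quant(row) for row in W]

def per_group(W, group_size=128):
    """One scale per group of `group_size` consecutive weights in a row."""
    return [
        [v for i in range(0, len(row), group_size)
           for v in fake_quant(row[i:i + group_size])]
        for row in W
    ]

def mse(W, W_hat):
    diffs = [(a - b) ** 2 for ra, rb in zip(W, W_hat) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

random.seed(0)
row_std = [0.01, 0.04, 0.16, 0.64]   # rows differ in magnitude
group_std = [1.0, 0.1]               # so do the two 128-wide groups in a row
W = [[random.gauss(0.0, rs * group_std[c // 128]) for c in range(256)]
     for rs in row_std]

mse_tensor = mse(W, per_tensor(W))
mse_channel = mse(W, per_channel(W))
mse_group = mse(W, per_group(W))
print(f"per-tensor  MSE: {mse_tensor:.6f}")
print(f"per-channel MSE: {mse_channel:.6f}")
print(f"per-group   MSE: {mse_group:.6f}")
```

A shared scale is set by the largest weight it covers, so small-magnitude rows or groups get rounded coarsely (often to zero) whenever they share a scale with large ones.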
Dispatch (model is format-agnostic)
QuantizationConfig (built from the checkpoint) → get_quant_method(layer) → LinearMethodBase:
create_weights (allocate integer weights + scales) + apply (dequantize, or run the quantized matmul → GEMM kernel, Phase 7).
Linear.forward just calls self.quant_method.apply(self, x, bias); the layer never inspects the format.
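The dispatch pattern can be sketched as a toy in plain Python. The class and method names mirror the ones above, but the signatures are simplified stand-ins rather than vLLM's real ones. In particular, this sketch quantizes inside create_weights, whereas a real loader would fill pre-allocated parameters from the checkpoint. `Int8WeightOnlyMethod` and `UnquantizedMethod` are hypothetical names.

```python
class LinearMethodBase:
    def create_weights(self, layer, weight):
        raise NotImplementedError

    def apply(self, layer, x):
        raise NotImplementedError

class UnquantizedMethod(LinearMethodBase):
    def create_weights(self, layer, weight):
        layer.weight = [row[:] for row in weight]

    def apply(self, layer, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in layer.weight]

class Int8WeightOnlyMethod(LinearMethodBase):
    def create_weights(self, layer, weight):
        # Per-channel symmetric INT8: one scale per output row.
        layer.scales = [(max(abs(w) for w in row) / 127) or 1.0 for row in weight]
        layer.qweight = [[round(w / s) for w in row]
                         for row, s in zip(weight, layer.scales)]

    def apply(self, layer, x):
        # Integer dot product, then rescale each output once.
        return [s * sum(q * xi for q, xi in zip(row, x))
                for row, s in zip(layer.qweight, layer.scales)]

class QuantizationConfig:
    def __init__(self, name):
        self.name = name          # would come from the checkpoint's config

    def get_quant_method(self, layer):
        return Int8WeightOnlyMethod() if self.name == "int8" else UnquantizedMethod()

class Linear:
    """Format-agnostic layer: it only talks to its quant_method."""
    def __init__(self, weight, quant_config):
        self.quant_method = quant_config.get_quant_method(self)
        self.quant_method.create_weights(self, weight)

    def forward(self, x):
        return self.quant_method.apply(self, x)

weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
x = [1.0, 2.0, 3.0]
y_ref = Linear(weight, QuantizationConfig("none")).forward(x)
y_int8 = Linear(weight, QuantizationConfig("int8")).forward(x)
print(y_ref, y_int8)
```

Adding a new format in this design means adding one config and one method class; the Linear class and the model code stay untouched.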
GPTQ vs AWQ
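Both are post-training, weight-only, 4-bit methods that need calibration data. GPTQ: rounds weights while minimizing layer output error using Hessian (second-order) information. AWQ: scales up activation-salient weight channels before rounding, and compensates on the activation side. Both run fast via Marlin kernels.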
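The AWQ idea of scaling a salient channel before rounding can be shown on a two-weight toy. The values below are hand-picked so the effect is visible, and the scale factor s is fixed by hand here, whereas AWQ derives it from calibration activations. The gain is not guaranteed for arbitrary weights.

```python
QMAX = 7  # symmetric 4-bit

def fake_quant_row(row):
    """Quantize and dequantize one weight row with a single shared scale."""
    s = max(abs(w) for w in row) / QMAX
    return [round(w / s) * s for w in row]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

w = [0.35, 1.0]     # one output row, two input channels
x = [10.0, 0.1]     # channel 0 is salient: its activation is large
exact = dot(w, x)

# Plain round-to-nearest: channel 0's rounding error is amplified by x[0].
err_plain = abs(dot(fake_quant_row(w), x) - exact)

# AWQ-style: scale the salient weight up and its activation down by the
# same factor. The product is unchanged in full precision.
s = [2.0, 1.0]      # hand-picked for this sketch
w_scaled = [wi * si for wi, si in zip(w, s)]
x_scaled = [xi / si for xi, si in zip(x, s)]
err_awq = abs(dot(fake_quant_row(w_scaled), x_scaled) - exact)

print(f"plain error: {err_plain:.4f}")
print(f"AWQ-style error: {err_awq:.4f}")
```

Scaling works here because the row's maximum (and so its quantization step) is unchanged, while the salient weight now sits on a grid that is effectively twice as fine.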
FP8 KV cache
Separate axis (kv_cache_dtype="fp8"): halves KV bytes vs FP16/BF16 → ~2× tokens in the same KV budget (≈2× concurrency). Combines with any weight quantization.
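The "halves KV bytes" arithmetic, worked through for a hypothetical model shape (32 layers, 8 KV heads, head dim 128) and a hypothetical 16 GiB KV budget. These numbers are assumptions for illustration, and the sketch ignores the small overhead of FP8 scaling factors.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies; the factor 2 covers K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)  # hypothetical model

fp16_bytes = kv_bytes_per_token(**cfg, bytes_per_elem=2)
fp8_bytes = kv_bytes_per_token(**cfg, bytes_per_elem=1)

kv_budget = 16 * 1024**3                 # hypothetical 16 GiB left for KV
tokens_fp16 = kv_budget // fp16_bytes
tokens_fp8 = kv_budget // fp8_bytes

print(f"FP16 KV: {fp16_bytes // 1024} KiB/token, {tokens_fp16} tokens fit")
print(f"FP8  KV: {fp8_bytes // 1024} KiB/token, {tokens_fp8} tokens fit")
```

With these assumptions the cache goes from 128 KiB to 64 KiB per token, so the same budget holds twice as many tokens, whatever format the weights use.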
Key upstream
- quantization/base_config.py: :19 QuantizeMethodBase · :28 create_weights · :37 apply · :70 Config · :151 get_quant_method
- quantization/fp8.py: :100 Fp8Config · :261 Fp8LinearMethod · :316 create_weights · :437 apply
- quantization/__init__.py: registry · quantization/awq.py · compressed_tensors/
- layers/linear.py: :182 Unquantized · :231 LinearBase · :410 ColumnParallel · :1392 RowParallel
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md