Phase 06 — Cheatsheet: Quantization

The one-liner
Two axes
Scale granularity
Dispatch (model is format-agnostic)
GPTQ vs AWQ
FP8 KV cache
Key upstream

The one-liner

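Fewer weight bits → fewer HBM bytes read per decode step → faster (bandwidth-bound) decode + more room for KV. Store q = round(W/s) plus the scale s and reconstruct W ≈ q·s; accuracy improves as scale granularity gets finer. The checkpoint format must match what the GEMM kernel expects.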
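A minimal sketch of the round(W/s) round trip and the memory arithmetic, in plain Python. It assumes symmetric quantization with a single shared scale; the helper names (quantize, dequantize, weight_gib) are illustrative and not part of any library.

```python
def quantize(weights, bits):
    """Symmetric quantization with one shared scale: returns (codes, s)."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit, 127 for 8-bit
    s = max(abs(w) for w in weights) / qmax or 1.0    # avoid s == 0 for all-zero input
    codes = [max(-qmax, min(qmax, round(w / s))) for w in weights]
    return codes, s


def dequantize(codes, s):
    """Reconstruct approximate weights: W_hat = q * s."""
    return [q * s for q in codes]


def weight_gib(num_params, bits):
    """Weight bytes of a dense model in GiB (scale storage ignored)."""
    return num_params * bits / 8 / 2**30


w = [0.12, -0.50, 0.33, 0.07]
codes, s = quantize(w, bits=4)
w_hat = dequantize(codes, s)

print(codes)                                      # integer codes in [-7, 7]
print(max(abs(a - b) for a, b in zip(w, w_hat)))  # rounding error, at most s / 2
print(weight_gib(7e9, 16), weight_gib(7e9, 4))    # 4x fewer bytes at 4-bit
```

The per-weight error is bounded by s/2, which is why a smaller scale (finer granularity) means better accuracy.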

Two axes

What: weight-only (GPTQ/AWQ, 4-bit) = bandwidth/memory; weight+activation (FP8/INT8 W8A8) = also faster matmul (low-precision tensor cores).
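The difference between the two options can be sketched for a single dot product. This is plain Python under simplifying assumptions (INT8, one scale per vector, hypothetical helper names); real kernels do this on tensor cores, but the structure is the same: weight-only dequantizes and multiplies in float, W8A8 multiplies integers and rescales once.

```python
def quantize(values, qmax=127):
    """Symmetric INT8 quantization with one scale; returns (codes, scale)."""
    s = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / s))) for v in values], s


def dot_weight_only(x, w_codes, w_scale):
    """Weight-only: dequantize weights to float, then do a float dot product."""
    return sum(xi * (q * w_scale) for xi, q in zip(x, w_codes))


def dot_w8a8(x, w_codes, w_scale):
    """W8A8: quantize activations too, accumulate in integers, rescale once."""
    x_codes, x_scale = quantize(x)
    acc = sum(qx * qw for qx, qw in zip(x_codes, w_codes))   # pure integer math
    return acc * x_scale * w_scale


x = [1.0, -2.0, 0.5]
w = [0.2, 0.1, -0.4]
w_codes, w_scale = quantize(w)

y_ref = sum(xi * wi for xi, wi in zip(x, w))     # full-precision reference
y_wo = dot_weight_only(x, w_codes, w_scale)
y_w8a8 = dot_w8a8(x, w_codes, w_scale)
print(y_ref, y_wo, y_w8a8)
```

Weight-only saves bytes but the multiply-accumulate still runs in high precision; W8A8 moves the multiply-accumulate itself to low precision, at the cost of also rounding the activations.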
Format: FP8(E4M3/E5M2), INT8/INT4, MXFP4/NVFP4, GPTQ, AWQ, GGUF, compressed-tensors, ModelOpt, TorchAO.

Scale granularity

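Accuracy ordering: per-tensor (1 scale for the whole tensor, coarsest) < per-channel (1 scale per output channel, i.e. per row of W) < per-group (1 scale per group of 128 weights, finest; the usual choice for 4-bit).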
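To see why finer scales help, the sketch below round-trips a tiny weight matrix at 4 bits under each granularity and compares the mean absolute error. The matrix is hand-made so that rows and groups differ a lot in magnitude, and the group size is 2 instead of 128 purely to keep the example small; both are assumptions for illustration.

```python
def fake_quant(values, bits=4):
    """Quantize then dequantize a list of values that shares ONE scale."""
    qmax = 2 ** (bits - 1) - 1
    s = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / s))) * s for v in values]


def blocks(W, mode, group_size=2):
    """Split W (a list of rows) into lists of values that share one scale."""
    if mode == "per-tensor":
        return [[v for row in W for v in row]]
    if mode == "per-channel":
        return [list(row) for row in W]
    if mode == "per-group":
        return [row[i:i + group_size]
                for row in W for i in range(0, len(row), group_size)]
    raise ValueError(mode)


def mean_abs_error(W, mode, bits=4, group_size=2):
    errs = []
    for block in blocks(W, mode, group_size):
        errs += [abs(v - v_hat) for v, v_hat in zip(block, fake_quant(block, bits))]
    return sum(errs) / len(errs)


W = [[6.3, -2.2, 0.05, -0.03],
     [0.5, -0.3, 0.004, 0.0025]]

err_tensor = mean_abs_error(W, "per-tensor")
err_channel = mean_abs_error(W, "per-channel")
err_group = mean_abs_error(W, "per-group")
print(err_tensor, err_channel, err_group)   # error shrinks as scales get finer
```

With one scale for everything, the largest weight sets the step size and small weights round to zero. Each finer level lets small-magnitude rows or groups use their own, smaller step. The price is more scales to store and read.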

Dispatch (model is format-agnostic)

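QuantizationConfig (built from the checkpoint) → get_quant_method(layer, prefix) → LinearMethodBase: create_weights (allocate quantized weights + scales) + apply (quantize/dequantize as needed, then call the GEMM kernel, Phase 7). The Linear layer just calls self.quant_method.apply(self, x, bias), so model code never branches on format.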
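The dispatch pattern can be sketched with toy classes. Everything below is a simplified stand-in: the class names (ToyQuantConfig, ToyLinear, and so on) are hypothetical, the signatures are reduced, and weights are passed in directly instead of being allocated first and loaded from a checkpoint later. It only mirrors the shape of the design: the layer asks the config for a method object and delegates to it.

```python
class UnquantizedMethod:
    def create_weights(self, layer, weight):
        layer.weight = [list(row) for row in weight]

    def apply(self, layer, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in layer.weight]


class Int8WeightOnlyMethod:
    def create_weights(self, layer, weight):
        # Store integer codes plus one scale per output row (per-channel).
        layer.qweight, layer.scales = [], []
        for row in weight:
            s = max(abs(w) for w in row) / 127 or 1.0
            layer.scales.append(s)
            layer.qweight.append([round(w / s) for w in row])

    def apply(self, layer, x):
        return [s * sum(q * xi for q, xi in zip(row, x))
                for row, s in zip(layer.qweight, layer.scales)]


class ToyQuantConfig:
    def __init__(self, fmt):
        self.fmt = fmt              # would come from the checkpoint's config

    def get_quant_method(self, layer):
        return Int8WeightOnlyMethod() if self.fmt == "int8" else UnquantizedMethod()


class ToyLinear:
    """Format-agnostic layer: it never inspects the quantization format."""

    def __init__(self, weight, quant_config):
        self.quant_method = quant_config.get_quant_method(self)
        self.quant_method.create_weights(self, weight)

    def forward(self, x):
        return self.quant_method.apply(self, x)


W = [[0.5, -1.0], [2.0, 0.25]]
x = [1.0, 2.0]
y_fp = ToyLinear(W, ToyQuantConfig("none")).forward(x)
y_q = ToyLinear(W, ToyQuantConfig("int8")).forward(x)
print(y_fp, y_q)
```

Adding a new format in this design means adding a config and a method class; ToyLinear stays unchanged.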

GPTQ vs AWQ

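Both are post-training, weight-only methods (typically 4-bit) that need a small calibration set. GPTQ: quantizes weights column by column and uses Hessian (second-order) information to adjust the remaining weights so the layer's output error is minimized. AWQ: scales up the salient weight channels (those that see large activations) before rounding, so they lose less precision. Both run fast via Marlin kernels.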
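The scaling idea behind AWQ can be illustrated on a two-element toy row. This is only a sketch of the principle, not the AWQ algorithm: the scale is picked by hand here, whereas AWQ searches for it from calibration activations, and the numbers are chosen so the effect is visible. Multiplying a weight channel by s and dividing its activation by s leaves the exact product unchanged, but changes how the weight rounds.

```python
def fake_quant_row(row, bits=4):
    """Round-trip one weight row through symmetric quantization (one scale)."""
    qmax = 2 ** (bits - 1) - 1
    s = max(abs(w) for w in row) / qmax or 1.0
    return [max(-qmax, min(qmax, round(w / s))) * s for w in row]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


w = [0.1, 0.9]        # one output row, two input channels
x = [10.0, 0.1]       # channel 0 sees large activations, so it is "salient"
scale = [4.0, 1.0]    # hand-picked per-input-channel scale (assumption)

y_ref = dot(w, x)

# Plain rounding: the small but salient weight 0.1 is rounded coarsely.
y_plain = dot(fake_quant_row(w), x)

# Scale weights up and activations down by the same factor, then round.
w_scaled = [wi * si for wi, si in zip(w, scale)]
x_scaled = [xi / si for xi, si in zip(x, scale)]
y_awq = dot(fake_quant_row(w_scaled), x_scaled)

err_plain = abs(y_plain - y_ref)
err_awq = abs(y_awq - y_ref)
print(err_plain, err_awq)   # the scaled version is closer to y_ref in this example
```

The gain is an expected-case effect rather than a guarantee for every weight; it depends on the scale not inflating the row's maximum. In practice the inverse scale on the activations is folded into the preceding operation, so no extra work is done at inference time.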

FP8 KV cache

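Separate axis (kv_cache_dtype="fp8"): halves KV bytes relative to FP16/BF16 → ~2× tokens in the same KV budget → up to ~2× concurrency. Independent of weight quantization; combine with any weight format.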
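The "halves KV bytes" claim follows from simple arithmetic, sketched below. The model shape (32 layers, 8 KV heads, head dimension 128) and the 16 GiB KV budget are illustrative assumptions, and the small overhead of FP8 scaling factors is ignored.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes   # 2 = K and V


def max_cached_tokens(kv_budget_bytes, dtype_bytes,
                      num_layers=32, num_kv_heads=8, head_dim=128):
    """How many tokens fit in a fixed KV memory budget."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes)
    return kv_budget_bytes // per_token


budget = 16 * 2**30                                   # 16 GiB left for KV (assumed)
tokens_bf16 = max_cached_tokens(budget, dtype_bytes=2)
tokens_fp8 = max_cached_tokens(budget, dtype_bytes=1)

print(kv_bytes_per_token(32, 8, 128, 2))              # bytes per token at BF16
print(tokens_bf16, tokens_fp8)                        # FP8 fits twice as many tokens
```

Twice the cached tokens means roughly twice as many concurrent sequences at the same context length, which is where the ~2× concurrency figure comes from.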

Key upstream

quantization/base_config.py:19 QuantizeMethodBase :28 create_weights :37 apply :70 Config :151 get_quant_method
quantization/fp8.py:100 Fp8Config :261 Fp8LinearMethod :316 create_weights :437 apply
quantization/__init__.py registry · quantization/awq.py · compressed_tensors/
layers/linear.py:182 Unquantized :231 LinearBase :410 ColumnParallel :1392 RowParallel

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

vLLM Mastery — From Zero to Maintainer