Phase 06 — The Hitchhiker's Guide to Quantization
← Phase 05 · Course home · Phase 07 →
Contents
- Don't Panic
- Step 1: The core idea — scale + round
- Step 2: The format zoo (don't memorize — recognize)
- Step 3: GPTQ vs AWQ (the two famous 4-bit methods)
- Step 4: How vLLM runs any of them — one interface
- The invariants to memorize
- What you'll do
Don't Panic
Weights are normally 16-bit floats. Quantization stores them in fewer bits (8, 4, even sub-4). Two payoffs, straight from Phase 0's physics: fewer bytes means less HBM to read each decode step (decode is memory-bandwidth-bound → faster) and less memory used (fit a bigger model, or more KV cache → higher concurrency). The whole trick is doing it without wrecking accuracy. This phase is the zoo of formats and how vLLM loads and runs them behind one clean interface.
fp16 weight W ──quantize──► int4 weights + scales (¼ the bytes)
│ GEMM kernel dequantizes on the fly
▼
same matmul result (approximately)
A 4-bit model reads ~¼ the weight bytes per step → can nearly double decode throughput and quarter weight memory. Quantization is often the single highest-leverage cost-per-token knob.
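To make the payoff concrete, here is a back-of-envelope sketch. The 7B parameter count, the 2 TB/s HBM bandwidth, and the helper names are illustrative assumptions, not measurements. The estimate treats a decode step as purely weight-read-bound and ignores KV-cache reads and kernel overheads.

```python
def weight_bytes(n_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate bytes needed to store the weights.

    If group_size is given, add one scale of `scale_bits` per group
    (zero-points and other metadata are ignored in this sketch).
    """
    bits = n_params * bits_per_weight
    if group_size is not None:
        bits += (n_params / group_size) * scale_bits
    return bits / 8


def min_decode_step_seconds(n_bytes, hbm_bytes_per_second):
    """Lower bound on one decode step if every weight byte is read once."""
    return n_bytes / hbm_bytes_per_second


N_PARAMS = 7e9   # assumed model size
HBM_BW = 2e12    # assumed bandwidth: 2 TB/s

fp16 = weight_bytes(N_PARAMS, 16)
int4 = weight_bytes(N_PARAMS, 4, group_size=128)

for name, b in [("fp16", fp16), ("int4 g128", int4)]:
    step_ms = min_decode_step_seconds(b, HBM_BW) * 1e3
    print(f"{name:10s} {b / 1e9:6.2f} GB  >= {step_ms:.2f} ms/step")
print(f"int4/fp16 byte ratio: {int4 / fp16:.3f}")
```

The ratio comes out slightly above ¼ because the per-group scales add about 0.125 bits per weight. Measured throughput gains are usually smaller than the byte ratio suggests, since KV-cache reads and dequantization work are not reduced.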
Step 1: The core idea — scale + round
To store a float tensor in int8, find a scale s so values fit in [-127, 127], then store
round(W / s) as int8 and keep s (a float) on the side. To use it: W ≈ s × int8. That's it.
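A minimal sketch of that round trip, using plain Python lists and one scale for the whole tensor. The function names are made up for illustration; a real implementation would operate on tensors.

```python
def quantize_per_tensor(values, qmax=127):
    """Symmetric quantization: one scale for the whole list of floats."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code, then clamp into [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats: W ≈ s × int8."""
    return [scale * q for q in codes]


w = [0.023, -0.5, 0.314, 1.27, -0.9]
codes, s = quantize_per_tensor(w)
w_hat = dequantize(codes, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))

print("codes:", codes)
print(f"scale: {s:.5f}")
print(f"max round-trip error: {max_err:.5f} (bound s/2 = {s / 2:.5f})")
```

Because each value is rounded to the nearest multiple of s, the round-trip error per weight is at most s/2. That is why a smaller scale (finer granularity) means less error.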
The art is choosing s well so rounding error stays small:
- per-tensor scale: one s for the whole matrix (cheapest, least accurate).
- per-channel scale: one s per output channel (much better: outliers in one channel don't blow up the others).
- per-group scale: one s per small group of weights (e.g. 128); best accuracy for 4-bit, but more scales to store.
You'll implement per-channel int8 fake-quant in lab-01 and measure the round-trip error and
the memory saved.
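The three granularities can be compared on a toy matrix with a single outlier. This is a sketch, not the lab solution: it uses a 4-bit range (qmax = 7) and a group size of 2 so the differences show up on eight numbers, and all names are illustrative.

```python
def quant_dequant(values, qmax):
    """Symmetric fake-quant of a flat list with a single scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in values]


def fake_quant(matrix, qmax, granularity, group_size=None):
    """Fake-quantize a list-of-rows matrix (rows = output channels)."""
    if granularity == "tensor":
        flat = [v for row in matrix for v in row]
        out = quant_dequant(flat, qmax)
        n = len(matrix[0])
        return [out[i * n:(i + 1) * n] for i in range(len(matrix))]
    if granularity == "channel":
        return [quant_dequant(row, qmax) for row in matrix]
    if granularity == "group":
        result = []
        for row in matrix:
            new_row = []
            for start in range(0, len(row), group_size):
                new_row += quant_dequant(row[start:start + group_size], qmax)
            result.append(new_row)
        return result
    raise ValueError(f"unknown granularity: {granularity}")


def mean_abs_error(a, b):
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


W = [[0.11, -0.07, 0.05, 0.09],
     [0.10, 8.0, -0.06, 0.08]]   # second row contains one outlier
QMAX = 7  # 4-bit symmetric range, chosen so the differences are easy to see

errors = {
    "tensor": mean_abs_error(W, fake_quant(W, QMAX, "tensor")),
    "channel": mean_abs_error(W, fake_quant(W, QMAX, "channel")),
    "group": mean_abs_error(W, fake_quant(W, QMAX, "group", group_size=2)),
}
for name, err in errors.items():
    print(f"per-{name:8s} mean abs error: {err:.4f}")
```

With one scale for the whole matrix, the outlier forces every small weight to round to zero. Per-channel scales confine the damage to the outlier's own row, and per-group scales confine it to the outlier's own group.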
Step 2: The format zoo (don't memorize — recognize)
Two axes organize everything:
Axis A — what gets quantized:
- weight-only (GPTQ, AWQ, most 4-bit): only weights are low-bit; activations stay fp16. Helps memory + decode bandwidth. Most common.
- weight + activation (FP8, INT8 "W8A8"): both low-bit; can use faster low-precision tensor cores for the matmul itself (helps compute too, e.g. prefill).
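The difference between the two options on Axis A shows up in where the arithmetic happens. The sketch below uses a single dot product in plain Python; the function names are illustrative, and real kernels do the integer accumulation on tensor cores rather than with Python ints.

```python
def quantize(values, qmax=127):
    """Symmetric quantization of a list of floats; returns (int codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) for v in values], scale


def dot_weight_only(w_codes, w_scale, x):
    """Weight-only: dequantize weights to float, then multiply in float."""
    return sum((w_scale * q) * xi for q, xi in zip(w_codes, x))


def dot_w8a8(w_codes, w_scale, x):
    """W8A8: quantize activations too, accumulate in integers, rescale once."""
    x_codes, x_scale = quantize(x)
    acc = sum(qw * qx for qw, qx in zip(w_codes, x_codes))  # integer accumulate
    return acc * (w_scale * x_scale)


w = [0.4, -1.2, 0.7, 0.05]
x = [1.5, 0.3, -2.0, 0.8]
w_codes, w_scale = quantize(w)

exact = sum(wi * xi for wi, xi in zip(w, x))
print("float       :", exact)
print("weight-only :", dot_weight_only(w_codes, w_scale, x))
print("W8A8        :", dot_w8a8(w_codes, w_scale, x))
```

In the weight-only path the multiply-adds still run in floating point, so the saving is in bytes read. In the W8A8 path the inner loop is integer-only and the two scales are applied once at the end, which is what lets low-precision hardware speed up the matmul itself.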
Axis B — the numeric format:
- FP8 (E4M3/E5M2): 8-bit float; great accuracy/speed on Hopper+; also used for the KV cache.
- INT8 / INT4: integer quant with scales.
- MXFP4 / NVFP4: 4-bit float "microscaling" formats (each small block of values shares one scale factor); the frontier for 4-bit accuracy on Blackwell.
- GPTQ / AWQ: methods that produce 4-bit weights using calibration data (see Step 3).
- GGUF: the llama.cpp file format (various bit widths).
- compressed-tensors / ModelOpt / TorchAO: families/toolkits that emit quantized checkpoints vLLM can load.
You don't need all of them today. You need: fewer bits → less bandwidth/memory → faster decode, at some accuracy cost; the format must match the GEMM kernel that consumes it.
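For recognition purposes, the microscaling idea from the list above can be sketched in a few lines: every element is a 4-bit float (E2M1), and a block of elements shares one power-of-two scale. This is a simplified illustration in the style of MXFP4, not a conforming implementation. The block here has 8 elements for brevity (the OCP MX formats use blocks of 32), the tie-breaking rule is arbitrary, and NVFP4 uses a different block size and scale encoding that this sketch does not model.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def nearest_e2m1(x):
    """Round x to the nearest E2M1 value, saturating at ±6.

    Ties go to the smaller magnitude in this sketch.
    """
    mag = min(E2M1_VALUES, key=lambda v: abs(v - abs(x)))
    return math.copysign(mag, x)


def block_quant_dequant(block):
    """Fake-quantize one block with a shared power-of-two scale."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0:
        return list(block), 0
    # Pick the exponent so the block maximum lands in the top E2M1 binade
    # (the largest E2M1 value is 6 = 1.5 * 2**2, hence the "- 2").
    shared_exp = math.floor(math.log2(max_abs)) - 2
    scale = 2.0 ** shared_exp
    return [scale * nearest_e2m1(v / scale) for v in block], shared_exp


block = [0.03, -0.11, 0.2, 0.45, -0.9, 0.07, 0.0, 0.6]
deq, shared_exp = block_quant_dequant(block)
print("shared exponent:", shared_exp)
for original, approx in zip(block, deq):
    print(f"{original:6.2f} -> {approx:7.4f}")
```

Note how the spacing between representable values grows with magnitude, and how the largest element can saturate. Both are properties of the float grid that integer formats with the same bit width do not share.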
Step 3: GPTQ vs AWQ (the two famous 4-bit methods)
Both are post-training, weight-only 4-bit, using a little calibration data:
- GPTQ: minimizes the layer's output error using second-order (Hessian-based) information, quantizing weights column by column and adjusting the not-yet-quantized weights to compensate for the error introduced so far.
- AWQ (Activation-aware Weight Quantization): protects the most salient weight channels (those multiplied by large activations) by scaling them before rounding.
Both plug into vLLM the same way — as a LinearMethod (Step 4). The Marlin kernels make 4-bit
matmuls fast on GPU.
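The AWQ intuition fits in a toy example: scale a salient weight up before rounding, and fold the inverse scale into the activation so the float result is unchanged. The sketch below is an assumption-laden illustration, not AWQ itself. The per-channel scales are hand-picked here, whereas AWQ searches for them using calibration activations, and the weights, activations, and function names are invented for the example.

```python
def quant_dequant_row(row, qmax=7):
    """Symmetric 4-bit fake-quant of one weight row with a single scale."""
    scale = max(abs(v) for v in row) / qmax
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in row]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


w = [0.45, -1.0, 0.34, 0.9]   # one output row of weights
x = [0.1, 0.2, 10.0, 0.1]     # input channel 2 sees a large activation
s = [1.0, 1.0, 2.0, 1.0]      # hand-picked per-input-channel scales

exact = dot(w, x)

# Plain round-to-nearest on the original weights.
err_plain = abs(dot(quant_dequant_row(w), x) - exact)

# AWQ-style: scale the salient weight up before rounding, and divide the
# matching activation by the same factor so the float product is unchanged.
w_scaled = [wi * si for wi, si in zip(w, s)]
x_scaled = [xi / si for xi, si in zip(x, s)]
err_scaled = abs(dot(quant_dequant_row(w_scaled), x_scaled) - exact)

print(f"plain round-to-nearest error: {err_plain:.4f}")
print(f"with salient-channel scaling: {err_scaled:.4f}")
```

Scaling the salient weight by 2 makes the rounding grid effectively twice as fine for that channel, and since that channel is multiplied by the largest activation, its rounding error dominates the output error. The trick only works while the scaled weight does not become the new row maximum, which is one reason the real method searches for the scales.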
Step 4: How vLLM runs any of them — one interface
vLLM hides every format behind two abstractions (quantization/base_config.py):
- QuantizationConfig: parsed from the checkpoint; knows the format and, via get_quant_method(layer), hands back the right method for a given layer.
- LinearMethodBase (a QuantizeMethodBase): create_weights() (allocate the int weights + scales) and apply() (run the quantized matmul, dequantizing as needed).
A Linear layer (Phase 14) doesn't know or care which quant method it has — it just calls
self.quant_method.apply(...). Swap FP8 for AWQ and the model code is unchanged. (Same
decoupling pattern as attention backends in Phase 4.) The matmul, though, must use a kernel
that understands the format (CUTLASS FP8, Marlin INT4, …) — Phase 7.
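The decoupling can be mimicked in plain Python. Everything below is a toy: the Toy* classes are hypothetical stand-ins that only borrow the method names get_quant_method, create_weights, and apply from the text above. The real vLLM methods take more arguments than shown and operate on tensors and GPU kernels.

```python
class ToyUnquantizedMethod:
    """Hypothetical method that keeps the weights as floats."""

    def create_weights(self, layer, weight_rows):
        layer.weight = [list(row) for row in weight_rows]

    def apply(self, layer, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in layer.weight]


class ToyFakeInt8Method:
    """Hypothetical method storing per-output-channel int8 codes + scales."""

    def create_weights(self, layer, weight_rows):
        layer.codes, layer.scales = [], []
        for row in weight_rows:
            max_abs = max(abs(v) for v in row)
            scale = max_abs / 127 if max_abs > 0 else 1.0
            layer.scales.append(scale)
            layer.codes.append([round(v / scale) for v in row])

    def apply(self, layer, x):
        # Integer codes times activations, then one rescale per output channel.
        return [s * sum(q * xi for q, xi in zip(row, x))
                for row, s in zip(layer.codes, layer.scales)]


class ToyQuantConfig:
    """Hypothetical stand-in for a QuantizationConfig."""

    def __init__(self, fmt):
        self.fmt = fmt

    def get_quant_method(self, layer):
        return ToyFakeInt8Method() if self.fmt == "int8" else ToyUnquantizedMethod()


class ToyLinear:
    """The layer only talks to quant_method; it never inspects the format."""

    def __init__(self, weight_rows, quant_config):
        self.quant_method = quant_config.get_quant_method(self)
        self.quant_method.create_weights(self, weight_rows)

    def forward(self, x):
        return self.quant_method.apply(self, x)


W = [[0.4, -1.0], [0.25, 0.75]]
x = [2.0, 1.0]
y_float = ToyLinear(W, ToyQuantConfig("none")).forward(x)
y_int8 = ToyLinear(W, ToyQuantConfig("int8")).forward(x)
print("unquantized:", y_float)
print("fake int8  :", y_int8)
```

ToyLinear is identical in both runs; only the config changes. That is the property the section describes: adding a new format means adding a new method class, not touching the layer or the model code.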
The invariants to memorize
- Fewer weight bits → less HBM read per step → faster decode (memory-bound); plus less memory.
- Quant = store round(W/s) + the scales; accuracy depends on scale granularity (per-tensor < per-channel < per-group).
- Weight-only (GPTQ/AWQ) helps bandwidth/memory; weight+activation (FP8/INT8) can also speed the matmul.
- The format must match the GEMM kernel (Phase 7). Mismatch = wrong/slow.
- vLLM dispatches via QuantizationConfig.get_quant_method → LinearMethodBase.{create_weights, apply}. Model code is format-agnostic.
- FP8 KV cache is a separate axis: halves KV bytes → ~doubles concurrency (Phase 0 lab-02).
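The KV-cache invariant is simple arithmetic. The sketch below assumes a model shape (32 layers, 8 KV heads, head dimension 128) and a 20 GiB KV budget purely for illustration, and ignores any storage for FP8 scaling factors.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(kv_budget_bytes, per_token_bytes):
    """How many cached tokens fit in the given KV budget."""
    return kv_budget_bytes // per_token_bytes


# Assumed shapes and budget, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET = 20 * 1024**3  # 20 GiB left for KV cache after weights

fp16_tok = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8_tok = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)

print(f"fp16 KV: {fp16_tok // 1024} KiB/token, "
      f"{tokens_that_fit(BUDGET, fp16_tok)} tokens fit")
print(f"fp8  KV: {fp8_tok // 1024} KiB/token, "
      f"{tokens_that_fit(BUDGET, fp8_tok)} tokens fit")
```

Twice as many cached tokens in the same budget means roughly twice as many concurrent sequences of a given length, independent of how the weights are quantized.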
What you'll do
- Read: 01-deep-dive.md (QuantizationConfig / LinearMethodBase, the FP8 method end to end, and where Linear dispatches, line-anchored).
- Build: 02-mini-build.md (a per-channel int8 fake-quant linear).
- Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
  - lab-01-fake-quant-linear [CPU-OK]: int8 per-channel quant/dequant; measure error + memory.
  - lab-02-quantize-and-eval [GPU-OPT]: fp16 vs FP8 vs AWQ-4bit throughput/memory (captured).
  - lab-03-int4-groups-and-packing [CPU-OK]: the GPTQ/AWQ storage reality, i.e. group-wise scales (why group_size=128) and two-nibbles-per-byte packing, with the error/overhead trade measured.
  - lab-04-activation-outliers-smoothquant [CPU-OK]: reproduce the activation-outlier cliff that breaks naive W8A8, then fix it with the SmoothQuant migration (an exact reparametrization).
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.