Phase 06 — The Hitchhiker's Guide to Quantization

Phase 05 · Course home · Phase 07


Don't Panic

Weights are normally 16-bit floats. Quantization stores them in fewer bits (8, 4, even sub-4). Two payoffs, straight from Phase 0's physics: fewer bytes means less HBM to read each decode step (decode is memory-bandwidth-bound → faster) and less memory used (fit a bigger model, or more KV cache → higher concurrency). The whole trick is doing it without wrecking accuracy. This phase is the zoo of formats and how vLLM loads and runs them behind one clean interface.

fp16 weight W  ──quantize──►  int4 weights + scales  (¼ the bytes)
                                   │  GEMM kernel dequantizes on the fly
                                   ▼
                               same matmul result (approximately)

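A 4-bit model reads ~¼ the weight bytes per step (slightly more once the scales are counted) → weight memory drops to about a quarter and decode throughput can nearly double. The speedup is smaller than the 4× byte ratio because KV-cache reads and dequantization work do not shrink with the weights. Quantization is often the single highest-leverage cost-per-token knob.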
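As a back-of-the-envelope check on those ratios, the sketch below computes weight bytes and a bandwidth-only lower bound on per-token decode time. The 7e9 parameter count, the 2 TB/s bandwidth, and the group size of 128 with 16-bit scales are illustrative assumptions, not measurements of any real model or GPU.

```python
# Back-of-the-envelope: weight bytes and a bandwidth-only decode bound.
# Every number here is an illustrative assumption, not a measurement.

def weight_bytes(n_params, bits, group_size=0, scale_bits=16):
    """Bytes for the weights plus, optionally, one scale per group of weights."""
    total_bits = n_params * bits
    if group_size:
        total_bits += (n_params / group_size) * scale_bits
    return total_bits / 8


def decode_ms_lower_bound(n_bytes, hbm_bytes_per_s):
    """Time to stream every weight byte once, ignoring KV cache and compute."""
    return n_bytes / hbm_bytes_per_s * 1e3


N_PARAMS = 7e9   # hypothetical 7B-parameter model
HBM_BW = 2e12    # hypothetical 2 TB/s of memory bandwidth

fp16 = weight_bytes(N_PARAMS, 16)
int8 = weight_bytes(N_PARAMS, 8)
int4 = weight_bytes(N_PARAMS, 4, group_size=128)

for name, b in [("fp16", fp16), ("int8", int8), ("int4 g128", int4)]:
    ms = decode_ms_lower_bound(b, HBM_BW)
    print(f"{name:10s} {b / 1e9:6.2f} GB   >= {ms:5.2f} ms/token")
```

Changing `group_size` shows how the scale overhead trades against scale granularity.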


Step 1: The core idea — scale + round

To store a float tensor in int8, find a scale s so values fit in [-127, 127], then store round(W / s) as int8 and keep s (a float) on the side. The simplest choice is s = max|W| / 127. To use it, multiply back: W ≈ s × (the stored int8 values). That's it. The art is choosing s well so rounding error stays small:

  • per-tensor scale: one s for the whole matrix (cheapest, least accurate).
  • per-channel scale: one s per output channel (much better — outliers in one channel don't blow up the others).
  • per-group scale: one s per small group of weights (e.g. 128) — best accuracy for 4-bit, more scales to store.

You'll implement per-channel int8 fake-quant in lab-01 and measure the round-trip error and the memory saved.
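Before the lab, here is a minimal stdlib-only sketch of the three scale granularities on a toy matrix with one large-magnitude output channel. The matrix shape, the outlier factor of 50, and the group size of 4 (instead of a realistic 128) are assumptions chosen to keep the example small.

```python
import random


def quantize(values, qmax=127):
    """Symmetric absmax quantization of a flat list -> (int codes, scale)."""
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def roundtrip_error(W, mode, group_size=4):
    """Mean |W - s*q| where rows of W are output channels."""
    if mode == "per-tensor":
        groups = [[v for row in W for v in row]]          # one scale overall
    elif mode == "per-channel":
        groups = W                                         # one scale per row
    elif mode == "per-group":
        groups = [row[i:i + group_size]                    # one scale per chunk
                  for row in W for i in range(0, len(row), group_size)]
    else:
        raise ValueError(mode)
    err, n = 0.0, 0
    for g in groups:
        codes, s = quantize(g)
        err += sum(abs(v - s * q) for v, q in zip(g, codes))
        n += len(g)
    return err / n


random.seed(0)
W = [[random.gauss(0.0, 0.02) for _ in range(16)] for _ in range(8)]
W[3] = [50.0 * v for v in W[3]]   # one output channel with large weights

errors = {m: roundtrip_error(W, m)
          for m in ("per-tensor", "per-channel", "per-group")}
for mode, err in errors.items():
    print(f"{mode:12s} mean |W - s*q| = {err:.6f}")
```

With a single shared scale, the outlier channel sets the step size for every other channel, which is why the per-tensor error comes out much larger than the per-channel error here. Finer groups typically reduce it further, at the cost of more scales to store.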


Step 2: The format zoo (don't memorize — recognize)

Two axes organize everything:

Axis A — what gets quantized:

  • weight-only (GPTQ, AWQ, most 4-bit): only weights are low-bit; activations stay fp16. Helps memory + decode bandwidth. Most common.
  • weight + activation (FP8, INT8 "W8A8"): both low-bit; can use faster low-precision tensor cores for the matmul itself (helps compute too, e.g. prefill).
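To make the weight + activation case concrete: when both operands are int8, the dot product can accumulate in integers and be rescaled once at the end, which is what lets the hardware use its low-precision matmul units. A minimal sketch, assuming symmetric per-tensor scales for both x and W (production kernels use finer-grained scales and fused epilogues; the function names are hypothetical):

```python
def quantize_int8(values):
    """Symmetric absmax int8 quantization -> (int codes, scale)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def w8a8_dot(x, w):
    """Quantize both operands, accumulate in integers, rescale once."""
    xq, sx = quantize_int8(x)
    wq, sw = quantize_int8(w)
    acc = sum(a * b for a, b in zip(xq, wq))   # integer-only accumulation
    return sx * sw * acc                        # single float rescale


x = [0.5, -1.25, 2.0, 0.75]      # activations
w = [0.02, -0.01, 0.03, 0.016]   # one row of weights

exact = sum(a * b for a, b in zip(x, w))
approx = w8a8_dot(x, w)
print(f"exact={exact:.6f}  w8a8={approx:.6f}  rel_err={abs(approx - exact) / abs(exact):.4%}")
```

In the weight-only case, by contrast, the weights are dequantized back to 16-bit before the multiply, so the arithmetic itself stays in floating point.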

Axis B — the numeric format:

  • FP8 (E4M3/E5M2): 8-bit float; great accuracy/speed on Hopper+; also used for the KV cache.
  • INT8 / INT4: integer quant with scales.
  • MXFP4 / NVFP4: 4-bit float "microscaling" formats (each small block of values shares one scale) — frontier for 4-bit accuracy on Blackwell.
  • GPTQ / AWQ: methods that produce 4-bit weights using calibration data (see Step 3).
  • GGUF: the llama.cpp file format (various bit widths).
  • compressed-tensors / ModelOpt / TorchAO: families/toolkits that emit quantized checkpoints vLLM can load.
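For the two FP8 variants, the names encode the bit split (exponent bits, mantissa bits), and that split fixes the range-versus-precision trade. The sketch below derives the largest finite value of each from the bit layout, assuming the OCP FP8 conventions: E5M2 reserves its top exponent code for inf/NaN, while E4M3 only gives up the single all-ones pattern. The helper name `fp8_max` is hypothetical.

```python
def fp8_max(exp_bits, man_bits, ieee_like):
    """Largest finite value of a 1-sign / exp_bits / man_bits float."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (the E5M2 convention).
        max_exp_code = 2 ** exp_bits - 2
        max_man = 2 ** man_bits - 1
    else:
        # Top exponent code is usable; only the all-ones mantissa is NaN
        # (the E4M3 convention).
        max_exp_code = 2 ** exp_bits - 1
        max_man = 2 ** man_bits - 2
    return 2.0 ** (max_exp_code - bias) * (1 + max_man / 2 ** man_bits)


e4m3_max = fp8_max(4, 3, ieee_like=False)   # 448.0
e5m2_max = fp8_max(5, 2, ieee_like=True)    # 57344.0

print(f"E4M3: max {e4m3_max:g}, relative step 2^-3 = {2 ** -3}")
print(f"E5M2: max {e5m2_max:g}, relative step 2^-2 = {2 ** -2}")
```

E4M3 spends its extra bit on precision and E5M2 on range, so a scale is still needed to map a tensor into whichever format is used.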

You don't need all of them today. You need: fewer bits → less bandwidth/memory → faster decode, at some accuracy cost; the format must match the GEMM kernel that consumes it.


Step 3: GPTQ vs AWQ (the two famous 4-bit methods)

Both are post-training, weight-only 4-bit, using a little calibration data:

  • GPTQ: minimizes the layer's output error using second-order (Hessian-based) information, quantizing weights column by column and adjusting the not-yet-quantized weights to compensate for the error each rounding step introduces.
  • AWQ (Activation-aware Weight Quantization): protects the most salient weight channels (those multiplied by large activations) by scaling them before rounding.

Both plug into vLLM the same way — as a LinearMethodBase implementation (Step 4). The Marlin kernels make 4-bit matmuls fast on GPU.
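Whichever method picks the rounded values, what lands on disk is 4-bit codes plus scales, with two codes sharing each byte (the storage reality lab-03 explores). A minimal sketch of one group, assuming a symmetric scheme with a +8 offset and low-nibble-first ordering; real GPTQ/AWQ checkpoints use their own layouts, zero points, and orderings, so treat every detail here as illustrative.

```python
def quantize_group_int4(group):
    """Symmetric 4-bit: codes in [-7, 7], stored with a +8 offset as 1..15."""
    amax = max(abs(v) for v in group)
    scale = amax / 7 if amax > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) + 8 for v in group]
    return codes, scale


def pack_nibbles(codes):
    """Two 4-bit codes per byte: even index in the low nibble, odd in the high."""
    assert len(codes) % 2 == 0
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_nibbles(packed):
    out = []
    for b in packed:
        out.extend((b & 0x0F, b >> 4))
    return out


def dequantize_group(packed, scale):
    return [scale * (c - 8) for c in unpack_nibbles(packed)]


group = [0.7, -0.35, 0.1, 0.0, -0.7, 0.2, 0.5, -0.05]
codes, scale = quantize_group_int4(group)
packed = pack_nibbles(codes)
restored = dequantize_group(packed, scale)

print(f"{len(group)} weights -> {len(packed)} bytes + 1 scale")
print("max abs error:", max(abs(a - b) for a, b in zip(group, restored)))
```

The kernel that consumes these bytes has to assume the same nibble order and offset, which is one concrete reason the format must match the GEMM kernel.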


Step 4: How vLLM runs any of them — one interface

vLLM hides every format behind two abstractions (quantization/base_config.py):

  • QuantizationConfig — parsed from the checkpoint; knows the format and, via get_quant_method(layer), hands back the right method for a given layer.
  • LinearMethodBase (a QuantizeMethodBase) — create_weights() (allocate the int weights + scales) and apply() (run the quantized matmul, dequantizing as needed).

A Linear layer (Phase 14) doesn't know or care which quant method it has — it just calls self.quant_method.apply(...). Swap FP8 for AWQ and the model code is unchanged. (Same decoupling pattern as attention backends in Phase 4.) The matmul, though, must use a kernel that understands the format (CUTLASS FP8, Marlin INT4, …) — Phase 7.
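The decoupling can be sketched with toy classes. Everything named `Toy*` below is hypothetical, and the signatures are deliberately simplified; they are not vLLM's real ones. The only thing taken from the text is the shape: a config hands back a method, and the layer only ever calls create_weights() and apply().

```python
from abc import ABC, abstractmethod


class ToyLinearMethod(ABC):
    """Stand-in for the LinearMethodBase role (simplified signatures)."""

    @abstractmethod
    def create_weights(self, layer, in_features, out_features): ...

    @abstractmethod
    def apply(self, layer, x): ...


class ToyUnquantized(ToyLinearMethod):
    def create_weights(self, layer, in_features, out_features):
        layer.weight = [[0.0] * in_features for _ in range(out_features)]

    def apply(self, layer, x):
        return [sum(w * v for w, v in zip(row, x)) for row in layer.weight]


class ToyInt8PerChannel(ToyLinearMethod):
    def create_weights(self, layer, in_features, out_features):
        layer.qweight = [[0] * in_features for _ in range(out_features)]
        layer.scales = [1.0] * out_features

    def apply(self, layer, x):
        # Dequantize on the fly: one scale per output channel.
        return [s * sum(q * v for q, v in zip(row, x))
                for row, s in zip(layer.qweight, layer.scales)]


class ToyQuantConfig:
    """Stand-in for the QuantizationConfig role."""

    def __init__(self, fmt):
        self.fmt = fmt

    def get_quant_method(self, layer):
        return ToyInt8PerChannel() if self.fmt == "int8" else ToyUnquantized()


class ToyLinear:
    """The layer never branches on the format; it only delegates."""

    def __init__(self, in_features, out_features, config):
        self.quant_method = config.get_quant_method(self)
        self.quant_method.create_weights(self, in_features, out_features)

    def forward(self, x):
        return self.quant_method.apply(self, x)


fp = ToyLinear(2, 1, ToyQuantConfig("none"))
fp.weight = [[0.5, -0.25]]                      # pretend checkpoint load

q = ToyLinear(2, 1, ToyQuantConfig("int8"))
q.qweight, q.scales = [[127, -64]], [0.5 / 127]  # same weights, quantized

x = [2.0, 1.0]
print(fp.forward(x), q.forward(x))
```

`ToyLinear.forward` is identical in both cases; only the object returned by `get_quant_method` differs, which is the property that keeps model code format-agnostic.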


The invariants to memorize

  1. Fewer weight bits → less HBM read per step → faster decode (memory-bound); plus less memory.
  2. Quant = store round(W/s) + the scale s; accuracy depends on scale granularity (per-tensor < per-channel < per-group).
  3. Weight-only (GPTQ/AWQ) helps bandwidth/memory; weight+activation (FP8/INT8) can also speed the matmul.
  4. The format must match the GEMM kernel (Phase 7). Mismatch = wrong/slow.
  5. vLLM dispatches via QuantizationConfig.get_quant_method → LinearMethodBase.{create_weights, apply}. Model code is format-agnostic.
  6. FP8 KV cache is a separate axis: halves KV bytes relative to a 16-bit cache → ~doubles the tokens that fit, and so concurrency when KV memory is the limit (Phase 0 lab-02).
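The arithmetic behind invariant 6 can be sketched directly. The model shape (32 layers, 8 KV heads, head dimension 128) and the 16 GiB cache budget are hypothetical, and the sketch ignores the small extra storage for FP8 scales.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """K and V each hold n_kv_heads * head_dim values per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(budget_bytes, per_token_bytes):
    return budget_bytes // per_token_bytes


BUDGET = 16 * 1024 ** 3   # hypothetical 16 GiB left over for KV cache

fp16_per_tok = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)
fp8_per_tok = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)

fp16_tokens = tokens_that_fit(BUDGET, fp16_per_tok)
fp8_tokens = tokens_that_fit(BUDGET, fp8_per_tok)

print(f"fp16 KV: {fp16_per_tok} B/token -> {fp16_tokens} tokens")
print(f"fp8  KV: {fp8_per_tok} B/token -> {fp8_tokens} tokens")
```

Twice the tokens in the same budget means roughly twice as many concurrent sequences of a given length, independent of how the weights are quantized.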

What you'll do

  • Read: 01-deep-dive.md — QuantizationConfig/LinearMethodBase, the FP8 method end to end, and where Linear dispatches, line-anchored.
  • Build: 02-mini-build.md — a per-channel int8 fake-quant linear.
  • Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
    • lab-01-fake-quant-linear [CPU-OK] — int8 per-channel quant/dequant; measure error + memory.
    • lab-02-quantize-and-eval [GPU-OPT] — fp16 vs FP8 vs AWQ-4bit throughput/memory (captured).
    • lab-03-int4-groups-and-packing [CPU-OK] — the GPTQ/AWQ storage reality: group-wise scales (why group_size=128) and two-nibbles-per-byte packing, with the error/overhead trade measured.
    • lab-04-activation-outliers-smoothquant [CPU-OK] — reproduce the activation-outlier cliff that breaks naive W8A8, then fix it with the SmoothQuant migration (an exact reparametrization).
  • Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
