Phase 06 — Deep Dive: the quantization dispatch system

Paths relative to upstream/ at v0.22.1 @ 0decac0.

vllm/model_executor/layers/quantization/base_config.py   QuantizationConfig + QuantizeMethodBase
vllm/model_executor/layers/quantization/__init__.py      the registry of all methods
vllm/model_executor/layers/quantization/fp8.py           a complete method (FP8), end to end
vllm/model_executor/layers/quantization/awq.py           AWQ 4-bit weight-only
vllm/model_executor/layers/quantization/compressed_tensors/   the compressed-tensors family
vllm/model_executor/layers/linear.py                     where a Linear layer calls its method

1. The two base abstractions: base_config.py

vllm/model_executor/layers/quantization/base_config.py:

class QuantizeMethodBase(ABC):           # :19
    def create_weights(self, layer, ...): ...   # :28  allocate quantized weights + scale params
    def apply(self, layer, x, ...) -> Tensor: ...  # :37  run the (de)quantized matmul

class QuantizationConfig(ABC):           # :70
    def get_quant_method(self, layer, prefix) -> QuantizeMethodBase | None: ...  # :151

This is the whole contract. A QuantizationConfig is parsed from the checkpoint (it knows "this is AWQ, group size 128"); for each layer the model builds, get_quant_method returns the right method object. LinearMethodBase is the linear-layer specialization of QuantizeMethodBase (defined in linear.py). Two methods — create_weights and apply — are all a new format needs. That's why vLLM supports a dozen formats: each is one config + one method class.
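The contract above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not vLLM code: ToyQuantizeMethod, ToyInt8Method, ToyInt8Config, the load helper and the skip_prefixes parameter are all hypothetical names, and the "format" is a made-up weight-only int8 scheme with one scale per output row.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace


class ToyQuantizeMethod(ABC):
    """Stand-in for QuantizeMethodBase: the two hooks a format implements."""

    @abstractmethod
    def create_weights(self, layer, in_features, out_features): ...

    @abstractmethod
    def apply(self, layer, x): ...


class ToyInt8Method(ToyQuantizeMethod):
    """Hypothetical weight-only int8 format, one scale per output row."""

    def create_weights(self, layer, in_features, out_features):
        # Allocate quantized storage plus the scales needed to undo it.
        layer.qweight = [[0] * in_features for _ in range(out_features)]
        layer.scales = [1.0] * out_features

    def load(self, layer, weight):
        # Helper outside the two-method contract: fill the allocated storage.
        for i, row in enumerate(weight):
            scale = max(abs(v) for v in row) / 127 or 1.0
            layer.scales[i] = scale
            layer.qweight[i] = [round(v / scale) for v in row]

    def apply(self, layer, x):
        # Integer-weight dot product per row, then rescale to real units.
        return [s * sum(q * xi for q, xi in zip(row, x))
                for row, s in zip(layer.qweight, layer.scales)]


class ToyInt8Config:
    """Stand-in for a QuantizationConfig parsed from a checkpoint."""

    def __init__(self, skip_prefixes=()):
        self.skip_prefixes = tuple(skip_prefixes)

    def get_quant_method(self, layer, prefix):
        # None means "this config does not quantize this layer".
        if prefix.startswith(self.skip_prefixes):
            return None
        return ToyInt8Method()


config = ToyInt8Config(skip_prefixes=("lm_head",))
layer = SimpleNamespace()
method = config.get_quant_method(layer, "model.layers.0.mlp")
method.create_weights(layer, in_features=2, out_features=1)
method.load(layer, [[0.25, -1.0]])
y = method.apply(layer, [2.0, 1.0])   # exact answer would be -0.5
skipped = config.get_quant_method(layer, "lm_head")
```

The point of the sketch is the shape of the interface: the config decides per layer, and the method object owns both storage layout and the forward computation.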

2. A complete method: FP8 (fp8.py)

  • class Fp8Config(QuantizationConfig) (:100) — parses FP8 settings from the checkpoint.
  • class Fp8LinearMethod(LinearMethodBase) (:261):
    • create_weights (:316) — allocates the fp8 weight tensor and its scale(s) on the layer.
    • apply (:437) — runs the FP8 matmul (dequantizing / using FP8 tensor cores), with the scales.

Read Fp8LinearMethod.apply and notice that it dispatches to an FP8 GEMM kernel (CUTLASS / scaled mm, Phase 7). The method owns the numerics; the kernel does the math. FP8 is also weight + activation capable (W8A8): it can quantize the activation x too and use FP8 tensor cores, which is why FP8 can speed up prefill, not just decode.
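The arithmetic behind a W8A8 scaled matmul can be sketched in pure Python. This is not vLLM's implementation: round_to_e4m3, quantize_per_tensor and w8a8_linear are hypothetical helpers, and the sketch assumes the E4M3 variant of FP8 (largest finite value 448) with a single per-tensor scale for activations and for weights. It only illustrates why the scales can be applied once, after the low-precision dot product.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the assumed E4M3 format


def round_to_e4m3(v):
    """Round v to the nearest E4M3-representable value, saturating at +-448."""
    if v == 0.0:
        return 0.0
    mag = min(abs(v), FP8_E4M3_MAX)
    _, e = math.frexp(mag)         # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)           # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)        # 3 mantissa bits: 8 steps per power of two
    return math.copysign(min(round(mag / step) * step, FP8_E4M3_MAX), v)


def quantize_per_tensor(values):
    """Return (rounded values, scale), using one scale for the whole tensor."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / FP8_E4M3_MAX
    return [round_to_e4m3(v / scale) for v in values], scale


def w8a8_linear(x, weight_rows):
    """y = (q_x . q_w) * scale_x * scale_w: both scales factor out of the sum."""
    qx, sx = quantize_per_tensor(x)
    flat = [v for row in weight_rows for v in row]
    qflat, sw = quantize_per_tensor(flat)
    n = len(x)
    qrows = [qflat[i:i + n] for i in range(0, len(qflat), n)]
    return [sx * sw * sum(a * b for a, b in zip(qx, row)) for row in qrows]


x = [0.5, -1.0, 2.0]
w = [[1.0, 0.25, -0.5],
     [0.1, 0.2, 0.3]]
y_fp8 = w8a8_linear(x, w)
y_ref = [sum(a * b for a, b in zip(x, row)) for row in w]
```

The first output row is reproduced exactly because its inputs land on representable values; the second shows the rounding error that three mantissa bits introduce.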

3. The registry: __init__.py

vllm/model_executor/layers/quantization/__init__.py maps a quant method name (from the checkpoint's config, e.g. "fp8", "awq", "compressed-tensors", "gptq_marlin", "gguf", "modelopt", "torchao") to its QuantizationConfig class. Adding a new format = register it here + write the config + method. Browse the directory listing: each format file (fp8.py, awq.py, gguf.py, mxfp4.py, modelopt.py, torchao.py, compressed_tensors/…) corresponds to one entry.
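A name-to-class registry of this kind might look like the sketch below. All names here (TOY_REGISTRY, register_toy_config, get_toy_config_class, ToyFp8Config, ToyAwqConfig, and the from_config classmethod) are hypothetical and chosen for illustration; the real file should be read for its actual structure.

```python
TOY_REGISTRY = {}


def register_toy_config(name):
    """Class decorator that files a config class under its method name."""
    def decorator(cls):
        if name in TOY_REGISTRY:
            raise ValueError(f"quantization method {name!r} already registered")
        TOY_REGISTRY[name] = cls
        return cls
    return decorator


def get_toy_config_class(name):
    """Look up the config class for a method name, with a helpful error."""
    try:
        return TOY_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(TOY_REGISTRY))
        raise ValueError(
            f"unknown quantization method {name!r}; known: {known}"
        ) from None


@register_toy_config("fp8")
class ToyFp8Config:
    @classmethod
    def from_config(cls, quant_config):
        return cls()


@register_toy_config("awq")
class ToyAwqConfig:
    def __init__(self, group_size=128):
        self.group_size = group_size

    @classmethod
    def from_config(cls, quant_config):
        # Pull format-specific settings out of the checkpoint's config dict.
        return cls(group_size=quant_config.get("group_size", 128))


# A checkpoint's quantization section, reduced to the fields used here.
checkpoint_cfg = {"quant_method": "awq", "group_size": 64}
config_cls = get_toy_config_class(checkpoint_cfg["quant_method"])
config = config_cls.from_config(checkpoint_cfg)
```

Keeping the lookup in one place means the rest of the loader only ever handles a method name and a config object, never a format-specific branch.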

4. Where a Linear layer uses it: linear.py

vllm/model_executor/layers/linear.py:

  • class UnquantizedLinearMethod(LinearMethodBase) (:182) — the default (no quant): apply (:220) is a plain matmul.
  • class LinearBase (:231), ColumnParallelLinear (:410), RowParallelLinear (:1392) — the linear layers models use (also tensor-parallel sharded, Phase 10). In __init__ each asks its QuantizationConfig for a method (get_quant_method) and stores it as self.quant_method; its forward calls self.quant_method.apply(self, x).

So the model never branches on format. It builds ColumnParallelLinear(...), which silently becomes FP8/AWQ/INT4/unquantized depending on the checkpoint. The same LlamaAttention.qkv_proj you saw in Phase 0 is quantized or not purely by which method got attached.
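The layer side of that dispatch can be sketched as follows. Everything here is a hypothetical stand-in (ToyLinear, ToyUnquantizedMethod, ToyInt8Method, ToyInt8Config), and two simplifications are assumed: create_weights receives the weight values directly instead of allocating parameters that a loader fills later, and the layer itself falls back to the unquantized method when there is no config or the config returns None.

```python
class ToyUnquantizedMethod:
    """Default path: keep the weights as floats, apply is a plain matmul."""

    def create_weights(self, layer, weight):
        layer.weight = [list(row) for row in weight]

    def apply(self, layer, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in layer.weight]


class ToyInt8Method:
    """Hypothetical weight-only int8 method with one scale per layer."""

    def create_weights(self, layer, weight):
        amax = max(abs(v) for row in weight for v in row)
        layer.scale = amax / 127 or 1.0
        layer.qweight = [[round(v / layer.scale) for v in row] for row in weight]

    def apply(self, layer, x):
        return [layer.scale * sum(q * xi for q, xi in zip(row, x))
                for row in layer.qweight]


class ToyInt8Config:
    def get_quant_method(self, layer, prefix):
        return ToyInt8Method()


class ToyLinear:
    """The layer never inspects the format; it only holds a method object."""

    def __init__(self, weight, quant_config=None, prefix=""):
        method = quant_config.get_quant_method(self, prefix) if quant_config else None
        self.quant_method = method or ToyUnquantizedMethod()
        self.quant_method.create_weights(self, weight)

    def forward(self, x):
        return self.quant_method.apply(self, x)


W = [[1.0, -0.25], [0.3, 0.75]]
x = [2.0, 4.0]
plain = ToyLinear(W)
quant = ToyLinear(W, quant_config=ToyInt8Config(), prefix="layers.0.qkv_proj")
y_plain = plain.forward(x)
y_quant = quant.forward(x)
```

Both layers are built by the same constructor call and run through the same forward; only the attached method object differs.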

5. The KV cache axis

vllm/model_executor/layers/quantization/kv_cache.py: FP8 KV cache is configured separately (kv_cache_dtype="fp8"). Relative to a 16-bit cache it halves KV bytes/token, which roughly doubles concurrency (Phase 0 lab-02), at a small accuracy cost. It is orthogonal to weight quantization, so you can mix them (e.g. AWQ weights + FP8 KV).
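The halving is simple arithmetic, worked through below. The model shape (32 layers, 8 KV heads, head dimension 128) and the 16 GiB KV budget are assumptions picked for illustration, and the helper kv_bytes_per_token is hypothetical; the estimate ignores any storage for scales and any block-granularity rounding in the cache allocator.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    # Factor 2: one K vector and one V vector per layer and per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Assumed model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

bytes_16bit = kv_bytes_per_token(**shape, bytes_per_element=2)  # fp16 / bf16
bytes_fp8 = kv_bytes_per_token(**shape, bytes_per_element=1)    # fp8

kv_budget = 16 * 1024**3  # assumed 16 GiB left for KV cache after weights

tokens_16bit = kv_budget // bytes_16bit
tokens_fp8 = kv_budget // bytes_fp8

print(f"16-bit KV: {bytes_16bit // 1024} KiB/token, {tokens_16bit} tokens fit")
print(f"fp8 KV:    {bytes_fp8 // 1024} KiB/token, {tokens_fp8} tokens fit")
```

With the same memory budget, twice as many cached tokens fit, which is where the "roughly doubles concurrency" estimate comes from.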

Reading checklist

  • QuantizeMethodBase — what do create_weights and apply each do?
  • get_quant_method — how does a checkpoint's format become a per-layer method?
  • Fp8LinearMethod.apply — find where scales are used and the GEMM is called.
  • In linear.py, how does ColumnParallelLinear acquire and call its quant method?
  • Why is FP8 "W8A8" able to speed the matmul, while AWQ (weight-only) mainly speeds bandwidth?

Now build it: 02-mini-build.md, then the labs.