Phase 06 — Deep Dive: the quantization dispatch system

Paths relative to upstream/ at v0.22.1 @ 0decac0.

vllm/model_executor/layers/quantization/base_config.py   QuantizationConfig + QuantizeMethodBase
vllm/model_executor/layers/quantization/__init__.py      the registry of all methods
vllm/model_executor/layers/quantization/fp8.py           a complete method (FP8), end to end
vllm/model_executor/layers/quantization/awq.py           AWQ 4-bit weight-only
vllm/model_executor/layers/quantization/compressed_tensors/   the compressed-tensors family
vllm/model_executor/layers/linear.py                     where a Linear layer calls its method

1. The two base abstractions: base_config.py
2. A complete method: FP8 (fp8.py)
3. The registry: __init__.py
4. Where a Linear layer uses it: linear.py
5. The KV cache axis
Reading checklist

1. The two base abstractions: `base_config.py`

vllm/model_executor/layers/quantization/base_config.py:

class QuantizeMethodBase(ABC):           # :19
    def create_weights(self, layer, ...): ...   # :28  allocate quantized weights + scale params
    def apply(self, layer, x, ...) -> Tensor: ...  # :37  run the (de)quantized matmul

class QuantizationConfig(ABC):           # :70
    def get_quant_method(self, layer, prefix) -> QuantizeMethodBase | None: ...  # :151

This is the whole contract. A QuantizationConfig is parsed from the checkpoint (it knows "this is AWQ, group size 128"); for each layer the model builds, get_quant_method returns the right method object. LinearMethodBase is the linear-layer specialization of QuantizeMethodBase (defined in linear.py). Two methods — create_weights and apply — are all a new format needs. That's why vLLM supports a dozen formats: each is one config + one method class.
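To make the contract concrete, here is a minimal pure-Python sketch of the same shape: one config class and one method class. Every name in it (ToyQuantizeMethod, ToyInt8Method, ToyInt8Config, skip_prefixes) is hypothetical and only mirrors the signatures listed above; it is not vLLM's code, and the int8 scheme is just a stand-in for a real format.

```python
from abc import ABC, abstractmethod


class ToyQuantizeMethod(ABC):
    """Stand-in for QuantizeMethodBase: two hooks per format."""

    @abstractmethod
    def create_weights(self, layer, in_features, out_features): ...

    @abstractmethod
    def apply(self, layer, x): ...


class ToyInt8Method(ToyQuantizeMethod):
    """Hypothetical weight-only int8 method with one scale per output row."""

    def create_weights(self, layer, in_features, out_features):
        # Allocate placeholders; a checkpoint loader would fill them in later.
        layer.qweight = [[0] * in_features for _ in range(out_features)]
        layer.scales = [1.0] * out_features

    def apply(self, layer, x):
        # Dequantize on the fly: y[o] = scale[o] * sum_i qweight[o][i] * x[i]
        return [s * sum(q * xi for q, xi in zip(row, x))
                for row, s in zip(layer.qweight, layer.scales)]


class ToyInt8Config:
    """Stand-in for a QuantizationConfig parsed from checkpoint metadata."""

    def __init__(self, skip_prefixes=()):
        self.skip_prefixes = tuple(skip_prefixes)

    def get_quant_method(self, layer, prefix):
        # Assumption for this sketch: None means "not handled by this config".
        if prefix.startswith(self.skip_prefixes):
            return None
        return ToyInt8Method()


class ToyLayer:
    """Bare object that the method attaches parameters to."""


config = ToyInt8Config(skip_prefixes=("lm_head",))
layer = ToyLayer()
method = config.get_quant_method(layer, "model.layers.0.mlp.down_proj")
method.create_weights(layer, in_features=2, out_features=2)
layer.qweight = [[1, 2], [3, 4]]   # pretend these came from the checkpoint
layer.scales = [0.5, 0.25]
y = method.apply(layer, [2.0, 1.0])
skipped = config.get_quant_method(layer, "lm_head")
```

Note that the layer object carries the parameters while the method object carries the behaviour, which is what lets one method class serve every layer of a model.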

2. A complete method: FP8 (`fp8.py`)

class Fp8Config(QuantizationConfig) (:100) — parses FP8 settings from the checkpoint.
class Fp8LinearMethod(LinearMethodBase) (:261):
- create_weights (:316) — allocates the fp8 weight tensor and its scale(s) on the layer.
- apply (:437) — runs the FP8 matmul (dequantizing / using FP8 tensor cores), with the scales.

Read Fp8LinearMethod.apply and notice that it dispatches to an FP8 GEMM kernel (CUTLASS / scaled mm, Phase 7). The method owns the numerics; the kernel does the math. FP8 also supports weight + activation quantization (W8A8): it can quantize the activation x as well and run the matmul on FP8 tensor cores, which is why FP8 can speed up prefill, not just decode.
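The role of the scales in a W8A8 matmul can be sketched in pure Python. This is an illustration of the arithmetic only, under stated assumptions: per-tensor scaling, a crude E4M3 rounding that keeps 4 significant bits and ignores subnormals, and hypothetical helper names (round_to_e4m3, quantize_per_tensor, w8a8_dot). It is not what fp8.py does line by line; a real kernel performs the multiply-accumulate in hardware.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format


def round_to_e4m3(v):
    """Crude E4M3 rounding: clamp, then keep 4 significant bits.

    Simplification: subnormals and the minimum exponent are ignored.
    """
    if v == 0.0:
        return 0.0
    v = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))
    m, e = math.frexp(v)                 # v == m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)


def quantize_per_tensor(values):
    """Map the largest magnitude onto FP8_E4M3_MAX; return (q_values, scale)."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def w8a8_dot(x, w):
    """Quantize both operands, multiply in the quantized domain, rescale once."""
    qx, sx = quantize_per_tensor(x)      # activation scale (computed at runtime)
    qw, sw = quantize_per_tensor(w)      # weight scale (stored with the layer)
    acc = sum(a * b for a, b in zip(qx, qw))
    return acc * sx * sw                 # both scales are applied to the result


x = [0.9, -1.7, 0.3, 2.4]
w = [0.05, 0.11, -0.08, 0.02]
approx = w8a8_dot(x, w)
exact = sum(a * b for a, b in zip(x, w))
```

The point to take from the sketch is that the scales are applied once, to the accumulated result, so the inner loop runs entirely on low-precision values.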

3. The registry: `__init__.py`

vllm/model_executor/layers/quantization/__init__.py maps a quant method name (from the checkpoint's config, e.g. "fp8", "awq", "compressed-tensors", "gptq_marlin", "gguf", "modelopt", "torchao") to its QuantizationConfig class. Adding a new format = register it here + write the config + method. Browse the directory listing — every file (fp8.py, awq.py, gguf.py, mxfp4.py, modelopt.py, torchao.py, compressed_tensors/…) is one entry.
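A name-to-class registry of this kind can be sketched in a few lines. The names below (_REGISTRY, register_quantization, get_config_class, ToyFp8Config, "toy_fp8") are hypothetical and chosen for illustration; check __init__.py for the actual mechanism and function names.

```python
_REGISTRY = {}  # quant method name -> config class


def register_quantization(name):
    """Class decorator that records a config class under a method name."""
    def decorator(cls):
        if name in _REGISTRY:
            raise ValueError(f"quantization method {name!r} already registered")
        _REGISTRY[name] = cls
        return cls
    return decorator


def get_config_class(name):
    """Look up the config class for the name found in a checkpoint's config."""
    try:
        return _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(
            f"unknown quantization method {name!r}; known: {known}"
        ) from None


@register_quantization("toy_fp8")
class ToyFp8Config:
    """Hypothetical config; a real one would parse many more fields."""

    def __init__(self, activation_scheme):
        self.activation_scheme = activation_scheme

    @classmethod
    def from_dict(cls, raw):
        return cls(activation_scheme=raw.get("activation_scheme", "dynamic"))


# A checkpoint names its method; the registry turns that name into a config.
checkpoint_quant = {"quant_method": "toy_fp8", "activation_scheme": "static"}
config_cls = get_config_class(checkpoint_quant["quant_method"])
config = config_cls.from_dict(checkpoint_quant)
```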

4. Where a `Linear` layer uses it: `linear.py`

vllm/model_executor/layers/linear.py:

class UnquantizedLinearMethod(LinearMethodBase) (:182) — the default (no quant): apply (:220) is a plain matmul.
class LinearBase (:231), ColumnParallelLinear (:410), RowParallelLinear (:1392) — the linear layers models use (also tensor-parallel sharded, Phase 10). In __init__ each asks its QuantizationConfig for a method (get_quant_method) and stores it as self.quant_method; its forward calls self.quant_method.apply(self, x).

So the model never branches on format. It builds ColumnParallelLinear(...), which silently becomes FP8/AWQ/INT4/unquantized depending on the checkpoint. The same LlamaAttention.qkv_proj you saw in Phase 0 is quantized or not purely by which method got attached.
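The dispatch described here can be sketched with a toy layer. All classes below (PlainMethod, ToyScaledMethod, ToyScaledConfig, ToyLinear) are hypothetical stand-ins written in plain Python, and the fallback to the plain method when no config is given is an assumption of this sketch, not a statement about how linear.py handles that case.

```python
class PlainMethod:
    """Stand-in for the unquantized default: a plain matmul."""

    def create_weights(self, layer, in_features, out_features):
        layer.weight = [[0.0] * in_features for _ in range(out_features)]

    def apply(self, layer, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in layer.weight]


class ToyScaledMethod:
    """Stand-in for a quantized method: integer weights plus one scale."""

    def create_weights(self, layer, in_features, out_features):
        layer.qweight = [[0] * in_features for _ in range(out_features)]
        layer.scale = 1.0

    def apply(self, layer, x):
        return [layer.scale * sum(q * xi for q, xi in zip(row, x))
                for row in layer.qweight]


class ToyScaledConfig:
    def get_quant_method(self, layer, prefix):
        return ToyScaledMethod()


class ToyLinear:
    """The layer never inspects the format; it only holds a method object."""

    def __init__(self, in_features, out_features, quant_config=None, prefix=""):
        method = None
        if quant_config is not None:
            method = quant_config.get_quant_method(self, prefix)
        # Sketch assumption: no config (or no method) means unquantized.
        self.quant_method = method if method is not None else PlainMethod()
        self.quant_method.create_weights(self, in_features, out_features)

    def forward(self, x):
        return self.quant_method.apply(self, x)


# Identical "model code", two different checkpoints.
plain = ToyLinear(2, 1)
plain.weight = [[1.0, 2.0]]

quant = ToyLinear(2, 1, quant_config=ToyScaledConfig(), prefix="qkv_proj")
quant.qweight = [[2, 4]]
quant.scale = 0.5

out_plain = plain.forward([3.0, 4.0])
out_quant = quant.forward([3.0, 4.0])
```

Both layers return the same result here because the scaled integers encode the same weights; only the attached method differs.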

5. The KV cache axis

vllm/model_executor/layers/quantization/kv_cache.py: FP8 KV cache is configured separately (kv_cache_dtype="fp8"). Relative to a 16-bit cache it halves KV bytes/token, which roughly doubles concurrency (Phase 0 lab-02), at a small accuracy cost. It's orthogonal to weight quantization: you can mix the two (e.g. AWQ weights + FP8 KV).
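The "halves bytes, doubles capacity" claim is simple arithmetic, worked through below. The model shape (32 layers, 8 KV heads, head dimension 128) and the 16 GiB cache budget are assumed values picked for illustration, and the small overhead of storing FP8 scales is ignored.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

bytes_16bit = kv_bytes_per_token(**shape, bytes_per_elem=2)  # fp16 / bf16
bytes_fp8 = kv_bytes_per_token(**shape, bytes_per_elem=1)    # fp8

cache_budget = 16 * 1024**3  # assumed 16 GiB left over for the KV cache

tokens_16bit = cache_budget // bytes_16bit
tokens_fp8 = cache_budget // bytes_fp8

print(f"16-bit KV: {bytes_16bit // 1024} KiB/token, {tokens_16bit} tokens fit")
print(f"fp8 KV:    {bytes_fp8 // 1024} KiB/token, {tokens_fp8} tokens fit")
```

Under these assumptions the cache holds 131072 tokens at 16 bits and 262144 at FP8, so the same memory serves twice as many tokens in flight.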

Reading checklist

QuantizeMethodBase — what do create_weights and apply each do?
get_quant_method — how does a checkpoint's format become a per-layer method?
Fp8LinearMethod.apply — find where scales are used and the GEMM is called.
In linear.py, how does ColumnParallelLinear acquire and call its quant method?
Why can FP8 "W8A8" speed up the matmul itself, while AWQ (weight-only) mainly reduces memory-bandwidth cost?

Now build it: 02-mini-build.md, then the labs.

vLLM Mastery — From Zero to Maintainer