Phase 06 — Deep Dive: the quantization dispatch system
Paths relative to upstream/ at v0.22.1 @ 0decac0.
- vllm/model_executor/layers/quantization/base_config.py: QuantizationConfig + QuantizeMethodBase
- vllm/model_executor/layers/quantization/__init__.py: the registry of all methods
- vllm/model_executor/layers/quantization/fp8.py: a complete method (FP8), end to end
- vllm/model_executor/layers/quantization/awq.py: AWQ 4-bit weight-only
- vllm/model_executor/layers/quantization/compressed_tensors/: the compressed-tensors family
- vllm/model_executor/layers/linear.py: where a Linear layer calls its method
Contents
- 1. The two base abstractions: base_config.py
- 2. A complete method: FP8 (fp8.py)
- 3. The registry: __init__.py
- 4. Where a Linear layer uses it: linear.py
- 5. The KV cache axis
- Reading checklist
1. The two base abstractions: base_config.py
vllm/model_executor/layers/quantization/base_config.py:
class QuantizeMethodBase(ABC):                                    # :19
    def create_weights(self, layer, ...): ...                     # :28 allocate quantized weights + scale params
    def apply(self, layer, x, ...) -> Tensor: ...                 # :37 run the (de)quantized matmul
class QuantizationConfig(ABC):                                    # :70
    def get_quant_method(self, layer, prefix) -> QuantizeMethodBase | None: ...  # :151
This is the whole contract. A QuantizationConfig is parsed from the checkpoint (it knows "this
is AWQ, group size 128"); for each layer the model builds, get_quant_method returns the right
method object. LinearMethodBase is the linear-layer specialization of QuantizeMethodBase
(defined in linear.py). Two methods — create_weights and apply — are all a new format
needs. That's why vLLM supports a dozen formats: each is one config + one method class.
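The contract above can be sketched in plain Python. This is a toy under stated assumptions, not vLLM's code: the `Toy*` classes, the dict used as a "layer", and the `skip_prefixes` option are all hypothetical, and lists stand in for tensors. It shows the one idea that matters here: the config decides, per layer prefix, which method object a layer gets.

```python
from abc import ABC, abstractmethod


class QuantizeMethodBase(ABC):
    """Toy stand-in for the per-layer method contract (two methods)."""

    @abstractmethod
    def create_weights(self, layer, in_features, out_features): ...

    @abstractmethod
    def apply(self, layer, x): ...


class ToyUnquantizedMethod(QuantizeMethodBase):
    def create_weights(self, layer, in_features, out_features):
        layer["weight"] = [[0.0] * in_features for _ in range(out_features)]

    def apply(self, layer, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in layer["weight"]]


class ToyInt8Method(QuantizeMethodBase):
    def create_weights(self, layer, in_features, out_features):
        # Integer codes plus a single scale, filled in later by weight loading.
        layer["qweight"] = [[0] * in_features for _ in range(out_features)]
        layer["scale"] = 1.0

    def apply(self, layer, x):
        s = layer["scale"]
        return [s * sum(q * xi for q, xi in zip(row, x)) for row in layer["qweight"]]


class QuantizationConfig(ABC):
    @abstractmethod
    def get_quant_method(self, layer, prefix): ...


class ToyInt8Config(QuantizationConfig):
    """Hypothetical config: quantize everything except skipped prefixes."""

    def __init__(self, skip_prefixes=("lm_head",)):
        self.skip_prefixes = tuple(skip_prefixes)

    def get_quant_method(self, layer, prefix):
        if prefix.startswith(self.skip_prefixes):
            return ToyUnquantizedMethod()
        return ToyInt8Method()


config = ToyInt8Config()
mlp_layer, head_layer = {}, {}
mlp_method = config.get_quant_method(mlp_layer, "model.layers.0.mlp.down_proj")
head_method = config.get_quant_method(head_layer, "lm_head")

mlp_method.create_weights(mlp_layer, in_features=2, out_features=1)
mlp_layer["qweight"], mlp_layer["scale"] = [[3, -2]], 0.5  # pretend these were loaded
mlp_out = mlp_method.apply(mlp_layer, [1.0, 2.0])  # 0.5 * (3*1 - 2*2) = -0.5
```

A new format in this toy world is one more `QuantizationConfig` subclass plus one more method class; nothing else changes.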
2. A complete method: FP8 (fp8.py)
- class Fp8Config(QuantizationConfig) (:100): parses FP8 settings from the checkpoint.
- class Fp8LinearMethod(LinearMethodBase) (:261):
  - create_weights (:316): allocates the fp8 weight tensor and its scale(s) on the layer.
  - apply (:437): runs the FP8 matmul (dequantizing / using FP8 tensor cores), with the scales.
Read Fp8LinearMethod.apply and notice it dispatches to an FP8 GEMM kernel (CUTLASS / scaled mm,
Phase 7). The method owns the numerics; the kernel does the math. FP8 is also
weight+activation capable (W8A8): it can quantize the activation x too and run the matmul on
FP8 tensor cores. That raises compute throughput, not just memory bandwidth, which is why FP8
can speed up compute-bound prefill and not just decode.
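The W8A8 idea (quantize both operands, multiply in low precision, rescale once) can be sketched without any GPU code. This is an illustration only, with loud assumptions: integer rounding on a [-127, 127] grid stands in for real FP8 rounding, which is non-uniform, and `quantize_per_tensor` and `scaled_matmul` are hypothetical helper names, not functions from fp8.py.

```python
def quantize_per_tensor(values, qmax=127):
    """Symmetric per-tensor quantization: returns (integer codes, scale)."""
    amax = max(abs(v) for row in values for v in row)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [[round(v / scale) for v in row] for row in values]
    return codes, scale


def scaled_matmul(xq, x_scale, wq, w_scale):
    """Low-precision product, then one rescale: y = (xq @ wq^T) * sx * sw."""
    return [
        [sum(a * b for a, b in zip(x_row, w_row)) * x_scale * w_scale
         for w_row in wq]
        for x_row in xq
    ]


x = [[1.0, -2.0]]                   # activations, shape (1, 2)
w = [[0.5, 0.25], [-1.0, 1.0]]      # weights, shape (out=2, in=2)

xq, sx = quantize_per_tensor(x)     # activation quantized at run time
wq, sw = quantize_per_tensor(w)     # weight quantized ahead of time
y = scaled_matmul(xq, sx, wq, sw)

# Full-precision reference for comparison.
y_ref = [[sum(a * b for a, b in zip(x[0], w_row)) for w_row in w]]
```

The inner sum touches only the small codes; the scales are applied once per output element. That split mirrors "the method owns the numerics; the kernel does the math".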
3. The registry: __init__.py
vllm/model_executor/layers/quantization/__init__.py maps a quant method name (from the
checkpoint's config, e.g. "fp8", "awq", "compressed-tensors", "gptq_marlin", "gguf",
"modelopt", "torchao") to its QuantizationConfig class. Adding a new format = register it
here + write the config + method. Browse the directory listing — every file (fp8.py, awq.py,
gguf.py, mxfp4.py, modelopt.py, torchao.py, compressed_tensors/…) is one entry.
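A name-to-class registry of this kind can be sketched as a plain dict. Everything named here is hypothetical (`TOY_REGISTRY`, `lookup_config_class`, `config_from_checkpoint`, the `Toy*` classes), and the config keys `quant_method`, `bits`, `group_size`, and `activation_scheme` are assumed for illustration; check __init__.py for the real names.

```python
class ToyFp8Config:
    def __init__(self, activation_scheme="dynamic"):
        self.activation_scheme = activation_scheme

    @classmethod
    def from_config(cls, cfg):
        return cls(cfg.get("activation_scheme", "dynamic"))


class ToyAwqConfig:
    def __init__(self, weight_bits, group_size):
        self.weight_bits = weight_bits
        self.group_size = group_size

    @classmethod
    def from_config(cls, cfg):
        return cls(cfg["bits"], cfg["group_size"])


# The registry: quant method name -> config class.
TOY_REGISTRY = {
    "fp8": ToyFp8Config,
    "awq": ToyAwqConfig,
}


def lookup_config_class(name):
    """Resolve a method name, failing loudly on unknown formats."""
    try:
        return TOY_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(TOY_REGISTRY))
        raise ValueError(f"unknown quantization method {name!r}; known: {known}") from None


def config_from_checkpoint(hf_config):
    """Build a config object from a checkpoint config dict, or None if unquantized."""
    section = hf_config.get("quantization_config")
    if section is None:
        return None
    return lookup_config_class(section["quant_method"]).from_config(section)


awq_cfg = config_from_checkpoint(
    {"quantization_config": {"quant_method": "awq", "bits": 4, "group_size": 128}}
)
plain_cfg = config_from_checkpoint({"hidden_size": 4096})
```

Registering a new format in this sketch is a single new dict entry plus its config class.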
4. Where a Linear layer uses it: linear.py
vllm/model_executor/layers/linear.py:
- class UnquantizedLinearMethod(LinearMethodBase) (:182): the default (no quant); apply (:220) is a plain matmul.
- class LinearBase (:231), ColumnParallelLinear (:410), RowParallelLinear (:1392): the linear layers models use (also tensor-parallel sharded, Phase 10). In __init__ each asks its QuantizationConfig for a method (get_quant_method) and stores it as self.quant_method; its forward calls self.quant_method.apply(self, x).
So the model never branches on format. It builds ColumnParallelLinear(...), which silently
becomes FP8/AWQ/INT4/unquantized depending on the checkpoint. The same LlamaAttention.qkv_proj
you saw in Phase 0 is quantized or not purely by which method got attached.
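The layer side of that handshake can be sketched as follows. This is a toy under assumptions, not linear.py: `ToyLinear`, `build_qkv_proj`, and the method classes are hypothetical, lists stand in for tensors, and setting attributes by hand stands in for weight loading. The point is that the model-building function is identical in both cases.

```python
class ToyPlainMethod:
    """Default path: float weights, plain dot products."""

    def create_weights(self, layer, in_features, out_features):
        layer.weight = [[0.0] * in_features for _ in range(out_features)]

    def apply(self, layer, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in layer.weight]


class ToyScaledIntMethod:
    """Quantized path: integer codes plus one scale, rescaled in apply."""

    def create_weights(self, layer, in_features, out_features):
        layer.qweight = [[0] * in_features for _ in range(out_features)]
        layer.scale = 1.0

    def apply(self, layer, x):
        return [layer.scale * sum(q * xi for q, xi in zip(row, x))
                for row in layer.qweight]


class ToyQuantConfig:
    def get_quant_method(self, layer, prefix):
        return ToyScaledIntMethod()


class ToyLinear:
    def __init__(self, in_features, out_features, quant_config=None, prefix=""):
        # Ask the config for a method; fall back to the unquantized default.
        if quant_config is None:
            self.quant_method = ToyPlainMethod()
        else:
            self.quant_method = quant_config.get_quant_method(self, prefix)
        # The method, not the layer, decides which parameters exist.
        self.quant_method.create_weights(self, in_features, out_features)

    def forward(self, x):
        return self.quant_method.apply(self, x)


def build_qkv_proj(quant_config=None):
    """Model code: no branching on format anywhere in here."""
    return ToyLinear(2, 1, quant_config=quant_config, prefix="attn.qkv_proj")


x = [1.0, 2.0]

plain = build_qkv_proj()
plain.weight = [[1.5, -1.0]]             # stand-in for loading float weights
plain_out = plain.forward(x)             # 1.5*1 - 1.0*2 = -0.5

quant = build_qkv_proj(ToyQuantConfig())
quant.qweight, quant.scale = [[3, -2]], 0.5  # stand-in for loading quantized weights
quant_out = quant.forward(x)             # 0.5 * (3*1 - 2*2) = -0.5
```

The two layers hold different parameters (`weight` versus `qweight` and `scale`) yet come from the same constructor call, which is the sense in which the layer "becomes" quantized.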
5. The KV cache axis
vllm/model_executor/layers/quantization/kv_cache.py: the FP8 KV cache is configured separately
(kv_cache_dtype="fp8"). Relative to a 16-bit cache it halves KV bytes/token, which roughly
doubles concurrency for a fixed memory budget (Phase 0 lab-02), at a small accuracy cost. It's
orthogonal to weight quantization: you can mix (e.g. AWQ weights + FP8 KV).
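The "halves bytes, doubles capacity" arithmetic is easy to check by hand. The sketch below assumes an illustrative model shape (32 layers, 8 KV heads, head dimension 128) and an 8 GiB cache budget; neither number comes from this document. It also ignores the small extra storage for scales and any block-granularity rounding.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all layers."""
    # The factor 2 is one K vector plus one V vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(budget_bytes, bytes_per_token):
    """How many tokens fit in a fixed KV cache budget."""
    return budget_bytes // bytes_per_token


# Assumed, illustrative shape and budget.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 8 * 2**30  # 8 GiB reserved for KV cache

fp16_per_token = kv_bytes_per_token(**shape, bytes_per_elem=2)  # 131072 B = 128 KiB
fp8_per_token = kv_bytes_per_token(**shape, bytes_per_elem=1)   # 65536 B = 64 KiB

fp16_tokens = max_cached_tokens(budget, fp16_per_token)  # 65536 tokens
fp8_tokens = max_cached_tokens(budget, fp8_per_token)    # 131072 tokens
```

Weight format never enters the formula, which is the sense in which the KV cache axis is orthogonal to weight quantization.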
Reading checklist
- QuantizeMethodBase: what do create_weights and apply each do?
- get_quant_method: how does a checkpoint's format become a per-layer method?
- Fp8LinearMethod.apply: find where scales are used and the GEMM is called.
- In linear.py, how does ColumnParallelLinear acquire and call its quant method?
- Why is FP8 "W8A8" able to speed the matmul, while AWQ (weight-only) mainly saves memory bandwidth?
Now build it: 02-mini-build.md, then the labs.