Phase 06 — Interview Questions: Quantization

Q1. Why does weight quantization speed up decode?

Model answer

Decode is memory-bandwidth-bound on reading the model weights from HBM each step. Storing weights in fewer bits (int4 ≈ ¼ the bytes) means ¼ the HBM traffic per step → higher decode throughput, even when the math is done in higher precision after dequant. It also frees HBM for more KV cache (higher concurrency). Prefill (compute-bound) benefits less unless you also quantize activations (W8A8) to use low-precision tensor cores.

Q2. How do you quantize a tensor to int8, and why do scale granularities matter?

Model answer

Pick a scale s so values fit in int8 range, store round(W/s) and s; reconstruct as s×int8. Granularity controls error: per-tensor uses one scale (an outlier channel forces a huge scale, crushing small channels); per-channel gives each output channel its own range; per-group (e.g. 128 weights) is finest, best for 4-bit, at the cost of more stored scales. You measure exactly this in lab-01.

Q3. GPTQ vs AWQ?

Model answer

Both are post-training, weight-only 4-bit methods using calibration data. GPTQ minimizes layer output error with second-order (Hessian) info, quantizing and compensating column by column. AWQ scales the most salient weight channels (those hit by large activations) before rounding to protect them. Both plug into vLLM as a LinearMethod and use fast 4-bit kernels (Marlin).

Q4. How does vLLM run many formats without the model knowing?

Model answer

A QuantizationConfig parsed from the checkpoint returns a per-layer LinearMethodBase via get_quant_method. The method's create_weights allocates int weights + scales and apply runs the (de)quantized matmul. A Linear layer just calls self.quant_method.apply(x) — it never branches on format. Adding a format = one config + one method class + a registry entry (quantization/__init__.py). The matmul must use a kernel that understands the format (Phase 7).

Q5. What's FP8 KV cache and when do you use it?

Model answer

Storing the KV cache in FP8 (instead of fp16) halves KV bytes/token, roughly doubling how many concurrent sequences fit (Phase 0 lab-02). It's orthogonal to weight quantization (mix freely). Use it when KV memory caps your concurrency and the small accuracy hit is acceptable; validate on your eval first.

Rapid-fire

Two methods a format implements? create_weights, apply.
Weight-only vs W8A8? bandwidth/memory vs also matmul speed.
4-bit accuracy trick? per-group scales (+ GPTQ/AWQ calibration).
Dispatch entry point? QuantizationConfig.get_quant_method.
FP8 KV cache effect? ~2× concurrency.

Keyboard shortcuts

vLLM Mastery — From Zero to Maintainer

Phase 06 — Interview Questions: Quantization

Q1. Why does weight quantization speed up decode?

Q2. How do you quantize a tensor to int8, and why do scale granularities matter?

Q3. GPTQ vs AWQ?

Q4. How does vLLM run many formats without the model knowing?

Q5. What's FP8 KV cache and when do you use it?

Rapid-fire