Phase 06 — Interview Questions: Quantization
Q1. Why does weight quantization speed up decode?
Model answer
Decode is memory-bandwidth-bound on reading the model weights from HBM each step. Storing weights in fewer bits (int4 ≈ ¼ the bytes) means ¼ the HBM traffic per step → higher decode throughput, even when the math is done in higher precision after dequant. It also frees HBM for more KV cache (higher concurrency). Prefill (compute-bound) benefits less unless you also quantize activations (W8A8) to use low-precision tensor cores.
Q2. How do you quantize a tensor to int8, and why do scale granularities matter?
Model answer
Pick a scale s so values fit in int8 range, store round(W/s) and s; reconstruct as s×int8.
Granularity controls error: per-tensor uses one scale (an outlier channel forces a huge scale,
crushing small channels); per-channel gives each output channel its own range; per-group (e.g. 128
weights) is finest, best for 4-bit, at the cost of more stored scales. You measure exactly this in
lab-01.
Q3. GPTQ vs AWQ?
Model answer
Both are post-training, weight-only 4-bit methods using calibration data. GPTQ minimizes layer
output error with second-order (Hessian) info, quantizing and compensating column by column. AWQ
scales the most salient weight channels (those hit by large activations) before rounding to protect
them. Both plug into vLLM as a LinearMethod and use fast 4-bit kernels (Marlin).
Q4. How does vLLM run many formats without the model knowing?
Model answer
A QuantizationConfig parsed from the checkpoint returns a per-layer LinearMethodBase via
get_quant_method. The method's create_weights allocates int weights + scales and apply runs
the (de)quantized matmul. A Linear layer just calls self.quant_method.apply(x) — it never
branches on format. Adding a format = one config + one method class + a registry entry
(quantization/__init__.py). The matmul must use a kernel that understands the format (Phase 7).
Q5. What's FP8 KV cache and when do you use it?
Model answer
Storing the KV cache in FP8 (instead of fp16) halves KV bytes/token, roughly doubling how many concurrent sequences fit (Phase 0 lab-02). It's orthogonal to weight quantization (mix freely). Use it when KV memory caps your concurrency and the small accuracy hit is acceptable; validate on your eval first.
Rapid-fire
- Two methods a format implements?
create_weights,apply. - Weight-only vs W8A8? bandwidth/memory vs also matmul speed.
- 4-bit accuracy trick? per-group scales (+ GPTQ/AWQ calibration).
- Dispatch entry point?
QuantizationConfig.get_quant_method. - FP8 KV cache effect? ~2× concurrency.