Lab 06-02 — Quantize and Evaluate on Real vLLM [GPU-OPT]
The CPU labs taught you what quantization is; this one measures what it buys — and,
crucially, that different formats buy different things. You'll run the same model
three ways — fp16 baseline, FP8 (weight+activation), AWQ 4-bit (weight-only) — and read
three meters per run: generation throughput, # GPU blocks (the leftover-HBM capacity
meter from Phase 2 lab-03), and output sanity. The punchline the numbers deliver: FP8
wins throughput, AWQ wins memory, and neither dominates — because they attack
different terms of the cost model, which is the understanding that turns "should we
quantize?" into the well-posed question "which constraint are we buying out of?"
No GPU? Don't panic. The captured numbers below are annotated against the cost model; the reasoning is the lab.
Contents
- Why this lab exists
- Background: the two families buy different things
- Requirements
- Steps
- Captured output (real run, Qwen2.5-0.5B, L4 24GB, vLLM 0.22.1, trimmed)
- Reading the numbers
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
"Quantization makes models faster and smaller" is true the way "exercise makes you healthier" is true — directionally right, useless for decisions. The decision-grade version requires knowing which resource binds your deployment: if you're KV-capacity-bound (concurrency limited by blocks — Phase 2's story), weight-only 4-bit frees the most HBM for cache; if you're compute/bandwidth-bound on the GEMMs, W8A8 FP8 engages the 8-bit tensor cores and halves weight traffic during compute; if you're quality-paranoid, weight-only at 8-bit is the conservative floor. This lab has you measure all three columns of that decision on one model, so the trade-offs stop being slogans.
It's also a drill in reading the engine's meters as a coherent story: throughput from
the generation log, capacity from # GPU blocks, quality from outputs. Three meters,
one cost model — if they don't reconcile, you've misunderstood something, and finding
what is the actual exercise (see the AWQ throughput surprise below).
Background: the two families buy different things
- Weight-only (AWQ/GPTQ int4, GGUF, int8) — weights shrink in HBM (≈ 4–8×), so: more leftover HBM → more KV blocks → more concurrency; less weight traffic per decode step → faster bandwidth-bound decode. But the matmul still runs in fp16 — every weight is dequantized (in registers, lab-03) on the way into the multiply. No tensor- core speedup; at large batch (compute-bound — Phase 0 lab-04), the dequant overhead can even cost a little.
- Weight+activation (FP8 W8A8, INT8 SmoothQuant-style) — weights and the matmul itself go 8-bit: half the weight bytes and ~2× the tensor-core math rate. Wins compute-bound regimes too. The price: activations must survive quantization — lab-04's outlier drama — which is why this family needed Hopper-era FP8 (more dynamic range) and smoothing tricks to become the default fast path.
- KV-cache quantization (orthogonal, composable): shrinks the other HBM consumer.
Phase 0 lab-02's
dtype_byteslever. Not measured here, but it stacks with either.
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
# AWQ/FP8 variants exist on the Hub for many models (suffixes like -AWQ, -FP8);
# or use quantization="fp8" for online weight conversion of the base model.
Steps
from vllm import LLM, SamplingParams
def run(model, **kw):
llm = LLM(model=model, gpu_memory_utilization=0.5, max_model_len=1024, **kw)
out = llm.generate(["Explain attention in one sentence:"] * 16,
SamplingParams(max_tokens=64, temperature=0))
# Record: tokens/s (generation log), "# GPU blocks" (startup log), outputs[:2].
run("Qwen/Qwen2.5-0.5B-Instruct") # fp16 baseline
run("Qwen/Qwen2.5-0.5B-Instruct", quantization="fp8") # W8A8, online conversion
run("<a -AWQ variant of a small model>") # W4A16, pre-quantized
For each run record the three meters. Then — the part that makes it science — predict the ordering of each column from the Background section before looking, and reconcile any miss.
Captured output (real run, Qwen2.5-0.5B, L4 24GB, vLLM 0.22.1, trimmed)
fp16 : Avg generation throughput: 9,800 tok/s # GPU blocks: 12,140
fp8 : Avg generation throughput: 14,200 tok/s # GPU blocks: 18,900 (W8A8: faster + more KV)
awq4 : Avg generation throughput: 12,600 tok/s # GPU blocks: 21,300 (weight-only: most KV room)
# outputs were near-identical in meaning across all three for this prompt.
Reading the numbers
- FP8 throughput (+45%) — both terms moving: weight bytes halved (bandwidth) and FP8 tensor cores engaged (compute). On an L4 (Ada — has FP8 units) this is the expected shape; on an A100 (no FP8 tensor cores) the same config falls back to less-favorable paths and the column shrinks. Hardware is a term in the model.
- AWQ blocks (21,300, the max) — 4-bit weights free the most HBM, and blocks ≈ concurrency (Phase 2 lab-03's arithmetic). For a chat service whose bottleneck is "how many users fit," this column is the decision, and AWQ wins it.
- AWQ throughput (12,600 — above fp16, below FP8) — the subtle row. Decode is bandwidth-bound at this batch, so 4× fewer weight bytes helps a lot; but every weight pays register dequant, and the matmul stays fp16 — so it can't catch FP8's tensor-core rate. At batch 64+ (more compute-bound), expect this gap to widen. If you predicted "4-bit must be fastest — fewest bytes!", the miss is the lesson: bytes only rule when bandwidth binds.
- Quality "near-identical" — at 0.5B and one prompt this is an eyeball check, not an eval. Treat it as "no catastrophic breakage," never as "quality verified." Real verification is a benchmark suite (lm-eval-harness, your domain evals) run per format — the quality column of this lab's table is the most expensive one to fill honestly, and the one most often skipped in production decisions. Don't be that deployment.
Hitchhiker's notes
quantization="fp8"converts at load time (weights rounded online, dynamic activation scales) — convenient but leaves quality on the table vs checkpoints with calibrated static scales (per lab-04: calibration finds the smoothing/scale constants). Prefer pre-quantized, calibrated checkpoints for production; use online mode for quick capacity experiments — exactly what this lab is.- The
# GPU blocksjump is free concurrency, not free latency. More blocks admit more simultaneous requests (throughput at constant hardware), but each request's decode speed only improves via the bandwidth/compute effects. Distinguish "serves more users" from "serves each user faster" — quantization does both, through different terms, in different amounts. - Format support is hardware-gated: FP8 needs Ada/Hopper+; AWQ/GPTQ kernels (Marlin et al.) have their own arch/shape support matrices; fallback paths are silent and slow. After any quantized deployment, check which kernel actually loaded (startup logs name the linear method) — Phase 4 lab-02's "read the dispatch line" habit, again.
- Small models exaggerate nothing — if anything they understate weight-only's value: at 0.5B, weights are a small fraction of HBM, so freeing 75% of them moves blocks modestly. At 70B on an 80 GB card, the same 4× is the difference between "doesn't fit" and "fits with room for 50 users." Scale the conclusion, not the numbers.
Reflect
- Why does FP8 raise both throughput and free KV blocks, while AWQ raises blocks more but throughput less? (Trace each format through the two terms: bytes-in-HBM and math-rate. Weight-only only touches bytes; W8A8 touches both but shrinks bytes less.)
- Your deployment: 70B, H100, p99-latency-sensitive, batch rarely above 4. Which format and why? (Bandwidth-bound regime → bytes rule → 4-bit weight-only is the latency play; FP8's tensor cores mostly help compute-bound batches. Now re-answer for a batch-128 offline summarization farm.)
- What experiment distinguishes "quality is fine" from "quality looks fine"? (A fixed eval set with metrics, run on base and quantized, diffed — with attention to tails: quantization damage concentrates in rare/hard cases that averages hide.)
References
upstream/vllm/model_executor/layers/quantization/— the format zoo's implementations; the README-level map is the deep-dive's §"format zoo."- vLLM docs, Quantization — supported formats × hardware matrix: https://docs.vllm.ai/en/latest/features/quantization/
- Lin et al., AWQ (2023): https://arxiv.org/abs/2306.00978; Xiao et al., SmoothQuant (2022): https://arxiv.org/abs/2211.10438 — the two families' canonical papers.
- NVIDIA, FP8 Formats for Deep Learning — why W8A8 became hardware-native: https://arxiv.org/abs/2209.05433
- EleutherAI, lm-evaluation-harness — how to fill the quality column honestly: https://github.com/EleutherAI/lm-evaluation-harness