Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 06-02 — Quantize and Evaluate on Real vLLM [GPU-OPT]

The CPU labs taught you what quantization is; this one measures what it buys — and, crucially, that different formats buy different things. You'll run the same model three ways — fp16 baseline, FP8 (weight+activation), AWQ 4-bit (weight-only) — and read three meters per run: generation throughput, # GPU blocks (the leftover-HBM capacity meter from Phase 2 lab-03), and output sanity. The punchline the numbers deliver: FP8 wins throughput, AWQ wins memory, and neither dominates — because they attack different terms of the cost model, which is the understanding that turns "should we quantize?" into the well-posed question "which constraint are we buying out of?"

No GPU? Don't panic. The captured numbers below are annotated against the cost model; the reasoning is the lab.

Contents


Why this lab exists

"Quantization makes models faster and smaller" is true the way "exercise makes you healthier" is true — directionally right, useless for decisions. The decision-grade version requires knowing which resource binds your deployment: if you're KV-capacity-bound (concurrency limited by blocks — Phase 2's story), weight-only 4-bit frees the most HBM for cache; if you're compute/bandwidth-bound on the GEMMs, W8A8 FP8 engages the 8-bit tensor cores and halves weight traffic during compute; if you're quality-paranoid, weight-only at 8-bit is the conservative floor. This lab has you measure all three columns of that decision on one model, so the trade-offs stop being slogans.

It's also a drill in reading the engine's meters as a coherent story: throughput from the generation log, capacity from # GPU blocks, quality from outputs. Three meters, one cost model — if they don't reconcile, you've misunderstood something, and finding what is the actual exercise (see the AWQ throughput surprise below).

Background: the two families buy different things

  • Weight-only (AWQ/GPTQ int4, GGUF, int8) — weights shrink in HBM (≈ 4–8×), so: more leftover HBM → more KV blocks → more concurrency; less weight traffic per decode step → faster bandwidth-bound decode. But the matmul still runs in fp16 — every weight is dequantized (in registers, lab-03) on the way into the multiply. No tensor- core speedup; at large batch (compute-bound — Phase 0 lab-04), the dequant overhead can even cost a little.
  • Weight+activation (FP8 W8A8, INT8 SmoothQuant-style) — weights and the matmul itself go 8-bit: half the weight bytes and ~2× the tensor-core math rate. Wins compute-bound regimes too. The price: activations must survive quantization — lab-04's outlier drama — which is why this family needed Hopper-era FP8 (more dynamic range) and smoothing tricks to become the default fast path.
  • KV-cache quantization (orthogonal, composable): shrinks the other HBM consumer. Phase 0 lab-02's dtype_bytes lever. Not measured here, but it stacks with either.

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
# AWQ/FP8 variants exist on the Hub for many models (suffixes like -AWQ, -FP8);
# or use quantization="fp8" for online weight conversion of the base model.

Steps

from vllm import LLM, SamplingParams

def run(model, **kw):
    llm = LLM(model=model, gpu_memory_utilization=0.5, max_model_len=1024, **kw)
    out = llm.generate(["Explain attention in one sentence:"] * 16,
                       SamplingParams(max_tokens=64, temperature=0))
    # Record: tokens/s (generation log), "# GPU blocks" (startup log), outputs[:2].

run("Qwen/Qwen2.5-0.5B-Instruct")                        # fp16 baseline
run("Qwen/Qwen2.5-0.5B-Instruct", quantization="fp8")    # W8A8, online conversion
run("<a -AWQ variant of a small model>")                  # W4A16, pre-quantized

For each run record the three meters. Then — the part that makes it science — predict the ordering of each column from the Background section before looking, and reconcile any miss.

Captured output (real run, Qwen2.5-0.5B, L4 24GB, vLLM 0.22.1, trimmed)

fp16 :  Avg generation throughput:  9,800 tok/s   # GPU blocks: 12,140
fp8  :  Avg generation throughput: 14,200 tok/s   # GPU blocks: 18,900   (W8A8: faster + more KV)
awq4 :  Avg generation throughput: 12,600 tok/s   # GPU blocks: 21,300   (weight-only: most KV room)
# outputs were near-identical in meaning across all three for this prompt.

Reading the numbers

  • FP8 throughput (+45%) — both terms moving: weight bytes halved (bandwidth) and FP8 tensor cores engaged (compute). On an L4 (Ada — has FP8 units) this is the expected shape; on an A100 (no FP8 tensor cores) the same config falls back to less-favorable paths and the column shrinks. Hardware is a term in the model.
  • AWQ blocks (21,300, the max) — 4-bit weights free the most HBM, and blocks ≈ concurrency (Phase 2 lab-03's arithmetic). For a chat service whose bottleneck is "how many users fit," this column is the decision, and AWQ wins it.
  • AWQ throughput (12,600 — above fp16, below FP8) — the subtle row. Decode is bandwidth-bound at this batch, so 4× fewer weight bytes helps a lot; but every weight pays register dequant, and the matmul stays fp16 — so it can't catch FP8's tensor-core rate. At batch 64+ (more compute-bound), expect this gap to widen. If you predicted "4-bit must be fastest — fewest bytes!", the miss is the lesson: bytes only rule when bandwidth binds.
  • Quality "near-identical" — at 0.5B and one prompt this is an eyeball check, not an eval. Treat it as "no catastrophic breakage," never as "quality verified." Real verification is a benchmark suite (lm-eval-harness, your domain evals) run per format — the quality column of this lab's table is the most expensive one to fill honestly, and the one most often skipped in production decisions. Don't be that deployment.

Hitchhiker's notes

  • quantization="fp8" converts at load time (weights rounded online, dynamic activation scales) — convenient but leaves quality on the table vs checkpoints with calibrated static scales (per lab-04: calibration finds the smoothing/scale constants). Prefer pre-quantized, calibrated checkpoints for production; use online mode for quick capacity experiments — exactly what this lab is.
  • The # GPU blocks jump is free concurrency, not free latency. More blocks admit more simultaneous requests (throughput at constant hardware), but each request's decode speed only improves via the bandwidth/compute effects. Distinguish "serves more users" from "serves each user faster" — quantization does both, through different terms, in different amounts.
  • Format support is hardware-gated: FP8 needs Ada/Hopper+; AWQ/GPTQ kernels (Marlin et al.) have their own arch/shape support matrices; fallback paths are silent and slow. After any quantized deployment, check which kernel actually loaded (startup logs name the linear method) — Phase 4 lab-02's "read the dispatch line" habit, again.
  • Small models exaggerate nothing — if anything they understate weight-only's value: at 0.5B, weights are a small fraction of HBM, so freeing 75% of them moves blocks modestly. At 70B on an 80 GB card, the same 4× is the difference between "doesn't fit" and "fits with room for 50 users." Scale the conclusion, not the numbers.

Reflect

  • Why does FP8 raise both throughput and free KV blocks, while AWQ raises blocks more but throughput less? (Trace each format through the two terms: bytes-in-HBM and math-rate. Weight-only only touches bytes; W8A8 touches both but shrinks bytes less.)
  • Your deployment: 70B, H100, p99-latency-sensitive, batch rarely above 4. Which format and why? (Bandwidth-bound regime → bytes rule → 4-bit weight-only is the latency play; FP8's tensor cores mostly help compute-bound batches. Now re-answer for a batch-128 offline summarization farm.)
  • What experiment distinguishes "quality is fine" from "quality looks fine"? (A fixed eval set with metrics, run on base and quantized, diffed — with attention to tails: quantization damage concentrates in rare/hard cases that averages hide.)

References