Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 06-03 — Int4: Group-Wise Scales and Nibble Packing [CPU-OK]

Lab-01's int8 was the gentle slope: 255 levels, per-channel scales, <1% error, everyone goes home happy. Int4 is the cliff: 15 usable levels. At that resolution, the per-channel scale that saved you in lab-01 is no longer fine enough — one loud weight anywhere in a row crushes the whole row into 2–3 effective levels. Survival requires two new mechanisms, and you'll build both: group-wise scales (one scale per 128-ish consecutive weights, not per row — the group_size in every GPTQ/AWQ model card you've ever skimmed) and nibble packing (two int4 values per byte — the actual bit-level layout of the checkpoint files). When the tests pass, you can read a 4-bit quantized safetensors file's shapes and know exactly why every tensor is the size it is.

Contents


Why this lab exists

Model cards say things like W4A16, group_size=128, sym and most engineers parse it as incantation. After this lab it parses as engineering: 4-bit symmetric weights (your [-7, 7] clip), one fp16 scale per 128 consecutive weights (your scales tensor), and a storage cost you can compute in your head (0.5 bytes/weight + 2/128 bytes of scale ≈ 0.516 bytes/weight ≈ 7.8× smaller than fp32, ~3.9× smaller than fp16). That arithmetic is the literal reason a 70B model fits on a single 48 GB card — and per Phase 0 lab-04's roofline, it's also a ~4× decode speedup ceiling, since decode is bandwidth-bound and you just shrank the bytes.

The packing half matters for a different reason: it's your first contact with the gap between logical values and physical layout, which is most of what kernel-side quantization code does. The CUDA kernels that consume these weights (AWQ/GPTQ/Marlin — Phase 7 adjacent) spend most of their cleverness unpacking nibbles into tensor-core- friendly tiles fast enough to stay bandwidth-bound. You'll write the readable version; knowing it makes the unreadable versions readable.

Background: why 15 levels changes the game

Quantization error per weight is roughly scale / √12 (uniform rounding error), and scale = max|covered weights| / 7 for int4. The denominator 7 (vs 127 for int8) means the scale is ~18× coarser at the same coverage — so the only lever left is shrinking the coverage: make each scale cover fewer weights, so max|covered| tracks the local magnitude instead of the row-wide loudest value. That's all "group_size" is: the coverage window. The trade is pure and quantifiable:

  • group 128 → 1 fp16 scale per 128 weights: 1.6% storage overhead, decent locality.
  • group 16 → 8× more scales (3.1% per-weight overhead → 12.5% of the weight bits!), better locality, lower error — test_smaller_groups_capture_local_magnitude measures the win, test_group_scale_overhead_is_the_tradeoff measures the bill.

Industry settled on 128 because real weight matrices' magnitude structure varies at roughly that granularity — empiricism, not theory. (GPTQ and AWQ both add a second idea on which values to round which way — error-compensating rounding and activation-aware scale selection respectively — but the storage format you're building is what they both emit.)

Files

  • starter.pyquantize_grouped, dequantize_grouped, pack_int4, unpack_int4, memory_bytes_grouped. Your work.
  • solution.py — reference.
  • test_lab.py — exact pack/unpack round-trip, bounded int4 error, the fine-vs-coarse-group comparison, the ~8× memory ratio, and the scale-overhead bill.

Run

LAB_IMPL=starter pytest phase-06-quantization/labs/lab-03-int4-groups-and-packing -q
pytest phase-06-quantization/labs/lab-03-int4-groups-and-packing -q   # reference

What the tests prove

TestWhat it pins
test_pack_unpack_roundtrip_is_exactThe bit gymnastics (offset-by-8, low/high nibble) are lossless — packing is layout, never approximation. Note the shape check: (16, 64) → (16, 32) uint8, exactly the shape you'll see in a real checkpoint
test_grouped_roundtrip_error_boundedInt4 with group 32 lands < 15% relative error on Gaussian weights — coarse, but bounded and predictable; values clip at ±7 as designed
test_smaller_groups_capture_local_magnitudeOn weights with banded magnitude (the realistic case), group 16 beats group 256 — the entire reason groups exist
test_memory_about_8x_smaller_than_fp32The model-card arithmetic: > 7× vs fp32 with group-128 fp16 scales
test_group_scale_overhead_is_the_tradeoffGroup 16 stores exactly 8× the scales of group 128 — the other side of the ledger

Hitchhiker's notes

  • Why ±7 and not ±8? Int4 spans [−8, 7]; symmetric quantization sacrifices −8 to keep the grid symmetric around zero (so q = 0 ⇔ w ≈ 0 and negation is exact). Some formats keep −8 (asymmetric, with zero-points); the model card's sym flag is exactly this choice. You implemented sym; the asymmetric variant adds a per-group zero_point — a 10-line extension worth doing once (see Going further).
  • Packing order is a convention, and conventions bite. You packed even-index→low-nibble; AWQ's layout interleaves differently (an order chosen so the GPU kernel's unpack lands values where tensor cores want them). When a checkpoint loads garbage through the wrong kernel, mismatched nibble order is a classic cause — the data is fine, the convention differs. This is why vLLM's loader maps quant_method strings to specific weight-layout handlers (upstream/vllm/ model_executor/layers/quantization/).
  • Where dequant actually happens: not in your tidy dequantize_grouped — that materializes the fp matrix and forfeits the bandwidth win. Real kernels (Marlin being the canonical one) unpack + scale inside the GEMM, in registers, fused with the multiply. Weight-only quant's speedup story is entirely "fewer HBM bytes," which only survives if the unpacking never round-trips through memory. Same lesson as Phase 2 lab-06's "your gather is a memcpy the GPU never does."
  • KV-cache quantization uses the same per-group machinery (Phase 0 lab-02's dtype_bytes lever): fp8 KV with per-head or per-token scales. Once you've built grouped quant for weights, the KV variant is the same code pointed at a different tensor — which is roughly how upstream implements it too.

Going further

  • Add asymmetric quantization (zero_point per group) and measure error on a shifted distribution (W + 0.3): symmetric wastes half its range on values that never occur; asymmetric recovers it. Then check which one GGUF's common formats use (both exist in the zoo).
  • Implement the fused path: quant_matmul(x, packed, scales) that unpacks one group at a time and accumulates, never materializing the full W. Same answer, different peak memory — measure both.
  • Plot relative error vs group_size ∈ {8, 16, 32, 64, 128, 256, 1024} for banded weights, with a second line for storage overhead. The crossing region is why 128 won. Then read an actual AWQ config.json and find every number you now understand.

References

  • Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022) — error-compensating rounding atop this storage format: https://arxiv.org/abs/2210.17323
  • Lin et al., AWQ: Activation-aware Weight Quantization (2023) — protecting the ~1% salient weights via scale selection: https://arxiv.org/abs/2306.00978
  • Frantar et al., Marlin: a fast 4-bit kernel — what consuming this format at speed takes: https://github.com/IST-DASLab/marlin
  • upstream/vllm/model_executor/layers/quantization/ — the format zoo's loaders; find group_size in awq.py and gptq.py.
  • Lab-01 — the int8 baseline this lab degrades gracefully from; lab-04 — what happens when activations join the party.