Lab 06-03 — Int4: Group-Wise Scales and Nibble Packing [CPU-OK]
Lab-01's int8 was the gentle slope: 255 levels, per-channel scales, <1% error, everyone
goes home happy. Int4 is the cliff: 15 usable levels. At that resolution, the
per-channel scale that saved you in lab-01 is no longer fine enough — one loud weight
anywhere in a row crushes the whole row into 2–3 effective levels. Survival requires two
new mechanisms, and you'll build both: group-wise scales (one scale per 128-ish
consecutive weights, not per row — the group_size in every GPTQ/AWQ model card you've
ever skimmed) and nibble packing (two int4 values per byte — the actual bit-level
layout of the checkpoint files). When the tests pass, you can read a 4-bit quantized
safetensors file's shapes and know exactly why every tensor is the size it is.
Contents
- Why this lab exists
- Background: why 15 levels changes the game
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Model cards say things like W4A16, group_size=128, sym and most engineers parse it as
incantation. After this lab it parses as engineering: 4-bit symmetric weights (your
[-7, 7] clip), one fp16 scale per 128 consecutive weights (your scales tensor), and a
storage cost you can compute in your head (0.5 bytes/weight + 2/128 bytes of scale ≈
0.516 bytes/weight ≈ 7.8× smaller than fp32, ~3.9× smaller than fp16). That
arithmetic is the literal reason a 70B model fits on a single 48 GB card — and per
Phase 0 lab-04's roofline, it's also a ~4× decode speedup ceiling, since decode is
bandwidth-bound and you just shrank the bytes.
The packing half matters for a different reason: it's your first contact with the gap between logical values and physical layout, which is most of what kernel-side quantization code does. The CUDA kernels that consume these weights (AWQ/GPTQ/Marlin — Phase 7 adjacent) spend most of their cleverness unpacking nibbles into tensor-core- friendly tiles fast enough to stay bandwidth-bound. You'll write the readable version; knowing it makes the unreadable versions readable.
Background: why 15 levels changes the game
Quantization error per weight is roughly scale / √12 (uniform rounding error), and
scale = max|covered weights| / 7 for int4. The denominator 7 (vs 127 for int8) means
the scale is ~18× coarser at the same coverage — so the only lever left is shrinking
the coverage: make each scale cover fewer weights, so max|covered| tracks the local
magnitude instead of the row-wide loudest value. That's all "group_size" is: the coverage
window. The trade is pure and quantifiable:
- group 128 → 1 fp16 scale per 128 weights: 1.6% storage overhead, decent locality.
- group 16 → 8× more scales (3.1% per-weight overhead → 12.5% of the weight bits!),
better locality, lower error —
test_smaller_groups_capture_local_magnitudemeasures the win,test_group_scale_overhead_is_the_tradeoffmeasures the bill.
Industry settled on 128 because real weight matrices' magnitude structure varies at roughly that granularity — empiricism, not theory. (GPTQ and AWQ both add a second idea on which values to round which way — error-compensating rounding and activation-aware scale selection respectively — but the storage format you're building is what they both emit.)
Files
starter.py—quantize_grouped,dequantize_grouped,pack_int4,unpack_int4,memory_bytes_grouped. Your work.solution.py— reference.test_lab.py— exact pack/unpack round-trip, bounded int4 error, the fine-vs-coarse-group comparison, the ~8× memory ratio, and the scale-overhead bill.
Run
LAB_IMPL=starter pytest phase-06-quantization/labs/lab-03-int4-groups-and-packing -q
pytest phase-06-quantization/labs/lab-03-int4-groups-and-packing -q # reference
What the tests prove
| Test | What it pins |
|---|---|
test_pack_unpack_roundtrip_is_exact | The bit gymnastics (offset-by-8, low/high nibble) are lossless — packing is layout, never approximation. Note the shape check: (16, 64) → (16, 32) uint8, exactly the shape you'll see in a real checkpoint |
test_grouped_roundtrip_error_bounded | Int4 with group 32 lands < 15% relative error on Gaussian weights — coarse, but bounded and predictable; values clip at ±7 as designed |
test_smaller_groups_capture_local_magnitude | On weights with banded magnitude (the realistic case), group 16 beats group 256 — the entire reason groups exist |
test_memory_about_8x_smaller_than_fp32 | The model-card arithmetic: > 7× vs fp32 with group-128 fp16 scales |
test_group_scale_overhead_is_the_tradeoff | Group 16 stores exactly 8× the scales of group 128 — the other side of the ledger |
Hitchhiker's notes
- Why ±7 and not ±8? Int4 spans [−8, 7]; symmetric quantization sacrifices −8 to
keep the grid symmetric around zero (so
q = 0 ⇔ w ≈ 0and negation is exact). Some formats keep −8 (asymmetric, with zero-points); the model card'ssymflag is exactly this choice. You implementedsym; the asymmetric variant adds a per-groupzero_point— a 10-line extension worth doing once (see Going further). - Packing order is a convention, and conventions bite. You packed
even-index→low-nibble; AWQ's layout interleaves differently (an order chosen so the
GPU kernel's unpack lands values where tensor cores want them). When a checkpoint
loads garbage through the wrong kernel, mismatched nibble order is a classic cause —
the data is fine, the convention differs. This is why vLLM's loader maps
quant_methodstrings to specific weight-layout handlers (upstream/vllm/ model_executor/layers/quantization/). - Where dequant actually happens: not in your tidy
dequantize_grouped— that materializes the fp matrix and forfeits the bandwidth win. Real kernels (Marlin being the canonical one) unpack + scale inside the GEMM, in registers, fused with the multiply. Weight-only quant's speedup story is entirely "fewer HBM bytes," which only survives if the unpacking never round-trips through memory. Same lesson as Phase 2 lab-06's "your gather is a memcpy the GPU never does." - KV-cache quantization uses the same per-group machinery (Phase 0 lab-02's
dtype_byteslever): fp8 KV with per-head or per-token scales. Once you've built grouped quant for weights, the KV variant is the same code pointed at a different tensor — which is roughly how upstream implements it too.
Going further
- Add asymmetric quantization (
zero_pointper group) and measure error on a shifted distribution (W + 0.3): symmetric wastes half its range on values that never occur; asymmetric recovers it. Then check which one GGUF's common formats use (both exist in the zoo). - Implement the fused path:
quant_matmul(x, packed, scales)that unpacks one group at a time and accumulates, never materializing the full W. Same answer, different peak memory — measure both. - Plot relative error vs group_size ∈ {8, 16, 32, 64, 128, 256, 1024} for banded weights, with a second line for storage overhead. The crossing region is why 128 won. Then read an actual AWQ config.json and find every number you now understand.
References
- Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022) — error-compensating rounding atop this storage format: https://arxiv.org/abs/2210.17323
- Lin et al., AWQ: Activation-aware Weight Quantization (2023) — protecting the ~1% salient weights via scale selection: https://arxiv.org/abs/2306.00978
- Frantar et al., Marlin: a fast 4-bit kernel — what consuming this format at speed takes: https://github.com/IST-DASLab/marlin
upstream/vllm/model_executor/layers/quantization/— the format zoo's loaders; findgroup_sizeinawq.pyandgptq.py.- Lab-01 — the int8 baseline this lab degrades gracefully from; lab-04 — what happens when activations join the party.