Lab 00-02 — KV-Cache Memory Calculator [CPU-OK]
Lab-01 ended with a cliffhanger: the KV cache converts quadratic compute into linear memory. This lab computes exactly how much memory — and the answer is the most important number in LLM serving economics: how many concurrent users fit on one GPU. You'll write the three-line formula, apply it to Llama-3-8B, and arrive at the genuinely shocking result that a 24 GiB GPU running an 8B model has room for only ~32 full-length conversations. Every dollar of inference cost, every "maximum concurrency" log line, and the entire existence of PagedAttention trace back to the arithmetic you're about to own.
Contents
- Why this lab exists
- Background: where the bytes go
- Files
- Run
- The formulas
- The headline result, walked through
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This is back-of-envelope as a professional skill. A staff inference engineer gets asked, weekly, some variant of: "can we serve model X to Y users on hardware Z?" The wrong answer costs a fleet; the right answer is three multiplications you can do in a meeting. This lab installs the formula so deeply that you'll never again look at a GPU spec sheet without mentally dividing its HBM by a KV footprint.
It's also the Rosetta stone for the rest of the course. When Phase 2 lab-03's real engine
prints Maximum concurrency for 2,048 tokens per request: 68.65x, that's this lab's
max_concurrent_seqs evaluated against measured free HBM. When Phase 6 sells you FP8 KV,
when model cards advertise GQA, when Phase 10 shards KV across GPUs — every one of those is
an attack on a term of the formula you write here. Learn the formula, and the whole
optimization landscape organizes itself into "which factor does this shrink?"
Background: where the bytes go
Per token, per layer, attention stores one K vector and one V vector, each of
num_kv_heads × head_dim elements. Multiply it out:
kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes
▲ ▲ ▲ ▲ ▲
K and V every layer the heads per head fp16 = 2
Two things to notice before you code:
num_kv_heads, notnum_heads. Modern models use grouped-query attention (GQA): many query heads share each KV head precisely because someone did this lab's math and realized KV memory, not model quality, capped serving capacity. Llama-3-8B has 32 query heads but only 8 KV heads — a 4× KV saving designed into the architecture. (MQA — one KV head — is the extreme; MLA in DeepSeek compresses further still. Architecture evolution is visible in this one parameter.)- It's per token, forever. The cache for token 0 lives until the request finishes. Length × concurrency × per-token bytes must fit in what's left after weights. There is no amortization, no compression by default — just bytes, held for the lifetime of the conversation.
Files
starter.py— implementkv_bytes_per_token,kv_bytes_per_seq,max_concurrent_seqs. Your work.solution.py— reference.test_lab.py— pins exact numbers for Llama-3-8B, the FP8 and GQA factor effects, and the no-room edge case.
Run
LAB_IMPL=starter pytest phase-00-foundations/labs/lab-02-kv-memory-calculator -q
pytest phase-00-foundations/labs/lab-02-kv-memory-calculator -q # reference (default)
The formulas
kv_bytes_per_token = 2 (K and V) × num_layers × num_kv_heads × head_dim × dtype_bytes
kv_bytes_per_seq = kv_bytes_per_token × seq_len
max_concurrent = (gpu_bytes − weight_bytes) // kv_bytes_per_seq (0 if no room)
Integer division on purpose: you cannot serve 0.7 of a conversation. (The real engine has the same floor, expressed in blocks — Phase 2.)
The headline result, walked through
Llama-3-8B in fp16: num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2.
per token : 2 × 32 × 8 × 128 × 2 = 131,072 B = 128 KiB (!)
per 2,048-token sequence: 128 KiB × 2,048 = 256 MiB
24 GiB GPU − ~16 GiB weights = 8 GiB free
concurrency: 8 GiB / 256 MiB = 32 sequences
Sit with each line:
- 128 KiB per token. A token is ~4 characters of text. Its cache costs as much as a small image. A 100-word answer: ~17 MB. This is why "just keep the conversation in memory" is a capacity strategy, not a triviality.
- 256 MiB per max-length sequence — 1.6% of the entire weights per conversation, for context alone.
- 32 users. An 8-billion-parameter model on a serious GPU and the ceiling is thirty two — and that's assuming perfect packing with zero waste. Now recall (or preview) Phase 2 lab-02: pre-vLLM engines wasted 60–80% of KV memory on fragmentation, turning 32 into ~6–12. Memory, not compute, is the binding constraint of LLM serving — the single most counterintuitive fact in the field, and you just derived it.
- And the punchline that launches Phase 2: since the constraint is memory, the highest- leverage engineering target is making every byte of that 8 GiB hold useful KV. That is, verbatim, PagedAttention's mission statement.
What the tests prove
| Test | What it pins |
|---|---|
test_llama3_8b_per_token | The exact 131,072 — get the factors and you get the field's most-quoted number |
test_llama3_8b_concurrency | The headline 32, end to end through all three functions |
test_fp8_kv_doubles_concurrency | dtype_bytes is a linear lever: halve it, double the users. (Phase 6's FP8-KV feature, justified in one assert) |
test_gqa_saves_vs_mha | 8 vs 32 KV heads = exactly 4× — why GQA exists, as arithmetic |
test_no_room_returns_zero | Weights ≥ HBM → 0, gracefully. Capacity functions must not return −3 users |
Hitchhiker's notes
- The formula is the optimization roadmap. Every KV-memory technique in production
attacks one factor: FP8/INT4 KV quantization →
dtype_bytes; GQA/MQA/MLA →num_kv_heads × head_dim; sliding-window attention (Mistral) and hybrid SSM layers (Phase 14) → which layers store KV at all; tensor parallelism (Phase 10) → divides the whole thing across GPUs; prefix caching (Phases 2–3) → shareskv_bytes_per_seqacross requests. When you read a new inference paper, your first question is now: which factor? - Weights are the entry fee; KV is the rent. Weights are fixed and amortize over every request; KV scales with traffic. This is why bigger GPUs disproportionately help serving (more leftover after the fixed cost) and why a 70B model on an 80 GiB GPU (~140 GiB fp16 weights — doesn't even fit without quantization or sharding) is a different kind of problem than 8B on 24 GiB.
seq_lenis the denominator you control. The formula uses worst-case length — exactly what the engine's startup "maximum concurrency" line assumes, and exactly why Phase 2 lab-03's reflection scolds themax_model_len=32768config for a 4k workload. Capacity planning with the p99 length instead of the max is the cheapest 8× you'll ever find.- What the simple formula ignores (and where it bends in practice): activation
scratch memory (vLLM measures this at startup by profiling — Phase 2 lab-03),
block-granularity rounding (
block_size − 1tail waste per sequence — Phase 2), CUDA context overhead, and the fact that real workloads have a length distribution, so effective concurrency is higher than the max-length floor. The formula is your lower bound and your sanity check, not your final answer — which is also why vLLM computes blocks from measured free memory rather than trusting arithmetic.
Going further
- Build a table for models you care about (3B/8B/70B; fp16/fp8 KV; 2k/8k/128k context) on A100-80G. Notice where concurrency drops below 1 — congratulations, you've discovered why 128k-context serving needs either massive HBM, KV offload, or architecture tricks, and why long-context pricing is what it is.
- Invert the formula: given a target of 200 concurrent users at 4k context on 8B/fp16, how much free HBM do you need? (200 × 4096 × 128 KiB = 100 GiB → multi-GPU territory → Phase 10's tensor parallelism divides per-GPU KV by the shard count.)
- Add
block_sizerounding from Phase 2 (ceil(seq_len / block_size)blocks per sequence) and quantify how little paging's tail waste costs vs the 60–80% it saves — reproducing Phase 2 lab-02's conclusion from the memory side.
References
- kipply, Transformer Inference Arithmetic — the classic source for exactly this per-token math: https://kipp.ly/transformer-inference-arithmetic/
- Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models (2023) — why
num_kv_headsshrank industry-wide: https://arxiv.org/abs/2305.13245 - Pope et al., Efficiently Scaling Transformer Inference (2022) — KV memory as the scaling bottleneck, formalized: https://arxiv.org/abs/2211.05102
- Kwon et al., PagedAttention (SOSP 2023) — what happens next to these bytes: https://arxiv.org/abs/2309.06180
upstream/vllm/v1/core/kv_cache_utils.py::get_kv_cache_configs— your formula, running at every engine startup (Phase 2 lab-03 reads its output).