Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 00-02 — KV-Cache Memory Calculator [CPU-OK]

Lab-01 ended with a cliffhanger: the KV cache converts quadratic compute into linear memory. This lab computes exactly how much memory — and the answer is the most important number in LLM serving economics: how many concurrent users fit on one GPU. You'll write the three-line formula, apply it to Llama-3-8B, and arrive at the genuinely shocking result that a 24 GiB GPU running an 8B model has room for only ~32 full-length conversations. Every dollar of inference cost, every "maximum concurrency" log line, and the entire existence of PagedAttention trace back to the arithmetic you're about to own.

Contents


Why this lab exists

This is back-of-envelope as a professional skill. A staff inference engineer gets asked, weekly, some variant of: "can we serve model X to Y users on hardware Z?" The wrong answer costs a fleet; the right answer is three multiplications you can do in a meeting. This lab installs the formula so deeply that you'll never again look at a GPU spec sheet without mentally dividing its HBM by a KV footprint.

It's also the Rosetta stone for the rest of the course. When Phase 2 lab-03's real engine prints Maximum concurrency for 2,048 tokens per request: 68.65x, that's this lab's max_concurrent_seqs evaluated against measured free HBM. When Phase 6 sells you FP8 KV, when model cards advertise GQA, when Phase 10 shards KV across GPUs — every one of those is an attack on a term of the formula you write here. Learn the formula, and the whole optimization landscape organizes itself into "which factor does this shrink?"

Background: where the bytes go

Per token, per layer, attention stores one K vector and one V vector, each of num_kv_heads × head_dim elements. Multiply it out:

kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes
                     ▲       ▲             ▲            ▲           ▲
                  K and V  every layer   the heads   per head   fp16 = 2

Two things to notice before you code:

  • num_kv_heads, not num_heads. Modern models use grouped-query attention (GQA): many query heads share each KV head precisely because someone did this lab's math and realized KV memory, not model quality, capped serving capacity. Llama-3-8B has 32 query heads but only 8 KV heads — a 4× KV saving designed into the architecture. (MQA — one KV head — is the extreme; MLA in DeepSeek compresses further still. Architecture evolution is visible in this one parameter.)
  • It's per token, forever. The cache for token 0 lives until the request finishes. Length × concurrency × per-token bytes must fit in what's left after weights. There is no amortization, no compression by default — just bytes, held for the lifetime of the conversation.

Files

  • starter.py — implement kv_bytes_per_token, kv_bytes_per_seq, max_concurrent_seqs. Your work.
  • solution.py — reference.
  • test_lab.py — pins exact numbers for Llama-3-8B, the FP8 and GQA factor effects, and the no-room edge case.

Run

LAB_IMPL=starter pytest phase-00-foundations/labs/lab-02-kv-memory-calculator -q
pytest phase-00-foundations/labs/lab-02-kv-memory-calculator -q   # reference (default)

The formulas

kv_bytes_per_token = 2 (K and V) × num_layers × num_kv_heads × head_dim × dtype_bytes
kv_bytes_per_seq   = kv_bytes_per_token × seq_len
max_concurrent     = (gpu_bytes − weight_bytes) // kv_bytes_per_seq      (0 if no room)

Integer division on purpose: you cannot serve 0.7 of a conversation. (The real engine has the same floor, expressed in blocks — Phase 2.)

The headline result, walked through

Llama-3-8B in fp16: num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2.

per token : 2 × 32 × 8 × 128 × 2 = 131,072 B = 128 KiB        (!)
per 2,048-token sequence: 128 KiB × 2,048 = 256 MiB
24 GiB GPU − ~16 GiB weights = 8 GiB free
concurrency: 8 GiB / 256 MiB = 32 sequences

Sit with each line:

  • 128 KiB per token. A token is ~4 characters of text. Its cache costs as much as a small image. A 100-word answer: ~17 MB. This is why "just keep the conversation in memory" is a capacity strategy, not a triviality.
  • 256 MiB per max-length sequence — 1.6% of the entire weights per conversation, for context alone.
  • 32 users. An 8-billion-parameter model on a serious GPU and the ceiling is thirty two — and that's assuming perfect packing with zero waste. Now recall (or preview) Phase 2 lab-02: pre-vLLM engines wasted 60–80% of KV memory on fragmentation, turning 32 into ~6–12. Memory, not compute, is the binding constraint of LLM serving — the single most counterintuitive fact in the field, and you just derived it.
  • And the punchline that launches Phase 2: since the constraint is memory, the highest- leverage engineering target is making every byte of that 8 GiB hold useful KV. That is, verbatim, PagedAttention's mission statement.

What the tests prove

TestWhat it pins
test_llama3_8b_per_tokenThe exact 131,072 — get the factors and you get the field's most-quoted number
test_llama3_8b_concurrencyThe headline 32, end to end through all three functions
test_fp8_kv_doubles_concurrencydtype_bytes is a linear lever: halve it, double the users. (Phase 6's FP8-KV feature, justified in one assert)
test_gqa_saves_vs_mha8 vs 32 KV heads = exactly 4× — why GQA exists, as arithmetic
test_no_room_returns_zeroWeights ≥ HBM → 0, gracefully. Capacity functions must not return −3 users

Hitchhiker's notes

  • The formula is the optimization roadmap. Every KV-memory technique in production attacks one factor: FP8/INT4 KV quantization → dtype_bytes; GQA/MQA/MLA → num_kv_heads × head_dim; sliding-window attention (Mistral) and hybrid SSM layers (Phase 14) → which layers store KV at all; tensor parallelism (Phase 10) → divides the whole thing across GPUs; prefix caching (Phases 2–3) → shares kv_bytes_per_seq across requests. When you read a new inference paper, your first question is now: which factor?
  • Weights are the entry fee; KV is the rent. Weights are fixed and amortize over every request; KV scales with traffic. This is why bigger GPUs disproportionately help serving (more leftover after the fixed cost) and why a 70B model on an 80 GiB GPU (~140 GiB fp16 weights — doesn't even fit without quantization or sharding) is a different kind of problem than 8B on 24 GiB.
  • seq_len is the denominator you control. The formula uses worst-case length — exactly what the engine's startup "maximum concurrency" line assumes, and exactly why Phase 2 lab-03's reflection scolds the max_model_len=32768 config for a 4k workload. Capacity planning with the p99 length instead of the max is the cheapest 8× you'll ever find.
  • What the simple formula ignores (and where it bends in practice): activation scratch memory (vLLM measures this at startup by profiling — Phase 2 lab-03), block-granularity rounding (block_size − 1 tail waste per sequence — Phase 2), CUDA context overhead, and the fact that real workloads have a length distribution, so effective concurrency is higher than the max-length floor. The formula is your lower bound and your sanity check, not your final answer — which is also why vLLM computes blocks from measured free memory rather than trusting arithmetic.

Going further

  • Build a table for models you care about (3B/8B/70B; fp16/fp8 KV; 2k/8k/128k context) on A100-80G. Notice where concurrency drops below 1 — congratulations, you've discovered why 128k-context serving needs either massive HBM, KV offload, or architecture tricks, and why long-context pricing is what it is.
  • Invert the formula: given a target of 200 concurrent users at 4k context on 8B/fp16, how much free HBM do you need? (200 × 4096 × 128 KiB = 100 GiB → multi-GPU territory → Phase 10's tensor parallelism divides per-GPU KV by the shard count.)
  • Add block_size rounding from Phase 2 (ceil(seq_len / block_size) blocks per sequence) and quantify how little paging's tail waste costs vs the 60–80% it saves — reproducing Phase 2 lab-02's conclusion from the memory side.

References