Lab 00-04 — Prefill vs Decode: the Roofline Arithmetic [CPU-OK]
Lab-01 showed you that caching splits generation into two phases. This lab shows you the strange physics those phases live under — with six one-line functions and an A100 spec sheet. You'll compute the ridge point (the GPU's FLOP-per-byte ratio), discover that a single decoding sequence uses less than 1% of the GPU's compute, derive the hard speed limit of decode (125 tokens/s for an 8B model on an A100 — no kernel wizardry can beat it), and find the critical batch size where decode finally becomes compute-bound. These four numbers are the worldview of performance engineering; everything in Phase 18 is this lab with profilers attached.
Contents
- Why this lab exists
- Background: the roofline model in five minutes
- Files
- Run
- What to implement
- The numbers, walked through
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
There is a question that separates engineers who tune inference systems from engineers who guess at them: "is this workload compute-bound or memory-bound?" Every optimization belongs to one regime or the other — a faster matmul kernel does nothing for a bandwidth-bound decode; more batching does nothing for a compute-bound prefill — and applying a fix from the wrong regime is the most common way smart people waste a quarter. The roofline model answers the question with division. This lab makes you do the division until it's reflexive.
It also explains, from first principles, the economics you've been absorbing all phase: why batch size is the master lever of throughput (lab and Phase 1 lab-04), why continuous batching was worth inventing (Phase 3), why chunked prefill mixes the two phases in one step (Phase 3 lab-05 — piggybacking compute-hungry prefill onto bandwidth-starved decode steps), and why speculative decoding (Phase 8) can "spend FLOPs to save time" — the FLOPs were idle anyway.
Background: the roofline model in five minutes
Every computation has an arithmetic intensity: FLOPs performed per byte moved from memory. Every processor has a ridge point: peak FLOPs ÷ memory bandwidth. The roofline law is one comparison:
- intensity < ridge → memory-bound: the compute units finish early and wait for the bus. Speed = bandwidth × intensity. More FLOPs won't help; fewer bytes will.
- intensity ≥ ridge → compute-bound: the bus keeps up; you're using the silicon. Speed = peak FLOPs. Fewer bytes won't help; better kernels / more hardware will.
Our minimal model of a transformer step: it must read all weights once (n_params × dtype_bytes from HBM — weights don't fit in cache) and performs ~2 FLOPs per parameter
per token (each weight participates in one multiply-accumulate per token). So:
intensity = 2 · n_params · num_tokens / (n_params · dtype_bytes) = 2 · num_tokens / dtype_bytes
The parameter count cancels. Intensity doesn't care how big the model is — only how many tokens share one trip through the weights. That cancellation is the single most clarifying fact in inference performance: prefill (thousands of tokens per trip) and decode (one token per trip per sequence) aren't two workloads that differ in degree; they're the two opposite ends of the roofline, by construction.
(The model ignores KV/activation traffic and attention FLOPs — see Hitchhiker's notes for when that bites. As a first-order tool it's startlingly accurate.)
Files
starter.py— six functions, each ~one line, each a load-bearing concept. Your work.solution.py— reference.test_lab.py— A100 numbers, the cancellation, both regimes, the crossover, and the decode speed limit.
Run
LAB_IMPL=starter pytest phase-00-foundations/labs/lab-04-prefill-vs-decode -q
pytest phase-00-foundations/labs/lab-04-prefill-vs-decode -q # reference (default)
The numbers, walked through
A100-80GB SXM: 312 TFLOPs fp16, 2.0 TB/s HBM. 8B model, fp16 (16 GB of weights).
ridge : 312e12 / 2.0e12 = 156 FLOPs/byte
decode, batch 1 : intensity = 2·1/2 = 1 → 1 ≪ 156: bandwidth-bound, using 1/156 ≈ 0.6% of compute
prefill, 2048 tokens : intensity = 2048 → ≫ 156: compute-bound
crossover : 156 tokens/step → the "critical batch size"
speed limit: 2.0e12 / 16e9 = 125 steps/s → 125 tok/s per sequence, max, ever
Read them like an engineer:
- 0.6% compute utilization for single-stream decode. The other 99.4% of a $15k GPU is structurally idle — not because of bad kernels, but because one token per weight-trip is all the arithmetic the workload offers. This is the number that makes batching non-optional: every additional sequence in the decode batch reuses the same weight bytes, adding intensity ~1 per sequence, for free until you hit the ridge at ~156. Batch 64 → 64× the tokens/s at essentially the same step time. That free lunch is the entire economic basis of serving (and of this course's obsession with fitting more sequences in memory — lab-02 — since memory is what caps the batch).
- 125 tokens/s is a physics ceiling, not an engineering one: a decode step cannot
complete faster than the weights can stream from HBM. Measure any well-tuned 8B/fp16
deployment and single-stream decode sits at 60–80% of this bound (the rest: KV reads,
kernel overheads). When someone promises 500 tok/s single-stream on this hardware,
they're describing quantization (fewer bytes — note
dtype_bytesin your formula), speculation (Phase 8), or fiction. - Prefill at 2048 is compute-bound — which is why TTFT responds to better kernels and
FlashAttention (Phase 4), while ITL responds to memory bandwidth and quantization. Two
metrics, two regimes, two completely disjoint optimization menus. Now you know why
Phase 3's chunked prefill mixes the phases in one batch: decode steps have idle FLOPs;
prefill chunks are pure FLOPs; together they fill the roofline from both sides
(Sarathi's "piggybacking", which you measured as
[33, 33, …]in Phase 3 lab-05). - The crossover at 156 is worth memorizing as a shape, not a number: it moves with hardware (H100: ~295 fp16; consumer cards: lower) and dtype. "Decode needs ~ridge-many tokens per step to saturate compute" is the portable version.
What the tests prove
| Test | What it pins |
|---|---|
test_a100_ridge_point | 156.0 — the one hardware constant of this lab |
test_intensity_is_just_tokens_over_dtype | The cancellation: 8B and 70B give identical intensity. If this surprises you, reread the Background — it's the lab's central fact |
test_single_decode_is_hopelessly_bandwidth_bound | Intensity 1 vs ridge 156 |
test_prefill_is_compute_bound | Intensity 2048 vs ridge 156 — same model, opposite regime |
test_critical_batch_size_is_the_ridge | 155 tokens: memory-bound; 156: compute-bound. The crossover is exactly the ridge, because intensity (fp16) = tokens |
test_decode_speed_limit_8b_fp16 | 125 tok/s — bandwidth over weight bytes, the unbeatable ceiling |
test_batching_multiplies_decode_throughput_for_free | Batch 64 → 8000 tok/s from the same weight stream — the free lunch, quantified |
Hitchhiker's notes
- Where the weights-only model bends: at long contexts, KV reads become the dominant bytes (128 KiB/token from lab-02 × thousands of tokens × batch — eventually exceeding the 16 GB weight read!). That's why long-context decode gets slower per token even though "the model is the same size," why GQA/MLA exist (shrink KV bytes), and why Phase 18 extends this lab's model with a KV-traffic term. First-order tools, knowingly applied, then refined — that's the discipline.
- Why ~2 FLOPs per param per token? Each parameter sits in some matrix; processing a token multiplies it by one activation and adds into an accumulator — one FMA = 2 FLOPs. Attention adds FLOPs quadratic in sequence length on top (it's parameter-free, so it escapes this accounting); for short-to-moderate contexts the linear layers dominate and 2·N·T is a good model. The famous training version of the same estimate is 6·N·T (forward + backward); inference keeps only the 2.
- Quantization through this lens: INT4 weights = 4 GB to stream = 500 steps/s ceiling, 4× decode speedup with zero kernel cleverness — bandwidth-bound workloads reward byte-shrinking one-for-one. But the same INT4 does ~nothing for compute-bound prefill (the FLOPs still happen, often in fp16 after dequant). One optimization, two regimes, two completely different value propositions — Phase 6 lives here.
- The ridge explains CUDA graphs too (Phase 5): your 125 steps/s ceiling means a decode step takes ≥ 8 ms on this 8B model (16 GB streamed at 2 TB/s) — at that scale a few hundred microseconds of kernel-launch overhead is noise. Shrink the model to 1B and steps drop toward 1 ms; suddenly launch overhead is a first-order cost, and capturing the whole step as one CUDA graph pays for itself. The roofline tells you when overhead optimizations matter, too.
Going further
- Recompute everything for an H100 SXM (~990 TFLOPs fp16, ~3.35 TB/s): ridge ≈ 295, 8B decode ceiling ≈ 209 tok/s. Notice the ridge rose — new GPUs gain FLOPs faster than bandwidth, so decode gets relatively more memory-bound every generation. That trend line is why KV/weight compression research keeps accelerating.
- Add a
kv_bytes_per_step(batch, context_len, kv_per_token)term (from lab-02) todecode_tokens_per_secondand find the context length where KV traffic overtakes the weight read for batch 64. You've just derived the long-context wall. - Plot the roofline: log-log, intensity on x, achievable FLOPs on y, the two plateaus, and drop points for decode batch {1, 8, 64, 156, 512} and prefill {128, 2048}. This single figure is the mental map for all of Phase 18 — draw it once by hand.
References
- Williams et al., Roofline: An Insightful Visual Performance Model (CACM 2009) — the original: https://dl.acm.org/doi/10.1145/1498765.1498785
- kipply, Transformer Inference Arithmetic — this lab's model applied end-to-end, the single best blog post in the field: https://kipp.ly/transformer-inference-arithmetic/
- Pope et al., Efficiently Scaling Transformer Inference (2022) — the rigorous version, including the KV-traffic terms: https://arxiv.org/abs/2211.05102
- Chen, Dissecting Batching Effects in GPT Inference (2023) — measured curves of the batch-size free lunch: https://le.qun.ch/en/blog/2023/05/13/transformer-batching/
- NVIDIA A100/H100 datasheets — where the peak-FLOPs and bandwidth constants come from (always check whether a quoted TFLOPs number assumes sparsity; marketing does).
- Phase 18 — this lab, with
nsys/ncuattached and the simplifications removed.