Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 00-04 — Prefill vs Decode: the Roofline Arithmetic [CPU-OK]

Lab-01 showed you that caching splits generation into two phases. This lab shows you the strange physics those phases live under — with six one-line functions and an A100 spec sheet. You'll compute the ridge point (the GPU's FLOP-per-byte ratio), discover that a single decoding sequence uses less than 1% of the GPU's compute, derive the hard speed limit of decode (125 tokens/s for an 8B model on an A100 — no kernel wizardry can beat it), and find the critical batch size where decode finally becomes compute-bound. These four numbers are the worldview of performance engineering; everything in Phase 18 is this lab with profilers attached.

Contents


Why this lab exists

There is a question that separates engineers who tune inference systems from engineers who guess at them: "is this workload compute-bound or memory-bound?" Every optimization belongs to one regime or the other — a faster matmul kernel does nothing for a bandwidth-bound decode; more batching does nothing for a compute-bound prefill — and applying a fix from the wrong regime is the most common way smart people waste a quarter. The roofline model answers the question with division. This lab makes you do the division until it's reflexive.

It also explains, from first principles, the economics you've been absorbing all phase: why batch size is the master lever of throughput (lab and Phase 1 lab-04), why continuous batching was worth inventing (Phase 3), why chunked prefill mixes the two phases in one step (Phase 3 lab-05 — piggybacking compute-hungry prefill onto bandwidth-starved decode steps), and why speculative decoding (Phase 8) can "spend FLOPs to save time" — the FLOPs were idle anyway.

Background: the roofline model in five minutes

Every computation has an arithmetic intensity: FLOPs performed per byte moved from memory. Every processor has a ridge point: peak FLOPs ÷ memory bandwidth. The roofline law is one comparison:

  • intensity < ridge → memory-bound: the compute units finish early and wait for the bus. Speed = bandwidth × intensity. More FLOPs won't help; fewer bytes will.
  • intensity ≥ ridge → compute-bound: the bus keeps up; you're using the silicon. Speed = peak FLOPs. Fewer bytes won't help; better kernels / more hardware will.

Our minimal model of a transformer step: it must read all weights once (n_params × dtype_bytes from HBM — weights don't fit in cache) and performs ~2 FLOPs per parameter per token (each weight participates in one multiply-accumulate per token). So:

intensity = 2 · n_params · num_tokens / (n_params · dtype_bytes) = 2 · num_tokens / dtype_bytes

The parameter count cancels. Intensity doesn't care how big the model is — only how many tokens share one trip through the weights. That cancellation is the single most clarifying fact in inference performance: prefill (thousands of tokens per trip) and decode (one token per trip per sequence) aren't two workloads that differ in degree; they're the two opposite ends of the roofline, by construction.

(The model ignores KV/activation traffic and attention FLOPs — see Hitchhiker's notes for when that bites. As a first-order tool it's startlingly accurate.)

Files

  • starter.py — six functions, each ~one line, each a load-bearing concept. Your work.
  • solution.py — reference.
  • test_lab.py — A100 numbers, the cancellation, both regimes, the crossover, and the decode speed limit.

Run

LAB_IMPL=starter pytest phase-00-foundations/labs/lab-04-prefill-vs-decode -q
pytest phase-00-foundations/labs/lab-04-prefill-vs-decode -q   # reference (default)

The numbers, walked through

A100-80GB SXM: 312 TFLOPs fp16, 2.0 TB/s HBM. 8B model, fp16 (16 GB of weights).

ridge      : 312e12 / 2.0e12 = 156 FLOPs/byte
decode, batch 1 : intensity = 2·1/2 = 1        → 1 ≪ 156: bandwidth-bound, using 1/156 ≈ 0.6% of compute
prefill, 2048 tokens : intensity = 2048        → ≫ 156: compute-bound
crossover  : 156 tokens/step                   → the "critical batch size"
speed limit: 2.0e12 / 16e9 = 125 steps/s       → 125 tok/s per sequence, max, ever

Read them like an engineer:

  • 0.6% compute utilization for single-stream decode. The other 99.4% of a $15k GPU is structurally idle — not because of bad kernels, but because one token per weight-trip is all the arithmetic the workload offers. This is the number that makes batching non-optional: every additional sequence in the decode batch reuses the same weight bytes, adding intensity ~1 per sequence, for free until you hit the ridge at ~156. Batch 64 → 64× the tokens/s at essentially the same step time. That free lunch is the entire economic basis of serving (and of this course's obsession with fitting more sequences in memory — lab-02 — since memory is what caps the batch).
  • 125 tokens/s is a physics ceiling, not an engineering one: a decode step cannot complete faster than the weights can stream from HBM. Measure any well-tuned 8B/fp16 deployment and single-stream decode sits at 60–80% of this bound (the rest: KV reads, kernel overheads). When someone promises 500 tok/s single-stream on this hardware, they're describing quantization (fewer bytes — note dtype_bytes in your formula), speculation (Phase 8), or fiction.
  • Prefill at 2048 is compute-bound — which is why TTFT responds to better kernels and FlashAttention (Phase 4), while ITL responds to memory bandwidth and quantization. Two metrics, two regimes, two completely disjoint optimization menus. Now you know why Phase 3's chunked prefill mixes the phases in one batch: decode steps have idle FLOPs; prefill chunks are pure FLOPs; together they fill the roofline from both sides (Sarathi's "piggybacking", which you measured as [33, 33, …] in Phase 3 lab-05).
  • The crossover at 156 is worth memorizing as a shape, not a number: it moves with hardware (H100: ~295 fp16; consumer cards: lower) and dtype. "Decode needs ~ridge-many tokens per step to saturate compute" is the portable version.

What the tests prove

TestWhat it pins
test_a100_ridge_point156.0 — the one hardware constant of this lab
test_intensity_is_just_tokens_over_dtypeThe cancellation: 8B and 70B give identical intensity. If this surprises you, reread the Background — it's the lab's central fact
test_single_decode_is_hopelessly_bandwidth_boundIntensity 1 vs ridge 156
test_prefill_is_compute_boundIntensity 2048 vs ridge 156 — same model, opposite regime
test_critical_batch_size_is_the_ridge155 tokens: memory-bound; 156: compute-bound. The crossover is exactly the ridge, because intensity (fp16) = tokens
test_decode_speed_limit_8b_fp16125 tok/s — bandwidth over weight bytes, the unbeatable ceiling
test_batching_multiplies_decode_throughput_for_freeBatch 64 → 8000 tok/s from the same weight stream — the free lunch, quantified

Hitchhiker's notes

  • Where the weights-only model bends: at long contexts, KV reads become the dominant bytes (128 KiB/token from lab-02 × thousands of tokens × batch — eventually exceeding the 16 GB weight read!). That's why long-context decode gets slower per token even though "the model is the same size," why GQA/MLA exist (shrink KV bytes), and why Phase 18 extends this lab's model with a KV-traffic term. First-order tools, knowingly applied, then refined — that's the discipline.
  • Why ~2 FLOPs per param per token? Each parameter sits in some matrix; processing a token multiplies it by one activation and adds into an accumulator — one FMA = 2 FLOPs. Attention adds FLOPs quadratic in sequence length on top (it's parameter-free, so it escapes this accounting); for short-to-moderate contexts the linear layers dominate and 2·N·T is a good model. The famous training version of the same estimate is 6·N·T (forward + backward); inference keeps only the 2.
  • Quantization through this lens: INT4 weights = 4 GB to stream = 500 steps/s ceiling, 4× decode speedup with zero kernel cleverness — bandwidth-bound workloads reward byte-shrinking one-for-one. But the same INT4 does ~nothing for compute-bound prefill (the FLOPs still happen, often in fp16 after dequant). One optimization, two regimes, two completely different value propositions — Phase 6 lives here.
  • The ridge explains CUDA graphs too (Phase 5): your 125 steps/s ceiling means a decode step takes ≥ 8 ms on this 8B model (16 GB streamed at 2 TB/s) — at that scale a few hundred microseconds of kernel-launch overhead is noise. Shrink the model to 1B and steps drop toward 1 ms; suddenly launch overhead is a first-order cost, and capturing the whole step as one CUDA graph pays for itself. The roofline tells you when overhead optimizations matter, too.

Going further

  • Recompute everything for an H100 SXM (~990 TFLOPs fp16, ~3.35 TB/s): ridge ≈ 295, 8B decode ceiling ≈ 209 tok/s. Notice the ridge rose — new GPUs gain FLOPs faster than bandwidth, so decode gets relatively more memory-bound every generation. That trend line is why KV/weight compression research keeps accelerating.
  • Add a kv_bytes_per_step(batch, context_len, kv_per_token) term (from lab-02) to decode_tokens_per_second and find the context length where KV traffic overtakes the weight read for batch 64. You've just derived the long-context wall.
  • Plot the roofline: log-log, intensity on x, achievable FLOPs on y, the two plateaus, and drop points for decode batch {1, 8, 64, 156, 512} and prefill {128, 2048}. This single figure is the mental map for all of Phase 18 — draw it once by hand.

References