Lab 03-06 — Prefix Caching: Count Every Token It Saves [CPU-OK]
Lab-03 showed you prefix caching through the real engine's meters — hit rates, throughput
averages, wall-clock noise. This lab removes the noise. On the mini engine, you'll run the
same shared-system-prompt workload with caching off and on and account for the savings to
the exact token: 544 scheduled tokens uncached, 96 cached, difference 448 = 7 followers × 16 blocks × 4 tokens. Not approximately. Exactly. When you can predict a
cache's benefit with integer arithmetic before running it, you understand the cache.
Contents
- Why this lab exists
- Background: what "scheduled tokens" measures
- Files
- Run
- What to implement
- The accounting, line by line
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Three phases of machinery converge here, and this lab is where you check that you can
predict their composition: Phase 2's block hashing and sharing (lab 02-05), this phase's
admission path (get_computed_blocks → adopt → allocate, lab 03-01), and the scheduling
identities from Phase 1 lab-04 (Σ scheduled = prompt + max_tokens − 1). If your prediction
of the cached total is off by even one token, one of those three mental models has a crack
in it — and the integers will tell you which (that's how the over-allocation bug mentioned
in Phase 2 lab-05 was actually found: an exact count disagreed).
The professional skill is dimensioned estimation of cache value. "Enable prefix caching and things get faster" is advocacy. "This workload shares a 64-token block-aligned prefix across 8 requests, so caching eliminates exactly 7×64 = 448 of 544 scheduled prefill+decode tokens — an 82% compute reduction on this batch, and here's the count" is engineering. The GPU version (lab-03) gives you the wall-clock corroboration; this lab gives you the theorem.
Background: what "scheduled tokens" measures
The probe sums total_num_scheduled_tokens over every schedule() call — every token of
forward-pass work the scheduler ever requested. It is the engine's compute odometer:
prefill chunks, cache-miss remainders, decodes, everything. Two properties make it the
right meter here:
- It's conserved: work not scheduled is work not done. There is no place for savings to hide or double-count.
- It's schedule-invariant in total: chunking and batching rearrange when tokens are scheduled, never how many (lab-02/05). Only caching changes the total — by replacing computed tokens with adopted KV. So the off/on difference isolates the cache's effect perfectly. Experimental design through invariants: pick a meter on which everything else you might accidentally vary is provably neutral.
Files
starter.py— implementrun_and_count(...): the probe +generate, returning the odometer total and the outputs. Your work.solution.py— reference.test_lab.py— exact totals for both arms, the savings identity, output equality, and the share-nothing control.
Run
LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-06-prefix-cache-savings -q
pytest phase-03-continuous-batching-scheduler/labs/lab-06-prefix-cache-savings -q # reference
What to implement
The Phase 1 lab-04 probe, reduced to an accumulator: wrap eng.scheduler.schedule, add up
out.total_num_scheduled_tokens, run eng.generate(...) over the prompts (greedy,
ignore_eos), return (total, token_ids_per_prompt). Ten lines. The thinking is in the
test predictions — write those yourself on paper before running anything.
The accounting, line by line
Workload: SYSTEM = "S"×64 (64 byte-tokens = exactly 16 full blocks at block_size 4 —
alignment chosen deliberately), 8 prompts SYSTEM + str(i) (65 tokens, unique last token),
max_tokens=4, greedy.
Caching off — every request pays full price:
per request: 65 (prefill) + 3 (decodes; the 4th token is sampled but never computed — Phase 1 lab-04)
= 68
total : 8 × 68 = 544
Caching on — the pioneer pays, the followers ride:
request 0 : 65 + 3 = 68 (cold cache: populates 16 block hashes during its prefill)
requests 1–7: 1 + 3 = 4 each (!!)
total : 68 + 7×4 = 96
savings : 544 − 96 = 448 = 7 × 64
That 1 deserves a pause — it's three of this course's rules colliding in one token:
- The follower's 65-token prompt hits all 16 full blocks → 64 tokens adopted free.
- The hit cap (
num_tokens − 1, Phase 2 lab-05) wouldn't bind here (64 ≤ 64), but the 65th token couldn't hit anyway: it's in a partial block (I3 — never cached) and it must be computed to produce logits (you need the model's output at the last position, and the cache stores only KV). - So the scheduler admits the request with
num_computed = 64, schedules exactly1token, and — because64 + 1 == 65— that same step samples (needs_sample, Phase 1 lab-03). A one-token prefill that immediately emits: the strangest-looking line you'll see in a scheduler trace, and now you can explain it.
Also notice when the followers hit: all 8 requests are admitted in the same
schedule() call, yet requests 1–7 still hit blocks request 0 cached microseconds
earlier — because mini_vllm (like upstream) registers blocks in the cache index at
allocation time, inside the same admission loop. Caching is eager; sharing begins before
the pioneer has computed a single value. (The KV contents don't exist yet — but the
reservation is shared, and the prefill that fills it runs once. If that bends your
brain, good; it's the detail most explanations skip.)
What the tests prove
| Test | What it pins |
|---|---|
test_caching_off_pays_full_price_for_everyone | The baseline identity: 8 × (65 + 3) = 544, no cache, no surprises |
test_caching_on_computes_the_shared_prefix_once | The cached total, exactly: 68 + 7×4 = 96 — every rule above, composed correctly |
test_savings_equal_followers_times_shared_full_blocks | The savings identity (N−1) × shared_full_block_tokens — the formula you'll reuse to estimate cache value on any workload |
test_outputs_are_identical_with_and_without_caching | Caching is a pure performance feature: same tokens out. (The cached KV is the KV — Phase 2 lab-06's identity, economically applied) |
test_unshared_prompts_save_nothing | The control arm: distinct prompts share only a sliver of block-aligned prefix → savings < 25%. Caching is workload-dependent; anyone selling it flat-rate is selling |
Hitchhiker's notes
- The alignment was rigged, and you should notice.
SYSTEMis exactly 16 blocks. Make it 66 tokens (16.5 blocks) and followers hit only 64 of 66 — the half-full block 17 recomputes for everyone, forever. On real tokenizers you don't control alignment, which is why measured hit rates hover below the naive prediction (lab-03's 93.7%) and why block_size enters cache math, not just memory math. - Map the integer totals to the GPU meters: hit rate ≈ adopted/looked-up = 7×64 / (some denominator including tails); prompt-throughput ratio ≈ 544-ish/96-ish ≈ 5× — squarely the 4–5× lab-03 measured through wall-clock noise. Exact model + noisy measurement agreeing is how you validate both; either alone can fool you.
- Why followers cost 4 while the pioneer costs 68 is the per-request view of the economics: a follower's marginal cost is its unique content plus decodes. System prompts become nearly free at the margin; what stays expensive is what's per-user. This inverts prompt-engineering economics — long, rich shared instructions are cheap; per-request context is what you trim. Product decisions hang on this inversion.
enable_caching=Falseexists for a reason — it's the control arm of every caching benchmark, and occasionally a production choice (e.g. strict multi-tenant isolation — see lab-03's security note). A feature you can't turn off is a feature you can't measure.
Going further
- Multi-turn: simulate a 5-turn conversation (each turn's prompt = previous prompt + previous output + new question) and predict, then measure, the per-turn scheduled tokens with caching on. You should see each turn pay only its delta. This is the chat-history result from lab-03's notes, now exact.
- Eviction pressure: shrink
num_blocksuntil followers stop hitting (the pioneer's blocks get evicted by the followers' own decode growth — Phase 2 lab-05's queue mechanics). Find the cliff; explain its location from pool arithmetic. - Derive the general formula: for N requests sharing a P-token prefix (block size B),
savings =
(N−1) × B × ⌊P/B⌋... except when the unique suffix is empty — then the hit cap (num_tokens − 1) bites and the formula needs a correction term. Write the corrected version and the test that proves it. (This edge — identical entire prompts — is exactly lab-03's 8-identical-prompts experiment.)
References
mini_vllm/scheduler.py— the WAITING-phaseget_computed_blocksadmission path you're metering (and thenum_new_tokens == 0 → schedule 1branch behind the one-token prefill).mini_vllm/kv_cache.py::_cache_full_blocks— eager caching at allocation time.upstream/vllm/v1/core/kv_cache_manager.py— the production twin of both.- vLLM docs, Automatic Prefix Caching — design doc: https://docs.vllm.ai/en/latest/design/prefix_caching.html
- Phase 2 lab-05 — the block-level mechanics (ref counts, hit cap, revival) this lab meters through the scheduler.
- Phase 3 lab-03 — the same experiment on real hardware, with wall-clocks attached.