Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 03-06 — Prefix Caching: Count Every Token It Saves [CPU-OK]

Lab-03 showed you prefix caching through the real engine's meters — hit rates, throughput averages, wall-clock noise. This lab removes the noise. On the mini engine, you'll run the same shared-system-prompt workload with caching off and on and account for the savings to the exact token: 544 scheduled tokens uncached, 96 cached, difference 448 = 7 followers × 16 blocks × 4 tokens. Not approximately. Exactly. When you can predict a cache's benefit with integer arithmetic before running it, you understand the cache.

Contents


Why this lab exists

Three phases of machinery converge here, and this lab is where you check that you can predict their composition: Phase 2's block hashing and sharing (lab 02-05), this phase's admission path (get_computed_blocks → adopt → allocate, lab 03-01), and the scheduling identities from Phase 1 lab-04 (Σ scheduled = prompt + max_tokens − 1). If your prediction of the cached total is off by even one token, one of those three mental models has a crack in it — and the integers will tell you which (that's how the over-allocation bug mentioned in Phase 2 lab-05 was actually found: an exact count disagreed).

The professional skill is dimensioned estimation of cache value. "Enable prefix caching and things get faster" is advocacy. "This workload shares a 64-token block-aligned prefix across 8 requests, so caching eliminates exactly 7×64 = 448 of 544 scheduled prefill+decode tokens — an 82% compute reduction on this batch, and here's the count" is engineering. The GPU version (lab-03) gives you the wall-clock corroboration; this lab gives you the theorem.

Background: what "scheduled tokens" measures

The probe sums total_num_scheduled_tokens over every schedule() call — every token of forward-pass work the scheduler ever requested. It is the engine's compute odometer: prefill chunks, cache-miss remainders, decodes, everything. Two properties make it the right meter here:

  • It's conserved: work not scheduled is work not done. There is no place for savings to hide or double-count.
  • It's schedule-invariant in total: chunking and batching rearrange when tokens are scheduled, never how many (lab-02/05). Only caching changes the total — by replacing computed tokens with adopted KV. So the off/on difference isolates the cache's effect perfectly. Experimental design through invariants: pick a meter on which everything else you might accidentally vary is provably neutral.

Files

  • starter.py — implement run_and_count(...): the probe + generate, returning the odometer total and the outputs. Your work.
  • solution.py — reference.
  • test_lab.py — exact totals for both arms, the savings identity, output equality, and the share-nothing control.

Run

LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-06-prefix-cache-savings -q
pytest phase-03-continuous-batching-scheduler/labs/lab-06-prefix-cache-savings -q   # reference

What to implement

The Phase 1 lab-04 probe, reduced to an accumulator: wrap eng.scheduler.schedule, add up out.total_num_scheduled_tokens, run eng.generate(...) over the prompts (greedy, ignore_eos), return (total, token_ids_per_prompt). Ten lines. The thinking is in the test predictions — write those yourself on paper before running anything.

The accounting, line by line

Workload: SYSTEM = "S"×64 (64 byte-tokens = exactly 16 full blocks at block_size 4 — alignment chosen deliberately), 8 prompts SYSTEM + str(i) (65 tokens, unique last token), max_tokens=4, greedy.

Caching off — every request pays full price:

per request: 65 (prefill) + 3 (decodes; the 4th token is sampled but never computed — Phase 1 lab-04)
           = 68
total      : 8 × 68 = 544

Caching on — the pioneer pays, the followers ride:

request 0   : 65 + 3 = 68      (cold cache: populates 16 block hashes during its prefill)
requests 1–7: 1 + 3  = 4 each  (!!)
total       : 68 + 7×4 = 96
savings     : 544 − 96 = 448 = 7 × 64

That 1 deserves a pause — it's three of this course's rules colliding in one token:

  1. The follower's 65-token prompt hits all 16 full blocks → 64 tokens adopted free.
  2. The hit cap (num_tokens − 1, Phase 2 lab-05) wouldn't bind here (64 ≤ 64), but the 65th token couldn't hit anyway: it's in a partial block (I3 — never cached) and it must be computed to produce logits (you need the model's output at the last position, and the cache stores only KV).
  3. So the scheduler admits the request with num_computed = 64, schedules exactly 1 token, and — because 64 + 1 == 65 — that same step samples (needs_sample, Phase 1 lab-03). A one-token prefill that immediately emits: the strangest-looking line you'll see in a scheduler trace, and now you can explain it.

Also notice when the followers hit: all 8 requests are admitted in the same schedule() call, yet requests 1–7 still hit blocks request 0 cached microseconds earlier — because mini_vllm (like upstream) registers blocks in the cache index at allocation time, inside the same admission loop. Caching is eager; sharing begins before the pioneer has computed a single value. (The KV contents don't exist yet — but the reservation is shared, and the prefill that fills it runs once. If that bends your brain, good; it's the detail most explanations skip.)

What the tests prove

TestWhat it pins
test_caching_off_pays_full_price_for_everyoneThe baseline identity: 8 × (65 + 3) = 544, no cache, no surprises
test_caching_on_computes_the_shared_prefix_onceThe cached total, exactly: 68 + 7×4 = 96 — every rule above, composed correctly
test_savings_equal_followers_times_shared_full_blocksThe savings identity (N−1) × shared_full_block_tokens — the formula you'll reuse to estimate cache value on any workload
test_outputs_are_identical_with_and_without_cachingCaching is a pure performance feature: same tokens out. (The cached KV is the KV — Phase 2 lab-06's identity, economically applied)
test_unshared_prompts_save_nothingThe control arm: distinct prompts share only a sliver of block-aligned prefix → savings < 25%. Caching is workload-dependent; anyone selling it flat-rate is selling

Hitchhiker's notes

  • The alignment was rigged, and you should notice. SYSTEM is exactly 16 blocks. Make it 66 tokens (16.5 blocks) and followers hit only 64 of 66 — the half-full block 17 recomputes for everyone, forever. On real tokenizers you don't control alignment, which is why measured hit rates hover below the naive prediction (lab-03's 93.7%) and why block_size enters cache math, not just memory math.
  • Map the integer totals to the GPU meters: hit rate ≈ adopted/looked-up = 7×64 / (some denominator including tails); prompt-throughput ratio ≈ 544-ish/96-ish ≈ 5× — squarely the 4–5× lab-03 measured through wall-clock noise. Exact model + noisy measurement agreeing is how you validate both; either alone can fool you.
  • Why followers cost 4 while the pioneer costs 68 is the per-request view of the economics: a follower's marginal cost is its unique content plus decodes. System prompts become nearly free at the margin; what stays expensive is what's per-user. This inverts prompt-engineering economics — long, rich shared instructions are cheap; per-request context is what you trim. Product decisions hang on this inversion.
  • enable_caching=False exists for a reason — it's the control arm of every caching benchmark, and occasionally a production choice (e.g. strict multi-tenant isolation — see lab-03's security note). A feature you can't turn off is a feature you can't measure.

Going further

  • Multi-turn: simulate a 5-turn conversation (each turn's prompt = previous prompt + previous output + new question) and predict, then measure, the per-turn scheduled tokens with caching on. You should see each turn pay only its delta. This is the chat-history result from lab-03's notes, now exact.
  • Eviction pressure: shrink num_blocks until followers stop hitting (the pioneer's blocks get evicted by the followers' own decode growth — Phase 2 lab-05's queue mechanics). Find the cliff; explain its location from pool arithmetic.
  • Derive the general formula: for N requests sharing a P-token prefix (block size B), savings = (N−1) × B × ⌊P/B⌋ ... except when the unique suffix is empty — then the hit cap (num_tokens − 1) bites and the formula needs a correction term. Write the corrected version and the test that proves it. (This edge — identical entire prompts — is exactly lab-03's 8-identical-prompts experiment.)

References

  • mini_vllm/scheduler.py — the WAITING-phase get_computed_blocks admission path you're metering (and the num_new_tokens == 0 → schedule 1 branch behind the one-token prefill).
  • mini_vllm/kv_cache.py::_cache_full_blocks — eager caching at allocation time.
  • upstream/vllm/v1/core/kv_cache_manager.py — the production twin of both.
  • vLLM docs, Automatic Prefix Caching — design doc: https://docs.vllm.ai/en/latest/design/prefix_caching.html
  • Phase 2 lab-05 — the block-level mechanics (ref counts, hit cap, revival) this lab meters through the scheduler.
  • Phase 3 lab-03 — the same experiment on real hardware, with wall-clocks attached.