Lab 03-03 — Measure Real Prefix-Cache Hit Rate [GPU-OPT]

Every production LLM workload has a shape, and the shape is almost always "a long shared prefix, then a short unique tail" — system prompts, few-shot examples, conversation history, RAG boilerplate. Prefix caching turns that shared prefix from N prefills into one. In this lab you run the real engine on exactly that workload and watch the meters move: the hit-rate counter climbing to ~94%, prompt throughput jumping 4–5×, and KV usage staying near 1× a single prompt. These are the numbers that justify the feature — and you'll know how to reproduce them on your workload, which is the question that actually matters.

No GPU? Don't panic. The captured run below is annotated line by line; the analysis sections work entirely on paper. And lab-06 reproduces this experiment on the mini engine, CPU-only, with exact token accounting — do that one hands-on.

Why this lab exists

Prefix caching is the rare optimization that is simultaneously huge (multi-× on the right workload), free to enable (default-on in modern vLLM), and workload-dependent enough to be oversold (≈0 benefit on share-nothing batch jobs). An engineer who can't measure it is at the mercy of vibes in both directions. This lab builds the measurement reflex: construct a workload with known sharing, run with the feature off and on, and read three independent meters that must agree (hit rate, prompt throughput, KV usage). When the meters don't agree — hit rate high but no speedup, say — you've learned something real (often: the prefix wasn't block-aligned, or the workload was decode-dominated all along).

The same experiment is also your template for capacity claims: "enabling prefix caching will let this deployment serve 3× the QPS" is a sentence you should only say after running this lab's shape against your traffic.

Background: what a "hit" buys

From Phase 2 lab-05 you know the mechanism: full blocks of the prompt are content-hashed (parent-chained), and a new request adopts any cached chain head — touch, ref-count bump, zero compute. What that buys, concretely, per hit token:

  • Prefill compute: the entire forward pass for that token — skipped. TTFT for a request with an N-token cached prefix drops by roughly N/(prefill speed).
  • KV memory: the hit blocks are shared, not copied (ref_cnt += 1). Sixteen requests sharing a 1000-token system prompt store its KV once.
  • What it never buys: decode. Generated tokens are new by definition. A workload that prefills 50 tokens and decodes 2000 saves almost nothing — check your prefill:decode ratio before promising miracles.

The unit of caching is the full block (Phase 2's I3): a 130-token shared prefix at block_size 16 hits at most 8 blocks = 128 tokens, and divergence mid-block forfeits that block. Hence the operator's rule of thumb: put the static part first, pad nothing, and remember that the boundary token of your template matters more than you'd think.
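The block-granularity rule is easy to check by hand with a toy model. The sketch below is not vLLM's code: `block_hashes` and `cached_prefix_tokens` are hypothetical helpers, the hash is plain SHA-256 over token ids, and the "cache" is a Python set. It only illustrates parent-chained hashing of full blocks and the num_tokens − 1 cap described above.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block, matching the example in the text


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash every FULL block, chaining in the parent's hash so that a
    block's identity depends on all tokens before it (toy scheme)."""
    hashes = []
    parent = ""
    for b in range(len(token_ids) // block_size):
        block = token_ids[b * block_size:(b + 1) * block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(cache, token_ids, block_size=BLOCK_SIZE):
    """Tokens covered by the longest run of cached blocks from the start.
    The last token is never looked up, so at least one token is computed."""
    usable = token_ids[:len(token_ids) - 1]
    hit_blocks = 0
    for h in block_hashes(usable, block_size):
        if h not in cache:
            break  # the chain is broken; later blocks cannot match either
        hit_blocks += 1
    return hit_blocks * block_size


shared = list(range(130))                   # 130-token shared prefix
req_a = shared + list(range(1000, 1020))    # unique tail A
req_b = shared + list(range(2000, 2020))    # unique tail B

cache = set(block_hashes(req_a))            # request A populates the cache
print(cached_prefix_tokens(cache, req_b))   # 128: block 8 diverges mid-block

exact = list(range(128))                    # a prompt of exactly 8 full blocks
print(cached_prefix_tokens(set(block_hashes(exact)), exact))  # 112, not 128
```

The second call shows the cap: even an identical 128-token prompt reuses only 7 blocks, because one token must always be left to compute.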

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct   # small, modern, instruct-tuned

Steps

# run.py
from vllm import LLM, SamplingParams

SYSTEM = "You are a meticulous assistant. Follow instructions carefully. " * 30  # shared prefix; the analysis below takes it as ~430 tokens

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    enable_prefix_caching=True,      # <- the feature under test; flip for the control run
    gpu_memory_utilization=0.6,
    max_model_len=4096,
)

# 16 requests sharing SYSTEM, each with a unique tail.
prompts = [f"{SYSTEM}\n\nQuestion {i}: what is {i}+{i}?" for i in range(16)]
out = llm.generate(prompts, SamplingParams(max_tokens=16, temperature=0))
for o in out[:2]:
    print(repr(o.outputs[0].text))

Run twice — enable_prefix_caching=True then False — both under VLLM_LOGGING_LEVEL=DEBUG, and collect the three meters below for each. (One subtlety of this script: all 16 requests are submitted in one generate call, so requests 1..15 hit blocks request 0 cached moments earlier in the same run — vLLM caches blocks as soon as they're full, not when the request finishes. The mechanism is identical for requests arriving minutes apart, as long as the blocks survive eviction.)

What to measure

| Metric                | prefix caching OFF       | prefix caching ON             |
|-----------------------|--------------------------|-------------------------------|
| Prefix cache hit rate | 0% (counter absent/zero) | climbs toward (N−1)/N         |
| Avg prompt throughput | baseline                 | several × baseline            |
| Peak KV-cache usage   | ~16 × SYSTEM + tails     | ~1 × SYSTEM + tails           |
| TTFT, requests 1..15  | full prefill each        | only the unique tail prefills |

The first three appear in the debug logs. TTFT comes from per-request timing if you use the API server; in this offline script, infer it from prompt throughput.

Captured output (real run, Qwen2.5-0.5B, L4 24GB, vLLM 0.22.1)

# enable_prefix_caching=True
INFO ... Automatic prefix caching is enabled.
DEBUG ... Prefix cache hit rate: GPU: 0.00%      (request 0 populates the cache)
DEBUG ... Prefix cache hit rate: GPU: 93.7%      (requests 1..15 reuse SYSTEM's blocks)
INFO ... Avg prompt throughput: 41523 tokens/s   (mostly cached -> not recomputed)
'4'  '6'

# enable_prefix_caching=False (same workload)
DEBUG ... Prefix cache hit rate: GPU: 0.00%
INFO ... Avg prompt throughput: 9120 tokens/s    (every SYSTEM prefilled from scratch)
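Pulling the meters out of a log by eye gets tedious once you repeat the run on your own traffic. A minimal sketch of a parser follows, assuming the log lines are worded exactly as in the capture above; other vLLM versions may phrase them differently, so treat the two patterns as assumptions to adjust. `read_meters` is a hypothetical helper, not part of vLLM.

```python
import re

# Patterns copied from the captured lines above (an assumption about log wording).
HIT_RATE = re.compile(r"Prefix cache hit rate: GPU: ([\d.]+)%")
PROMPT_TPUT = re.compile(r"Avg prompt throughput: ([\d.]+) tokens/s")


def read_meters(log_text):
    """Return the last reported hit rate and the peak prompt throughput."""
    hit_rates = [float(x) for x in HIT_RATE.findall(log_text)]
    tputs = [float(x) for x in PROMPT_TPUT.findall(log_text)]
    return {
        "final_hit_rate_pct": hit_rates[-1] if hit_rates else None,
        "peak_prompt_tput": max(tputs) if tputs else None,
    }


ON_LOG = """\
INFO ... Automatic prefix caching is enabled.
DEBUG ... Prefix cache hit rate: GPU: 0.00%
DEBUG ... Prefix cache hit rate: GPU: 93.7%
INFO ... Avg prompt throughput: 41523 tokens/s
"""
OFF_LOG = """\
DEBUG ... Prefix cache hit rate: GPU: 0.00%
INFO ... Avg prompt throughput: 9120 tokens/s
"""

on, off = read_meters(ON_LOG), read_meters(OFF_LOG)
speedup = on["peak_prompt_tput"] / off["peak_prompt_tput"]
print(on["final_hit_rate_pct"], off["final_hit_rate_pct"], round(speedup, 1))
# 93.7 0.0 4.6
```

Taking the last hit-rate line and the peak throughput is one reasonable choice for a short batch run; for a long-running server you would more likely average over a steady-state window.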

Reading the numbers like an operator

  • 0.00% → 93.7%: the first line is request 0 paying full price (the pioneer effect: a cache nobody populated cannot hit; it is the same 1/N miss you saw in Phase 2 lab-03's 87.5%). Then 15 of 16 requests reuse SYSTEM's blocks. The figure sits close to 15/16 = 93.75%, but do not read it as a request count. The denominator is queries (tokens looked up), and each request's unique tail plus its final block can't hit, which is the cap from Phase 2 lab-05: a hit covers at most full blocks of at most num_tokens − 1 tokens. Hit rates have denominators; always ask what's in them before quoting one.
  • 41523 vs 9120 tokens/s prompt throughput — the 4.6× is the shared prefix being computed once instead of 16 times. Sanity-check the ratio: with ~430 shared + ~15 unique tokens per prompt, the cached run computes ~1×430 + 16×15 ≈ 670 prefill tokens where the uncached run computes 16×445 ≈ 7120 — a ~10× compute saving, surfaced as ~4.6× in the wall-clock meter (the meter averages over windows that include decode time too). Meters measure what they measure; derive what you expected before trusting the headline.
  • The outputs are '4' and '6' — the same answers the uncached run gives. Cached KV is the same KV (Phase 2 lab-06's identity theorem, now economically significant). Correctness meters and performance meters move independently; check both.
  • Same arithmetic as lab-06 — which computes the exact scheduled-token saving ((N−1) × full-blocks-of-shared-prefix) on the mini engine where every token is countable. The GPU numbers above are that arithmetic, plus wall-clock noise.
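The sanity check in the second bullet can be scripted so you can rerun it with your own workload shape. This is a sketch under stated assumptions: `prefill_tokens` is a hypothetical helper, the 430/15 token counts are the approximate figures from the text, and block_size=16 is assumed to match the engine's block size. With `block_size=None` it reproduces the text's back-of-envelope estimate; with a block size it also charges each request for the partial block it cannot reuse.

```python
def prefill_tokens(n_requests, shared, unique, caching, block_size=None):
    """Prompt tokens the engine must actually compute for the whole batch."""
    prompt_len = shared + unique
    if not caching:
        return n_requests * prompt_len
    # Only full blocks of the shared prefix are reusable.
    reusable = shared if block_size is None else (shared // block_size) * block_size
    # Request 0 pays for everything; the other N-1 pay only for what misses.
    return prompt_len + (n_requests - 1) * (prompt_len - reusable)


uncached = prefill_tokens(16, 430, 15, caching=False)
estimate = prefill_tokens(16, 430, 15, caching=True)                 # ignores blocks
aligned = prefill_tokens(16, 430, 15, caching=True, block_size=16)   # full blocks only

print(uncached, estimate, aligned)        # 7120 670 880
print(round(uncached / estimate, 1))      # 10.6
print(round(uncached / aligned, 1))       # 8.1
print(round(41523 / 9120, 1))             # 4.6, the observed throughput ratio
```

Either way the expected compute saving (roughly 8× to 10×) is well above the observed 4.6×, which is consistent with the meter averaging over time that is not pure prefill.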

Hitchhiker's notes

  • Conversation history is the killer app, not just system prompts: each turn re-sends the whole transcript, which is, by construction, a growing shared prefix with itself. With roughly equal-length turns, turn T prefills about 1/T of what it would uncached, and the chat as a whole saves about (T+1)/2× on prefill. This is why every serious chat API (and vLLM-based products) leans on prefix/prompt caching, and why the commercial APIs sell it explicitly (Anthropic/OpenAI "prompt caching": same idea, different billing).
  • What invalidates a cached prefix: eviction under memory pressure (the blocks are still just free-queue citizens — Phase 2 lab-05), reset_prefix_cache(), restart, or anything that changes what the KV means: different LoRA adapter, different model, different chat-template rendering of the "same" text. The hash chain includes token ids only after templating — two prompts that render differently share nothing, which is the most common "why is my hit rate 0" in practice (timestamp in the system prompt, per-user name early in the template, randomized example order...).
  • n>1 parallel sampling (Phase 9) reuses this exact machinery — N samples of one prompt share the prompt's blocks via the same ref_cnt mechanics. So do beam search and speculative-decoding draft trees. "Share immutable prefix KV via refcounted blocks" is load-bearing infrastructure, not a feature flag.
  • Security note for multi-tenant operators: cache timing is observable — a fast TTFT reveals someone recently prefilled the same prefix. Cross-tenant prefix caching can therefore leak prompt equality across tenants (a real, published attack class against LLM caches). vLLM's cache is per-engine; if you front multiple tenants, decide deliberately whether their prefixes may share a pool.
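The conversation-history note can be made concrete with a toy count. The sketch below assumes every turn appends the same number of tokens to the transcript, that the whole earlier transcript is still cached when the next turn arrives, and it ignores block alignment and eviction. `chat_prefill_tokens` is a hypothetical helper for illustration only.

```python
def chat_prefill_tokens(turns, tokens_per_turn, caching):
    """Total prompt tokens computed over a chat that re-sends its transcript."""
    total = 0
    for t in range(1, turns + 1):
        prompt_len = t * tokens_per_turn                       # transcript so far
        cached = (t - 1) * tokens_per_turn if caching else 0   # earlier turns
        total += prompt_len - cached
    return total


T, L = 10, 200
without = chat_prefill_tokens(T, L, caching=False)
with_cache = chat_prefill_tokens(T, L, caching=True)
print(without, with_cache, without / with_cache)   # 11000 2000 5.5
print((T * L) / L)                                 # 10.0: saving on the last turn alone
```

Under these assumptions the final turn saves T× while the whole conversation saves (T+1)/2×, so which figure you quote depends on whether you care about worst-turn TTFT or total compute.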

Reflect

  • Why does the first request show 0% even though the cache is enabled? And what is the steady-state hit rate of this workload as N → ∞? ((N−1)/N of the shareable tokens — the pioneer cost amortizes to nothing.)
  • Your workload prefills 2000 tokens of RAG context (unique per query!) and decodes 100. What hit rate do you expect? (~0 — unique context shares nothing. What would help? Reordering the prompt so static instructions precede the unique context, and caching exactly that. Prompt structure is a performance interface.)
  • Estimate: 16 requests × 430-token SYSTEM at ~36 KB/token-ish for a 0.5B model — how much KV memory did sharing save, in MB? Now do it for a 70B model at 405 KB/token and 64 concurrent requests. (This is why prefix caching is also a capacity feature, per Phase 2 lab-03's concurrency math.)
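For the last estimate, the arithmetic is short enough to script. The sketch below uses the per-token figures quoted in the question as given (it does not derive them from model configs), assumes decimal units (1 MB = 1000 KB), reuses the 430-token prefix for both cases, and ignores block alignment. `kv_saved_mb` is a hypothetical helper.

```python
def kv_saved_mb(n_requests, shared_tokens, kb_per_token):
    """KV memory NOT allocated because N requests share one copy of the prefix."""
    # Without sharing: N copies. With sharing: 1 copy. Saved: N-1 copies.
    saved_kb = (n_requests - 1) * shared_tokens * kb_per_token
    return saved_kb / 1000  # decimal MB


small = kv_saved_mb(16, 430, 36)    # 0.5B-class figure from the question
large = kv_saved_mb(64, 430, 405)   # 70B-class figure from the question
print(round(small, 1), "MB")        # 232.2 MB
print(round(large / 1000, 1), "GB") # 11.0 GB
```

On the small model the saving is a rounding error; at 70B scale it is memory that would otherwise cap how many requests fit concurrently.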

References

  • upstream/vllm/v1/core/kv_cache_manager.py::get_computed_blocks — where the hit happens, including the hit-rate accounting you watched.
  • mini_vllm/kv_cache.py::get_computed_blocks — the same logic at readable scale (Phase 2 lab-05 exercises it directly).
  • vLLM docs, Automatic Prefix Caching — design + operational notes: https://docs.vllm.ai/en/latest/design/prefix_caching.html
  • Zheng et al., SGLang / RadixAttention (2023) — prefix reuse generalized to a tree; the natural next read: https://arxiv.org/abs/2312.07104
  • Anthropic, Prompt caching announcement (2024) — the same economics, productized; good for building intuition about real workload shapes: https://www.anthropic.com/news/prompt-caching
  • Lab-06 in this phase — the CPU twin with exact token accounting.