Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 02-05 — Share and Evict: the Life of a Cached Block [CPU-OK]

Lab-01 built the allocator's mechanisms. This lab plays them like an instrument. You'll run two requests with identical prompts and watch their block tables converge onto the same physical blocks; free them and watch the blocks linger in the cache as eviction candidates; apply memory pressure and watch eviction consume them in exactly the order that preserves the most valuable prefix longest; and finally watch a "dead" request's blocks get revived from the middle of the free queue by a newcomer — the maneuver the whole hand-rolled linked list exists for.

This is the block's full biography: allocated → shared → orphaned-but-cached → revived (or evicted). After this lab, prefix caching is not a feature you enable; it's a state machine you can narrate.

Contents


Why this lab exists

Prefix caching is the highest-leverage feature in modern LLM serving — every chatbot re-sends its system prompt and conversation history with every turn, and caching turns that repeated prefill into a hash lookup (Phase 3 lab-03 measures a 4–5× prompt-throughput jump from exactly this). But it's also the feature whose bugs are the scariest: get the sharing wrong and one user's KV bleeds into another's generation; get eviction wrong and your "cache" silently stops hitting under load, which nobody notices until the GPU bill doubles.

The reason this lab drives KVCacheManager directly — no engine, no scheduler — is that sharing bugs hide in integration. When you call get_computed_blocks and allocate_slots with your own hands, every ref count is yours to predict before you assert it. (This lab's exact-sized-pool test would, in fact, have caught a real over-allocation bug in an earlier version of mini_vllm's allocate_slots — see the caller-contract comment in kv_cache.py. Accounting bugs in allocators don't crash; they quietly shrink your capacity. Tests that count blocks exactly are how you catch them.)

Background: one cache, zero dedicated memory

Recall the design from lab-01 (invariant I4): vLLM's prefix cache has no memory of its own. There is one pool of blocks. A block whose ref_cnt drops to 0 goes back to the free queue but keeps its content hash and stays in the cache index. From that moment it leads a double life:

  • if a new allocation pops it off the free queue first → evicted (hash dropped, contents about to be overwritten);
  • if a prefix hit finds it first → revived: touch() yanks it out of the middle of the free queue in O(1) and bumps its ref count. No KV is recomputed, no bytes move.

Which fate a block meets is decided purely by queue order — and the queue is ordered by KVCacheManager.free(), which returns each request's blocks in reverse table order. Tail blocks (deep, request-specific context) are enqueued first = evicted first; head blocks (the shared system-prompt territory) are enqueued last = survive longest. An entire cache-replacement policy, expressed as reversed(blocks). You'll prove it works with four asserts.

The other rule you'll meet: a hit can cover at most num_tokens − 1 tokens. The last position must always be recomputed, because what the engine needs from it is not its KV but its logits — the model's output at that position — and the cache stores only KV. Hence the slightly surprising cached == 28 (not 32) for a fully-duplicated 32-token prompt.

Files

  • starter.py — implement prefill (the scheduler's admission dance, five steps spelled out in the docstring) and ref_counts (a one-line probe). Your work.
  • solution.py — reference.
  • test_lab.py — the biography: cold cache, sharing, divergence, eviction order, revival, and the caching-off control.

Run

LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-05-share-and-evict -q
pytest phase-02-paged-attention/labs/lab-05-share-and-evict -q   # reference (default)

What to implement

prefill(kv, token_ids) reproduces, in miniature, what Scheduler.schedule does when it admits a WAITING request: consult get_computed_blocks → adopt the head start into num_computed_tokensallocate_slots with the hit blocks → mark the prefill done. The order matters and the docstring is explicit about why (allocation accounting trusts the counter). ref_counts is your microscope: the per-block reference counts that make sharing visible.

What the tests prove — a guided tour

Block size 4, prompt = 32 tokens = 8 full blocks. Read these as a story, in order:

  1. test_first_request_populates_cold_cache — request A: cached == 0, 8 fresh blocks, all ref_cnt == 1. The first requester always pays full price — remember this when a dashboard shows a hit rate below 100% on identical traffic; the denominator includes the pioneers (you'll see the same 87.5% effect in lab-03's capture).
  2. test_identical_prompt_shares_all_but_the_tail_block — request B, same 32 tokens: cached == 28. Seven blocks of B's table are the same physical ids as A's, now at ref_cnt == 2; the eighth is private ([2,2,2,2,2,2,2,1]). Two reasons the tail isn't shared, both worth internalizing: the hit cap (num_tokens − 1 — the logits rule above) and the safety rule that writes only ever target private blocks. And the bottom line of paging economics: serving B's prompt cost the pool one block instead of eight.
  3. test_diverging_prompt_shares_only_the_common_prefix — same first 16 tokens, then different: cached == 16. Matching is contiguous-from-the-start and stops at the first miss — that's the parent-chained hash doing its job. There is no "middle matching": KV at position i depends on everything before it, so a mid-sequence match would be semantically meaningless even if the hashes collided.
  4. test_free_order_evicts_tails_before_shared_prefix — the policy test, on a pool sized exactly (10 blocks: null + A's 8 + B's tail). Free A, free B: 9 blocks idle, all still cached. Demand 2 → the two private tails die. The head block of the shared prefix is the last cached block standing. Reverse-order free = LRU-flavored, prefix-preserving eviction, with zero policy code at eviction time.
  5. test_cached_free_blocks_are_revived_not_recomputed — free A entirely, then admit D with the same prompt: cached == 28 again, ref counts back to 1. Nobody held those blocks; the cache alone kept them meaningful, and touch() pulled them from the middle of the free queue. This is the O(1)-middle-removal payoff — and it's why a chatbot whose users go idle for a minute still gets cache hits when they return, as long as memory pressure hasn't claimed the blocks.
  6. test_caching_disabled_means_no_sharing — the control group. enable_caching=Falsecached == 0 always. When you benchmark caching (Phase 3 labs 03/06), this is the baseline arm.

Hitchhiker's notes

  • ref_cnt == 2 means the block is load-bearing for two conversations. Production incident shape: a bug decrements a shared block to 0 while a request still references it (violating I1), the block gets reallocated, and user A's chatbot continues user B's story. This class of bug is why the invariant tests in lab-01 exist, and why upstream reviews of kv-cache PRs are paranoid about every ref_cnt line.
  • Eviction here is LRU-ish, not LRU. True LRU would track per-block access times; the queue order approximates it (recently-freed = recently-used = enqueued later) and adds the prefix-aware twist (tails before heads within one request's free). Upstream additionally re-touches hit blocks, refreshing their position. Knowing exactly which policy you have matters when someone proposes "just make it LFU" — the current policy's cost is zero bookkeeping at eviction time, and any replacement must beat hit-rate × that-cost, not just hit-rate. (RadixAttention in SGLang is the structured alternative: a trie over prefixes with explicit LRU — same problem, different data structure.)
  • The num_tokens − 1 cap shows up everywhere. It's in get_computed_blocks (max_hit_tokens = request.num_tokens - 1), in the scheduler's "fully cached except the last token → schedule that 1 token" branch, and upstream as max_cache_hit_length. When you see a mysterious single-token prefill in a trace (Phase 3 lab-06 will show you one), this rule is why.
  • Hash chains make divergence detection O(1) per block — no token comparison happens at admission, only hash-map lookups. The cost was paid at caching time (hashing each full block once). Amortize-at-write, free-at-read is the right shape for a cache whose reads (every admission) vastly outnumber writes (each block cached once).

Going further

  • Add a test where three requests share a prefix and free in a scrambled order; predict the full free-queue order on paper first, then assert it via kv.block_pool.free_queue.get_all_free_blocks(). (This is harder than it looks — that's the point. The queue order is the eviction policy; you should be able to compute it.)
  • Implement cache_hit_rate(kv) — hits / lookups across get_computed_blocks calls — and recreate lab-03's Prefix cache hit rate: GPU: 87.5% line on the mini engine. Then go compare with Phase 3 lab-06, which measures the same thing through the full scheduler.
  • Read upstream/vllm/v1/core/kv_cache_manager.py::get_computed_blocks and find the two production wrinkles this lab omits: the request-level hash includes extras (LoRA id, multimodal hashes — anything that changes what KV means), and lookup latency is tracked for the metrics you saw in lab-03's logs.

References

  • mini_vllm/kv_cache.pyget_computed_blocks / allocate_slots / free, the three calls you choreographed (note the caller-contract comment in allocate_slots).
  • mini_vllm/block_pool.pytouch and _maybe_evict: the revival/eviction fork.
  • upstream/vllm/v1/core/kv_cache_manager.py:194,236 — the production admission dance.
  • vLLM docs, Automatic Prefix Caching — design doc for the hash-chain scheme: https://docs.vllm.ai/en/latest/design/prefix_caching.html
  • Zheng et al., SGLang: Efficient Execution of Structured Language Model Programs — RadixAttention, the trie-based alternative to hash-chain prefix caching; great contrast read: https://arxiv.org/abs/2312.07104
  • Kwon et al., PagedAttention (SOSP 2023), §4.4 — sharing & copy-on-write in the original design: https://arxiv.org/abs/2309.06180