Lab 02-05 — Share and Evict: the Life of a Cached Block [CPU-OK]
Lab-01 built the allocator's mechanisms. This lab plays them like an instrument. You'll run two requests with identical prompts and watch their block tables converge onto the same physical blocks; free them and watch the blocks linger in the cache as eviction candidates; apply memory pressure and watch eviction consume them in exactly the order that preserves the most valuable prefix longest; and finally watch a "dead" request's blocks get revived from the middle of the free queue by a newcomer — the maneuver the whole hand-rolled linked list exists for.
This is the block's full biography: allocated → shared → orphaned-but-cached → revived (or evicted). After this lab, prefix caching is not a feature you enable; it's a state machine you can narrate.
Contents
- Why this lab exists
- Background: one cache, zero dedicated memory
- Files
- Run
- What to implement
- What the tests prove — a guided tour
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Prefix caching is the highest-leverage feature in modern LLM serving — every chatbot re-sends its system prompt and conversation history with every turn, and caching turns that repeated prefill into a hash lookup (Phase 3 lab-03 measures a 4–5× prompt-throughput jump from exactly this). But it's also the feature whose bugs are the scariest: get the sharing wrong and one user's KV bleeds into another's generation; get eviction wrong and your "cache" silently stops hitting under load, which nobody notices until the GPU bill doubles.
The reason this lab drives KVCacheManager directly — no engine, no scheduler — is that
sharing bugs hide in integration. When you call get_computed_blocks and allocate_slots
with your own hands, every ref count is yours to predict before you assert it. (This lab's
exact-sized-pool test would, in fact, have caught a real over-allocation bug in an earlier
version of mini_vllm's allocate_slots — see the caller-contract comment in
kv_cache.py. Accounting bugs in allocators don't crash; they quietly shrink your
capacity. Tests that count blocks exactly are how you catch them.)
Background: one cache, zero dedicated memory
Recall the design from lab-01 (invariant I4): vLLM's prefix cache has no memory of its
own. There is one pool of blocks. A block whose ref_cnt drops to 0 goes back to the free
queue but keeps its content hash and stays in the cache index. From that moment it leads a
double life:
- if a new allocation pops it off the free queue first → evicted (hash dropped, contents about to be overwritten);
- if a prefix hit finds it first → revived:
touch()yanks it out of the middle of the free queue in O(1) and bumps its ref count. No KV is recomputed, no bytes move.
Which fate a block meets is decided purely by queue order — and the queue is ordered by
KVCacheManager.free(), which returns each request's blocks in reverse table order.
Tail blocks (deep, request-specific context) are enqueued first = evicted first; head blocks
(the shared system-prompt territory) are enqueued last = survive longest. An entire
cache-replacement policy, expressed as reversed(blocks). You'll prove it works with four
asserts.
The other rule you'll meet: a hit can cover at most num_tokens − 1 tokens. The last
position must always be recomputed, because what the engine needs from it is not its KV
but its logits — the model's output at that position — and the cache stores only KV.
Hence the slightly surprising cached == 28 (not 32) for a fully-duplicated 32-token
prompt.
Files
starter.py— implementprefill(the scheduler's admission dance, five steps spelled out in the docstring) andref_counts(a one-line probe). Your work.solution.py— reference.test_lab.py— the biography: cold cache, sharing, divergence, eviction order, revival, and the caching-off control.
Run
LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-05-share-and-evict -q
pytest phase-02-paged-attention/labs/lab-05-share-and-evict -q # reference (default)
What to implement
prefill(kv, token_ids) reproduces, in miniature, what Scheduler.schedule does when it
admits a WAITING request: consult get_computed_blocks → adopt the head start into
num_computed_tokens → allocate_slots with the hit blocks → mark the prefill done. The
order matters and the docstring is explicit about why (allocation accounting trusts the
counter). ref_counts is your microscope: the per-block reference counts that make sharing
visible.
What the tests prove — a guided tour
Block size 4, prompt = 32 tokens = 8 full blocks. Read these as a story, in order:
test_first_request_populates_cold_cache— request A:cached == 0, 8 fresh blocks, allref_cnt == 1. The first requester always pays full price — remember this when a dashboard shows a hit rate below 100% on identical traffic; the denominator includes the pioneers (you'll see the same 87.5% effect in lab-03's capture).test_identical_prompt_shares_all_but_the_tail_block— request B, same 32 tokens:cached == 28. Seven blocks of B's table are the same physical ids as A's, now atref_cnt == 2; the eighth is private ([2,2,2,2,2,2,2,1]). Two reasons the tail isn't shared, both worth internalizing: the hit cap (num_tokens − 1— the logits rule above) and the safety rule that writes only ever target private blocks. And the bottom line of paging economics: serving B's prompt cost the pool one block instead of eight.test_diverging_prompt_shares_only_the_common_prefix— same first 16 tokens, then different:cached == 16. Matching is contiguous-from-the-start and stops at the first miss — that's the parent-chained hash doing its job. There is no "middle matching": KV at position i depends on everything before it, so a mid-sequence match would be semantically meaningless even if the hashes collided.test_free_order_evicts_tails_before_shared_prefix— the policy test, on a pool sized exactly (10 blocks: null + A's 8 + B's tail). Free A, free B: 9 blocks idle, all still cached. Demand 2 → the two private tails die. The head block of the shared prefix is the last cached block standing. Reverse-order free = LRU-flavored, prefix-preserving eviction, with zero policy code at eviction time.test_cached_free_blocks_are_revived_not_recomputed— free A entirely, then admit D with the same prompt:cached == 28again, ref counts back to 1. Nobody held those blocks; the cache alone kept them meaningful, andtouch()pulled them from the middle of the free queue. This is the O(1)-middle-removal payoff — and it's why a chatbot whose users go idle for a minute still gets cache hits when they return, as long as memory pressure hasn't claimed the blocks.test_caching_disabled_means_no_sharing— the control group.enable_caching=False→cached == 0always. When you benchmark caching (Phase 3 labs 03/06), this is the baseline arm.
Hitchhiker's notes
ref_cnt == 2means the block is load-bearing for two conversations. Production incident shape: a bug decrements a shared block to 0 while a request still references it (violating I1), the block gets reallocated, and user A's chatbot continues user B's story. This class of bug is why the invariant tests in lab-01 exist, and why upstream reviews of kv-cache PRs are paranoid about everyref_cntline.- Eviction here is LRU-ish, not LRU. True LRU would track per-block access times; the
queue order approximates it (recently-freed = recently-used = enqueued later) and adds
the prefix-aware twist (tails before heads within one request's free). Upstream
additionally re-
touches hit blocks, refreshing their position. Knowing exactly which policy you have matters when someone proposes "just make it LFU" — the current policy's cost is zero bookkeeping at eviction time, and any replacement must beat hit-rate × that-cost, not just hit-rate. (RadixAttention in SGLang is the structured alternative: a trie over prefixes with explicit LRU — same problem, different data structure.) - The
num_tokens − 1cap shows up everywhere. It's inget_computed_blocks(max_hit_tokens = request.num_tokens - 1), in the scheduler's "fully cached except the last token → schedule that 1 token" branch, and upstream asmax_cache_hit_length. When you see a mysterious single-token prefill in a trace (Phase 3 lab-06 will show you one), this rule is why. - Hash chains make divergence detection O(1) per block — no token comparison happens at admission, only hash-map lookups. The cost was paid at caching time (hashing each full block once). Amortize-at-write, free-at-read is the right shape for a cache whose reads (every admission) vastly outnumber writes (each block cached once).
Going further
- Add a test where three requests share a prefix and free in a scrambled order; predict
the full free-queue order on paper first, then assert it via
kv.block_pool.free_queue.get_all_free_blocks(). (This is harder than it looks — that's the point. The queue order is the eviction policy; you should be able to compute it.) - Implement
cache_hit_rate(kv)— hits / lookups acrossget_computed_blockscalls — and recreate lab-03'sPrefix cache hit rate: GPU: 87.5%line on the mini engine. Then go compare with Phase 3 lab-06, which measures the same thing through the full scheduler. - Read
upstream/vllm/v1/core/kv_cache_manager.py::get_computed_blocksand find the two production wrinkles this lab omits: the request-level hash includes extras (LoRA id, multimodal hashes — anything that changes what KV means), and lookup latency is tracked for the metrics you saw in lab-03's logs.
References
mini_vllm/kv_cache.py—get_computed_blocks/allocate_slots/free, the three calls you choreographed (note the caller-contract comment inallocate_slots).mini_vllm/block_pool.py—touchand_maybe_evict: the revival/eviction fork.upstream/vllm/v1/core/kv_cache_manager.py:194,236— the production admission dance.- vLLM docs, Automatic Prefix Caching — design doc for the hash-chain scheme: https://docs.vllm.ai/en/latest/design/prefix_caching.html
- Zheng et al., SGLang: Efficient Execution of Structured Language Model Programs — RadixAttention, the trie-based alternative to hash-chain prefix caching; great contrast read: https://arxiv.org/abs/2312.07104
- Kwon et al., PagedAttention (SOSP 2023), §4.4 — sharing & copy-on-write in the original design: https://arxiv.org/abs/2309.06180