Phase 02 — Exercises: PagedAttention
Escalating from "explain it" to "design it." Staff-level = you can do the last ones cold, and
point to the exact upstream/ file that proves your answer.
Contents
Warm-up (explain)
- In one sentence each, define: block, block table, block pool,
ref_cnt, null block. - Why is per-request waste bounded by
block_size − 1? Where does that one partial block come from? - Why does
get_computed_blockscap hits atnum_tokens − 1and notnum_tokens? (Hint: deep-dive §5.)
Core (trace the code)
- Trace
BlockPool.touch([b])whenb.ref_cnt == 0andbis cached. Which list operation runs, what is its complexity, and which real-world event caused this call? - Trace
get_new_blocks(2)when one popped block is a cached eviction candidate. Which method clears its hash, and why must that happen beforeref_cntis set? KVCacheManager.freefrees blocks in reverse order. Construct a 2-request example where forward order would evict the shared prefix too early.
Build (extend your code)
- Add copy-on-write to your
lab-01pool:fork_block(b)that, whenb.ref_cnt > 1, allocates a new block, decrementsb, and returns the new one. Write a test: two requests share a block, one forks, the other's view is unchanged. - Add a
get_usage()sanity test: usage is 0.0 with only the null block used, and approaches 1.0 as you allocate. Why subtract 1 for the null block (block_pool.py:505)? - Make your
FreeKVCacheBlockQueuetrack eviction order: when freeing a request's blocks tail-first, assert the head (prefix) block ends up behind the tail block in the queue.
Design (staff-level)
- A customer serves one 4k-token system prompt to 1,000 users/min, each adding ~50 tokens. Estimate KV memory with and without prefix caching (pick a model from the guide). What's the multiplier prefix caching buys here, and why is it so large?
- Sketch how you'd add a second block size for a hybrid model (some layers attention, some
Mamba). What breaks in a single-
block_sizedesign? (Peek:kv_cache_coordinator.py,resolve_kv_cache_block_sizesatkv_cache_utils.py:571.) - The free queue is a hand-rolled linked list "to avoid allocating Python objects." Propose a
benchmark that would prove a
dequeis slower here, and predict where the gap shows up.
Self-grading
For 4–6 and 10–12: could you whiteboard it in 5 minutes and name the file? If not, re-read the matching deep-dive section. Bring exercises 10–12 to the INTERVIEW.md drills.