Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 02 — Exercises: PagedAttention

Escalating from "explain it" to "design it." Staff-level = you can do the last ones cold, and point to the exact upstream/ file that proves your answer.

Contents


Warm-up (explain)

  1. In one sentence each, define: block, block table, block pool, ref_cnt, null block.
  2. Why is per-request waste bounded by block_size − 1? Where does that one partial block come from?
  3. Why does get_computed_blocks cap hits at num_tokens − 1 and not num_tokens? (Hint: deep-dive §5.)

Core (trace the code)

  1. Trace BlockPool.touch([b]) when b.ref_cnt == 0 and b is cached. Which list operation runs, what is its complexity, and which real-world event caused this call?
  2. Trace get_new_blocks(2) when one popped block is a cached eviction candidate. Which method clears its hash, and why must that happen before ref_cnt is set?
  3. KVCacheManager.free frees blocks in reverse order. Construct a 2-request example where forward order would evict the shared prefix too early.

Build (extend your code)

  1. Add copy-on-write to your lab-01 pool: fork_block(b) that, when b.ref_cnt > 1, allocates a new block, decrements b, and returns the new one. Write a test: two requests share a block, one forks, the other's view is unchanged.
  2. Add a get_usage() sanity test: usage is 0.0 with only the null block used, and approaches 1.0 as you allocate. Why subtract 1 for the null block (block_pool.py:505)?
  3. Make your FreeKVCacheBlockQueue track eviction order: when freeing a request's blocks tail-first, assert the head (prefix) block ends up behind the tail block in the queue.

Design (staff-level)

  1. A customer serves one 4k-token system prompt to 1,000 users/min, each adding ~50 tokens. Estimate KV memory with and without prefix caching (pick a model from the guide). What's the multiplier prefix caching buys here, and why is it so large?
  2. Sketch how you'd add a second block size for a hybrid model (some layers attention, some Mamba). What breaks in a single-block_size design? (Peek: kv_cache_coordinator.py, resolve_kv_cache_block_sizes at kv_cache_utils.py:571.)
  3. The free queue is a hand-rolled linked list "to avoid allocating Python objects." Propose a benchmark that would prove a deque is slower here, and predict where the gap shows up.

Self-grading

For 4–6 and 10–12: could you whiteboard it in 5 minutes and name the file? If not, re-read the matching deep-dive section. Bring exercises 10–12 to the INTERVIEW.md drills.