Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 02 — Mini-Build: the paged block allocator

You will build the four pieces from the deep-dive, on CPU, with numpy-or-nothing. The reference implementation already lives in the repo so you can check yourself:

  • mini_vllm/block_pool.pyKVCacheBlock, FreeKVCacheBlockQueue, BlockPool, hash_block_tokens
  • mini_vllm/kv_cache.pyKVCacheManager

But the point is to write it yourself first. lab-01-block-allocator gives you a starter.py with the method bodies stubbed out and a test_lab.py that pins every invariant. Make the tests pass without peeking, then diff your file against solution.py (and against the real mini_vllm/block_pool.py).

Contents


The build, in order

1. KVCacheBlock

A dataclass: block_id, ref_cnt=0, block_hash=None, is_null=False, and two link pointers prev_free/next_free. Add reset_hash(). Don't put list logic here — the queue owns the pointers (mirror the real ownership invariant).

2. FreeKVCacheBlockQueue

A doubly linked list with fake head/tail sentinels. Implement:

  • popleft() → first block, O(1)
  • remove(block) → unlink from the middle, O(1) ← the reason this class exists
  • append(block) → push to tail
  • get_all_free_blocks() → for tests Keep a num_free_blocks counter in sync. Test: removing a middle block keeps the rest ordered (test_free_queue_o1_middle_removal).

3. BlockPool

  • Build num_blocks blocks, wrap them in the queue, reserve block 0 as null_block.
  • get_new_blocks(n) — pop n, _maybe_evict each, set ref_cnt=1.
  • _maybe_evict(block) — if it has a hash and is the cached block for that hash, drop it from the index and reset_hash().
  • touch(blocks) — if a block is free (ref_cnt==0), remove it from the queue, then ref_cnt += 1.
  • free_blocks(blocks)ref_cnt -= 1; any that hit 0 (and aren't null) go back on the queue.
  • cache_full_blocks(blocks, hashes) — set hash + index it (skip null/already-hashed).
  • get_cached_block(hash), get_num_free_blocks(), get_usage().

4. hash_block_tokens(parent_hash, token_ids)

return hash((parent_hash, tuple(token_ids))). The parent chaining is non-negotiable — it's what makes it a prefix cache. Test: same tokens, different parent → different hash.

5. KVCacheManager (in mini_vllm/kv_cache.py)

  • get_computed_blocks(request) → walk full-block hashes from the front, look each up, stop at first miss; cap usable hits at (num_tokens - 1) // block_size. Return (blocks, num_cached).
  • allocate_slots(request, num_new_tokens, new_computed_blocks=None)touch the cached blocks, compute how many blocks the (computed+new) tokens need, return None if not enough free, else allocate and cache newly-full blocks.
  • free(request) → free its blocks in reverse order.

Definition of done

pytest mini_vllm/test_block_pool.py -q     # the reference suite
pytest phase-02-paged-attention/labs -q    # your lab solution + the lab tests

Both green. Then, in your notebook, answer: Which line in your touch() is the O(1) middle removal, and which real-world event triggers it? (Answer: pulling a prefix-cached block out of the free list on a cache hit.)

Stretch (optional, sets up Phase 03)

Add a tiny copy-on-write to your pool: a fork_block(block) that, if ref_cnt > 1, allocates a fresh block, (pretend-)copies contents, decrements the original, and returns the new one. You won't wire it into the engine here, but it's the mechanism behind safe prefix sharing when one sharer diverges — and a classic interview follow-up.