Phase 02 — Mini-Build: the paged block allocator
You will build the four pieces from the deep-dive, on CPU, with numpy-or-nothing. The reference implementation already lives in the repo so you can check yourself:
mini_vllm/block_pool.py—KVCacheBlock,FreeKVCacheBlockQueue,BlockPool,hash_block_tokensmini_vllm/kv_cache.py—KVCacheManager
But the point is to write it yourself first. lab-01-block-allocator gives you a starter.py
with the method bodies stubbed out and a test_lab.py that pins every invariant. Make the
tests pass without peeking, then diff your file against solution.py (and against the real
mini_vllm/block_pool.py).
Contents
The build, in order
1. KVCacheBlock
A dataclass: block_id, ref_cnt=0, block_hash=None, is_null=False, and two link
pointers prev_free/next_free. Add reset_hash(). Don't put list logic here — the queue
owns the pointers (mirror the real ownership invariant).
2. FreeKVCacheBlockQueue
A doubly linked list with fake head/tail sentinels. Implement:
popleft()→ first block, O(1)remove(block)→ unlink from the middle, O(1) ← the reason this class existsappend(block)→ push to tailget_all_free_blocks()→ for tests Keep anum_free_blockscounter in sync. Test: removing a middle block keeps the rest ordered (test_free_queue_o1_middle_removal).
3. BlockPool
- Build
num_blocksblocks, wrap them in the queue, reserve block 0 asnull_block. get_new_blocks(n)— pop n,_maybe_evicteach, setref_cnt=1._maybe_evict(block)— if it has a hash and is the cached block for that hash, drop it from the index andreset_hash().touch(blocks)— if a block is free (ref_cnt==0),removeit from the queue, thenref_cnt += 1.free_blocks(blocks)—ref_cnt -= 1; any that hit 0 (and aren't null) go back on the queue.cache_full_blocks(blocks, hashes)— set hash + index it (skip null/already-hashed).get_cached_block(hash),get_num_free_blocks(),get_usage().
4. hash_block_tokens(parent_hash, token_ids)
return hash((parent_hash, tuple(token_ids))). The parent chaining is non-negotiable — it's
what makes it a prefix cache. Test: same tokens, different parent → different hash.
5. KVCacheManager (in mini_vllm/kv_cache.py)
get_computed_blocks(request)→ walk full-block hashes from the front, look each up, stop at first miss; cap usable hits at(num_tokens - 1) // block_size. Return(blocks, num_cached).allocate_slots(request, num_new_tokens, new_computed_blocks=None)→touchthe cached blocks, compute how many blocks the (computed+new) tokens need, returnNoneif not enough free, else allocate and cache newly-full blocks.free(request)→ free its blocks in reverse order.
Definition of done
pytest mini_vllm/test_block_pool.py -q # the reference suite
pytest phase-02-paged-attention/labs -q # your lab solution + the lab tests
Both green. Then, in your notebook, answer: Which line in your touch() is the O(1) middle
removal, and which real-world event triggers it? (Answer: pulling a prefix-cached block out of
the free list on a cache hit.)
Stretch (optional, sets up Phase 03)
Add a tiny copy-on-write to your pool: a fork_block(block) that, if ref_cnt > 1,
allocates a fresh block, (pretend-)copies contents, decrements the original, and returns the
new one. You won't wire it into the engine here, but it's the mechanism behind safe prefix
sharing when one sharer diverges — and a classic interview follow-up.