Phase 02 Labs — PagedAttention
Six labs that take you from "paging is an idea" to "I have built every layer of it." The arc: build the allocator (lab-01), measure why it wins (lab-02), see it manage real gigabytes (lab-03), then follow the data path — share & evict cached blocks (lab-05), gather through the table in numpy (lab-06), and finally write the GPU kernel (lab-04).
Recommended order: 01 → 02 → 05 → 06 → 03 → 04. (The directory numbers predate labs 05
and 06; the metadata labs first, then the data path, then the hardware.) CPU labs follow the
standard contract — starter.py (your work), solution.py (reference), test_lab.py (the
spec); default runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-02-paged-attention/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-01-block-allocator -q
Contents
- lab-01-block-allocator
[CPU-OK] - lab-02-fragmentation-viz
[CPU-OK] - lab-03-real-vllm-blocks
[GPU-OPT] - lab-04-triton-paged-attn
[GPU-REQ] - lab-05-share-and-evict
[CPU-OK] - lab-06-paged-attention-numpy
[CPU-OK] - What you can do after this phase
Labs
lab-01-block-allocator [CPU-OK]
The lab of the phase. Implement the structure vLLM is famous for: KVCacheBlock, the
doubly-linked free queue with O(1) middle removal, the BlockPool with refcounts and lazy
eviction, and the parent-chained content hash. Four invariants (I1–I4), each one a class of
production bug, each one pinned by a test. Skills: the allocator's constitution; why the
free queue can't be a deque; hash chains = causality.
lab-02-fragmentation-viz [CPU-OK]
Simulate contiguous-max-reservation vs paged allocation on the same request stream and
measure the difference: 8 admissions vs 64, 94% waste vs ~0, on identical memory. The
PagedAttention headline result, re-derived by you in a four-line model. Skills: internal
vs external fragmentation; first-principles capacity modeling; the block_size trade-off.
lab-03-real-vllm-blocks [GPU-OPT]
Run real vLLM and read its memory self-assessment: where # GPU blocks: 8788 comes from
(profile → subtract → carve), why "Maximum concurrency: 68.65x" is the whole serving-
capacity story, and how to sanity-check both from the model config on paper. Captured
annotated run included for the GPU-less. Skills: capacity planning;
gpu_memory_utilization / max_model_len as capacity knobs; reading the startup ritual.
lab-04-triton-paged-attn [GPU-REQ]
The payoff: write a Triton kernel that gathers K/V through a block table and computes
decode attention with online softmax, then verify against a dense reference and compare
with paged_attention_v1.cu. Do lab-06 first — it's the same algorithm without the GPU
dialect. Skills: kernel-level indirection; online softmax; block_table (read) vs
slot_mapping (write); fp32 accumulation.
lab-05-share-and-evict [CPU-OK]
The biography of a cached block: two identical prompts converge on the same physical blocks
(ref_cnt == 2), freed blocks linger as eviction candidates, eviction consumes tails before
shared prefixes (reverse-order free = the policy), and a newcomer revives "dead" blocks from
the middle of the free queue. Includes the num_tokens − 1 hit cap and an exactly-sized
pool that counts every block. Skills: prefix-cache state machine; eviction-as-queue-order;
why the last token always recomputes.
lab-06-paged-attention-numpy [CPU-OK]
The data path, with no kernel noise: build slot_mapping, scatter K/V into a shuffled
physical cache, gather back through the block table, and prove paged attention equals dense
attention to 1e-12 — including a poisoned-tail test that makes masking bugs detonate.
The CPU twin of lab-04. Skills: the slot formula; write-map vs read-map; testing masked
computations by poisoning padding.
What you can do after this phase
Explain — with code you wrote and numbers you measured — why vLLM's memory manager admits
~10× more requests than reservation-based engines; predict a deployment's KV capacity from
HBM size and model config; narrate the full life of a cached block from allocation through
sharing to eviction or revival; and read block_pool.py, kv_cache_manager.py, and the
paged-attention kernels upstream as a peer of their authors. Phase 3 now puts this allocator
under a scheduler.