Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 02 Labs — PagedAttention

Six labs that take you from "paging is an idea" to "I have built every layer of it." The arc: build the allocator (lab-01), measure why it wins (lab-02), see it manage real gigabytes (lab-03), then follow the data path — share & evict cached blocks (lab-05), gather through the table in numpy (lab-06), and finally write the GPU kernel (lab-04).

Recommended order: 01 → 02 → 05 → 06 → 03 → 04. (The directory numbers predate labs 05 and 06; the metadata labs first, then the data path, then the hardware.) CPU labs follow the standard contract — starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-02-paged-attention/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-01-block-allocator -q

Contents


Labs

lab-01-block-allocator [CPU-OK]

The lab of the phase. Implement the structure vLLM is famous for: KVCacheBlock, the doubly-linked free queue with O(1) middle removal, the BlockPool with refcounts and lazy eviction, and the parent-chained content hash. Four invariants (I1–I4), each one a class of production bug, each one pinned by a test. Skills: the allocator's constitution; why the free queue can't be a deque; hash chains = causality.

lab-02-fragmentation-viz [CPU-OK]

Simulate contiguous-max-reservation vs paged allocation on the same request stream and measure the difference: 8 admissions vs 64, 94% waste vs ~0, on identical memory. The PagedAttention headline result, re-derived by you in a four-line model. Skills: internal vs external fragmentation; first-principles capacity modeling; the block_size trade-off.

lab-03-real-vllm-blocks [GPU-OPT]

Run real vLLM and read its memory self-assessment: where # GPU blocks: 8788 comes from (profile → subtract → carve), why "Maximum concurrency: 68.65x" is the whole serving- capacity story, and how to sanity-check both from the model config on paper. Captured annotated run included for the GPU-less. Skills: capacity planning; gpu_memory_utilization / max_model_len as capacity knobs; reading the startup ritual.

lab-04-triton-paged-attn [GPU-REQ]

The payoff: write a Triton kernel that gathers K/V through a block table and computes decode attention with online softmax, then verify against a dense reference and compare with paged_attention_v1.cu. Do lab-06 first — it's the same algorithm without the GPU dialect. Skills: kernel-level indirection; online softmax; block_table (read) vs slot_mapping (write); fp32 accumulation.

lab-05-share-and-evict [CPU-OK]

The biography of a cached block: two identical prompts converge on the same physical blocks (ref_cnt == 2), freed blocks linger as eviction candidates, eviction consumes tails before shared prefixes (reverse-order free = the policy), and a newcomer revives "dead" blocks from the middle of the free queue. Includes the num_tokens − 1 hit cap and an exactly-sized pool that counts every block. Skills: prefix-cache state machine; eviction-as-queue-order; why the last token always recomputes.

lab-06-paged-attention-numpy [CPU-OK]

The data path, with no kernel noise: build slot_mapping, scatter K/V into a shuffled physical cache, gather back through the block table, and prove paged attention equals dense attention to 1e-12 — including a poisoned-tail test that makes masking bugs detonate. The CPU twin of lab-04. Skills: the slot formula; write-map vs read-map; testing masked computations by poisoning padding.

What you can do after this phase

Explain — with code you wrote and numbers you measured — why vLLM's memory manager admits ~10× more requests than reservation-based engines; predict a deployment's KV capacity from HBM size and model config; narrate the full life of a cached block from allocation through sharing to eviction or revival; and read block_pool.py, kv_cache_manager.py, and the paged-attention kernels upstream as a peer of their authors. Phase 3 now puts this allocator under a scheduler.