Phase 02 — Cheatsheet: PagedAttention
Contents
- The one-liner
- Data structures
- The four invariants
- Key methods → what they do
- Hashing
- Numbers to know
- Gotchas
The one-liner
KV cache → fixed-size blocks (like OS pages) + per-request block table → no
fragmentation, plus free sharing (prefix caching, CoW). Waste ≤ block_size − 1 per request.
Data structures
| Thing | Job | Upstream |
|---|---|---|
KVCacheBlock | per-block metadata (id, ref_cnt, hash, free links) | kv_cache_utils.py:116 |
FreeKVCacheBlockQueue | free list, eviction order, O(1) middle removal, zero-alloc | kv_cache_utils.py:164 |
BlockPool | owns blocks + free list + prefix-cache index | block_pool.py:130 |
KVCacheManager | per-request block tables; scheduler API | kv_cache_manager.py:110 |
The four invariants
- in free queue ⟺
ref_cnt == 0(not null) - block ids are append-only / stable (so: no dedup)
- only full blocks get hashed + cached
- cached ≠ unusable —
touchrevives a free cached block (O(1) middle removal)
Key methods → what they do
get_new_blocks(n)— popleft n,_maybe_evict, ref=1touch(blocks)— re-ref a shared block; remove from free queue if it was freefree_blocks(blocks)— deref; ref→0 returns to queue (stays cached)cache_full_blocks(...)— hash + index newly-full blocksget_computed_blocks(req)— prefix-cache lookup, hits capped atnum_tokens − 1allocate_slots(...)— extend a request; returnsNoneon OOM → scheduler preemptsfree(req)— frees reverse order (prefix survives longest)
Hashing
hash_block_tokens(parent_hash, tokens, extra_keys) — parent-chained ⇒ prefix cache.
extra_keys = LoRA id + multimodal + cache_salt (no cross-context collisions).
Numbers to know
- KV bytes/token ≈
2 × layers × kv_heads × head_dim × dtype_bytes - num GPU blocks ≈ (HBM × gpu_memory_utilization − weights) ÷ per-block bytes
- block_size default ≈ 16
Gotchas
- Returning
Nonefromallocate_slotsis normal — it drives preemption, not an error. - The null block (id 0) is reserved; never cache it; subtract it from usage math.
- Line numbers valid only at
v0.22.1 @ 0decac0; search the named symbol otherwise.
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md