Phase 02 — Cheatsheet: PagedAttention

The one-liner
Data structures
The four invariants
Key methods → what they do
Hashing
Numbers to know
Gotchas

The one-liner

KV cache → fixed-size blocks (like OS pages) + per-request block table → no fragmentation, plus free sharing (prefix caching, CoW). Waste ≤ block_size − 1 per request.

Data structures

Thing	Job	Upstream
`KVCacheBlock`	per-block metadata (id, ref_cnt, hash, free links)	`kv_cache_utils.py:116`
`FreeKVCacheBlockQueue`	free list, eviction order, O(1) middle removal, zero-alloc	`kv_cache_utils.py:164`
`BlockPool`	owns blocks + free list + prefix-cache index	`block_pool.py:130`
`KVCacheManager`	per-request block tables; scheduler API	`kv_cache_manager.py:110`

The four invariants

in free queue ⟺ ref_cnt == 0 (not null)
block ids are append-only / stable (so: no dedup)
only full blocks get hashed + cached
cached ≠ unusable — touch revives a free cached block (O(1) middle removal)

Key methods → what they do

get_new_blocks(n) — popleft n, _maybe_evict, ref=1
touch(blocks) — re-ref a shared block; remove from free queue if it was free
free_blocks(blocks) — deref; ref→0 returns to queue (stays cached)
cache_full_blocks(...) — hash + index newly-full blocks
get_computed_blocks(req) — prefix-cache lookup, hits capped at num_tokens − 1
allocate_slots(...) — extend a request; returns None on OOM → scheduler preempts
free(req) — frees reverse order (prefix survives longest)

Hashing

hash_block_tokens(parent_hash, tokens, extra_keys) — parent-chained ⇒ prefix cache. extra_keys = LoRA id + multimodal + cache_salt (no cross-context collisions).

Numbers to know

KV bytes/token ≈ 2 × layers × kv_heads × head_dim × dtype_bytes
num GPU blocks ≈ (HBM × gpu_memory_utilization − weights) ÷ per-block bytes
block_size default ≈ 16

Gotchas

Returning None from allocate_slots is normal — it drives preemption, not an error.
The null block (id 0) is reserved; never cache it; subtract it from usage math.
Line numbers valid only at v0.22.1 @ 0decac0; search the named symbol otherwise.

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

vLLM Mastery — From Zero to Maintainer