Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 02 — Cheatsheet: PagedAttention

Contents


The one-liner

KV cache → fixed-size blocks (like OS pages) + per-request block table → no fragmentation, plus free sharing (prefix caching, CoW). Waste ≤ block_size − 1 per request.

Data structures

ThingJobUpstream
KVCacheBlockper-block metadata (id, ref_cnt, hash, free links)kv_cache_utils.py:116
FreeKVCacheBlockQueuefree list, eviction order, O(1) middle removal, zero-allockv_cache_utils.py:164
BlockPoolowns blocks + free list + prefix-cache indexblock_pool.py:130
KVCacheManagerper-request block tables; scheduler APIkv_cache_manager.py:110

The four invariants

  1. in free queue ⟺ ref_cnt == 0 (not null)
  2. block ids are append-only / stable (so: no dedup)
  3. only full blocks get hashed + cached
  4. cached ≠ unusable — touch revives a free cached block (O(1) middle removal)

Key methods → what they do

  • get_new_blocks(n) — popleft n, _maybe_evict, ref=1
  • touch(blocks) — re-ref a shared block; remove from free queue if it was free
  • free_blocks(blocks) — deref; ref→0 returns to queue (stays cached)
  • cache_full_blocks(...) — hash + index newly-full blocks
  • get_computed_blocks(req) — prefix-cache lookup, hits capped at num_tokens − 1
  • allocate_slots(...) — extend a request; returns None on OOM → scheduler preempts
  • free(req) — frees reverse order (prefix survives longest)

Hashing

hash_block_tokens(parent_hash, tokens, extra_keys)parent-chained ⇒ prefix cache. extra_keys = LoRA id + multimodal + cache_salt (no cross-context collisions).

Numbers to know

  • KV bytes/token ≈ 2 × layers × kv_heads × head_dim × dtype_bytes
  • num GPU blocks ≈ (HBM × gpu_memory_utilization − weights) ÷ per-block bytes
  • block_size default ≈ 16

Gotchas

  • Returning None from allocate_slots is normal — it drives preemption, not an error.
  • The null block (id 0) is reserved; never cache it; subtract it from usage math.
  • Line numbers valid only at v0.22.1 @ 0decac0; search the named symbol otherwise.

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md