Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 02 — Interview Questions: PagedAttention

This is the topic to own in any LLM-inference interview — it's vLLM's headline idea and a favorite question. Cover each answer, attempt it out loud, then compare. Depth here is the bar for a topic you claim as your specialty.

Q1. What problem does PagedAttention solve, and how?

Model answer

The KV cache is the dominant GPU-memory consumer during serving, and pre-vLLM systems reserved a contiguous per-request buffer sized for the maximum sequence length. That caused massive internal fragmentation (reserve 2048, use 30) and external fragmentation (free memory broken into runs too small for the next contiguous request) — wasting 60–80% of KV memory.

PagedAttention borrows OS virtual memory: split the KV cache into fixed-size blocks (e.g. 16 tokens), keep a global pool, and allocate blocks on demand to each request, tracked by a per-request block table mapping logical→physical block. Blocks can be anywhere, so fragmentation drops to at most block_size − 1 tokens per request. The attention kernel reads the block table to gather scattered KV. Net result: several times more concurrent sequences per GPU.

Q2. Walk me through the data structures. (Whiteboard them.)

Model answer
  • KVCacheBlock: metadata for one physical block — block_id, ref_cnt, block_hash, free-list pointers. (kv_cache_utils.py:116)
  • FreeKVCacheBlockQueue: a doubly linked list of free blocks in eviction order, with O(1) middle removal and zero per-op allocation. (kv_cache_utils.py:164)
  • BlockPool: owns all blocks, the free queue, and cached_block_hash_to_block (the prefix-cache index). Methods: get_new_blocks, touch, free_blocks, cache_full_blocks. (block_pool.py:130)
  • KVCacheManager: per-request block tables; the scheduler-facing API (get_computed_blocks, allocate_slots, free). (kv_cache_manager.py:110)

The four invariants: free queue ⟺ ref_cnt==0; block ids stable (no dedup); only full blocks hashed; cached ≠ unusable.

Q3. Why a custom linked list instead of collections.deque for the free list?

Model answer

Two reasons, both hot-path. (1) On a prefix-cache hit, a block that was a free eviction candidate must be pulled out of the middle of the free list and revived — that's O(1) in a doubly linked list but O(n) in a deque. The revival happens in BlockPool.touch (block_pool.py:402). (2) The list reuses prev/next fields stored on the blocks themselves, so manipulating it allocates no Python objects — no GC pressure in the scheduler loop that runs every token step. The upstream docstring at kv_cache_utils.py:164 states exactly this.

Q4. How does prefix caching work on top of paging, and what makes it a prefix cache?

Model answer

Each full block gets a content hash that chains the parent block's hash (hash_block_tokens, kv_cache_utils.py:541). Chaining means a hit on block k guarantees blocks 0..k were identical — so it's a true prefix, not just matching content. Hashes index into cached_block_hash_to_block. A new request computes its block hashes; the manager walks them from the front, and for each hit it touches the cached block (sharing it via ref_cnt) instead of recomputing. extra_keys (LoRA id, multimodal content, cache salt) are folded in to prevent unsafe cross-context collisions.

Q5. What happens when there aren't enough free blocks to extend a running request?

Model answer

KVCacheManager.allocate_slots returns None (kv_cache_manager.py:387). That signals OOM to the scheduler, which preempts the lowest-priority running request — frees its KV blocks and moves it back to the waiting queue to be recomputed (or, in some designs, swapped) later — then retries the allocation. This handshake (None → preempt → retry) is the seam between memory management (Phase 2) and scheduling (Phase 3). It's the safety valve that lets vLLM admit aggressively without crashing on memory.

Q6. Copy-on-write — when and why?

Model answer

When two requests share a block (e.g. a common prompt) and one of them needs to write new tokens into a position within that shared block, you can't mutate it in place without corrupting the other sharer. So you copy the block (allocate a fresh one, copy contents), point the writer at the copy, and decrement the original's ref_cnt. It's the same CoW as in OS fork(). In practice vLLM shares at block granularity and divergence usually starts a new block, so true intra-block CoW is rare, but the mechanism guarantees correctness when sharers diverge.

Q7. (Deep) Why are blocks freed in reverse order, and why doesn't the cache de-duplicate?

Model answer

Reverse free (kv_cache_manager.py:431): freeing tail blocks first puts them ahead of the head (prefix) blocks in the eviction queue, so the shared prefix survives longest for future requests — maximizing prefix-cache hit rate.

No dedup (block_pool.py:48): if the cache de-duplicated identical blocks it might have to remap an already-allocated block_id, but block tables are append-only (block_id must be stable once allocated) so the engine never has to rewrite a request's table. The cost is occasionally storing two blocks with the same content; the benefit is a simpler, race-free invariant. That tradeoff is exactly the kind of judgment call maintainers make.

Rapid-fire

  • Block size is typically? 16. Tradeoff of larger? Less metadata/overhead, more tail waste and coarser sharing.
  • What's the null block for? A placeholder for skipped positions (e.g. outside a sliding window); never cached.
  • Where does the block table actually get used? Passed into the attention kernel (csrc/attention/paged_attention_v1.cu), which dereferences it per token.
  • What sets the number of GPU blocks? Leftover HBM after weights ÷ per-block bytes (kv_cache_utils.get_kv_cache_configs), scaled by gpu_memory_utilization.