Phase 02 — Interview Questions: PagedAttention
This is the topic to own in any LLM-inference interview — it's vLLM's headline idea and a favorite question. Cover each answer, attempt it out loud, then compare. Depth here is the bar for a topic you claim as your specialty.
Q1. What problem does PagedAttention solve, and how?
Model answer
The KV cache is the dominant GPU-memory consumer during serving, and pre-vLLM systems reserved a contiguous per-request buffer sized for the maximum sequence length. That caused massive internal fragmentation (reserve 2048, use 30) and external fragmentation (free memory broken into runs too small for the next contiguous request) — wasting 60–80% of KV memory.
PagedAttention borrows OS virtual memory: split the KV cache into fixed-size blocks (e.g.
16 tokens), keep a global pool, and allocate blocks on demand to each request, tracked by a
per-request block table mapping logical→physical block. Blocks can be anywhere, so
fragmentation drops to at most block_size − 1 tokens per request. The attention kernel reads
the block table to gather scattered KV. Net result: several times more concurrent sequences per
GPU.
Q2. Walk me through the data structures. (Whiteboard them.)
Model answer
KVCacheBlock: metadata for one physical block —block_id,ref_cnt,block_hash, free-list pointers. (kv_cache_utils.py:116)FreeKVCacheBlockQueue: a doubly linked list of free blocks in eviction order, with O(1) middle removal and zero per-op allocation. (kv_cache_utils.py:164)BlockPool: owns all blocks, the free queue, andcached_block_hash_to_block(the prefix-cache index). Methods:get_new_blocks,touch,free_blocks,cache_full_blocks. (block_pool.py:130)KVCacheManager: per-request block tables; the scheduler-facing API (get_computed_blocks,allocate_slots,free). (kv_cache_manager.py:110)
The four invariants: free queue ⟺ ref_cnt==0; block ids stable (no dedup); only full blocks
hashed; cached ≠ unusable.
Q3. Why a custom linked list instead of collections.deque for the free list?
Model answer
Two reasons, both hot-path. (1) On a prefix-cache hit, a block that was a free eviction
candidate must be pulled out of the middle of the free list and revived — that's O(1) in a
doubly linked list but O(n) in a deque. The revival happens in BlockPool.touch
(block_pool.py:402). (2) The list reuses prev/next fields stored on the blocks
themselves, so manipulating it allocates no Python objects — no GC pressure in the
scheduler loop that runs every token step. The upstream docstring at kv_cache_utils.py:164
states exactly this.
Q4. How does prefix caching work on top of paging, and what makes it a prefix cache?
Model answer
Each full block gets a content hash that chains the parent block's hash
(hash_block_tokens, kv_cache_utils.py:541). Chaining means a hit on block k guarantees
blocks 0..k were identical — so it's a true prefix, not just matching content. Hashes index
into cached_block_hash_to_block. A new request computes its block hashes; the manager walks
them from the front, and for each hit it touches the cached block (sharing it via ref_cnt)
instead of recomputing. extra_keys (LoRA id, multimodal content, cache salt) are folded in to
prevent unsafe cross-context collisions.
Q5. What happens when there aren't enough free blocks to extend a running request?
Model answer
KVCacheManager.allocate_slots returns None (kv_cache_manager.py:387). That signals
OOM to the scheduler, which preempts the lowest-priority running request — frees its KV
blocks and moves it back to the waiting queue to be recomputed (or, in some designs, swapped)
later — then retries the allocation. This handshake (None → preempt → retry) is the seam
between memory management (Phase 2) and scheduling (Phase 3). It's the safety valve that lets
vLLM admit aggressively without crashing on memory.
Q6. Copy-on-write — when and why?
Model answer
When two requests share a block (e.g. a common prompt) and one of them needs to write new
tokens into a position within that shared block, you can't mutate it in place without corrupting
the other sharer. So you copy the block (allocate a fresh one, copy contents), point the writer
at the copy, and decrement the original's ref_cnt. It's the same CoW as in OS fork(). In
practice vLLM shares at block granularity and divergence usually starts a new block, so true
intra-block CoW is rare, but the mechanism guarantees correctness when sharers diverge.
Q7. (Deep) Why are blocks freed in reverse order, and why doesn't the cache de-duplicate?
Model answer
Reverse free (kv_cache_manager.py:431): freeing tail blocks first puts them ahead of the
head (prefix) blocks in the eviction queue, so the shared prefix survives longest for future
requests — maximizing prefix-cache hit rate.
No dedup (block_pool.py:48): if the cache de-duplicated identical blocks it might have to
remap an already-allocated block_id, but block tables are append-only (block_id must be
stable once allocated) so the engine never has to rewrite a request's table. The cost is
occasionally storing two blocks with the same content; the benefit is a simpler, race-free
invariant. That tradeoff is exactly the kind of judgment call maintainers make.
Rapid-fire
- Block size is typically? 16. Tradeoff of larger? Less metadata/overhead, more tail waste and coarser sharing.
- What's the null block for? A placeholder for skipped positions (e.g. outside a sliding window); never cached.
- Where does the block table actually get used? Passed into the attention kernel
(
csrc/attention/paged_attention_v1.cu), which dereferences it per token. - What sets the number of GPU blocks? Leftover HBM after weights ÷ per-block bytes
(
kv_cache_utils.get_kv_cache_configs), scaled bygpu_memory_utilization.