Phase 04 — Interview Questions: Attention Backends
Q1. Why does vLLM have a pluggable attention-backend system?
Model answer
Attention is one math op, but the fastest (or only available) kernel depends on hardware,
dtype, head size, and model features (MLA, sliding window). A pluggable system lets vLLM pick the
best kernel per setup (FlashAttention, FlashInfer, Triton, FlashMLA, TRTLLM-GEN) and adopt new
ones without touching model code — the model talks only to the Attention layer
(attention.py:177), which delegates to the chosen AttentionImpl.
Q2. What does a paged attention kernel need that a dense one doesn't?
Model answer
The block table (logical→physical block, to gather scattered prior KV), the slot mapping
(where to write this step's new K/V), and per-request sequence lengths (for varlen batching).
These are built each step by the AttentionMetadataBuilder (flash_attn.py:276) from the
scheduler's output — the bridge from Phases 2/3 to the kernel.
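To make the two maps concrete, here is a minimal sketch of a paged KV cache as a flat Python list. The block size, the cache size, and the helper names `slot_for` and `gather_kv` are assumptions made for the example; a real kernel does the same index arithmetic on GPU tensors.

```python
BLOCK_SIZE = 4  # tokens per physical block (illustrative value)


def slot_for(block_table, position, block_size=BLOCK_SIZE):
    """Translate a logical token position into a physical cache slot."""
    physical_block = block_table[position // block_size]
    return physical_block * block_size + position % block_size


def gather_kv(kv_cache, block_table, seq_len, block_size=BLOCK_SIZE):
    """Read path: collect a request's KV entries in logical order."""
    return [kv_cache[slot_for(block_table, pos, block_size)]
            for pos in range(seq_len)]


# A cache with 8 physical blocks; this request owns blocks 5 and 2,
# in that logical order, so its KV is scattered in physical memory.
kv_cache = [None] * (8 * BLOCK_SIZE)
block_table = [5, 2]

# Six tokens were cached by earlier steps.
for pos in range(6):
    kv_cache[slot_for(block_table, pos)] = f"kv{pos}"

# Write path: this step's new token (position 6) gets a slot mapping entry.
slot_mapping = [slot_for(block_table, 6)]  # block 2, offset 2 -> slot 10
kv_cache[slot_mapping[0]] = "kv6"

# The sequence length tells the kernel how many positions to read.
gathered = gather_kv(kv_cache, block_table, seq_len=7)
```

The block table is consulted on every read of prior KV, while the slot mapping is only needed for the tokens being written in the current step.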
Q3. Explain online softmax and why FlashAttention uses it.
Model answer
Naive attention materializes the full N×N score matrix, which costs O(N²) memory. Online softmax streams K/V in tiles, keeping a running max, a running denominator, and an accumulator that is rescaled whenever the max changes. The result is exact softmax-weighted attention in O(N) memory, with each tile's working set small enough to stay in fast on-chip SRAM. That's the "Flash" in FlashAttention, and it's what makes long-context attention feasible. (You implement it in lab-01.)
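The running-max bookkeeping is easier to follow in code. Below is a minimal single-query sketch in pure Python, assuming a small tile size and list-based vectors; `naive_attention` and `online_attention` are names made up for this example, and a real kernel applies the same update to tiles of queries in on-chip memory.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values, scale):
    """Reference: compute every score first, then softmax, then the sum."""
    scores = [scale * dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def online_attention(q, keys, values, scale, tile_size=2):
    """Same result, but only one tile of scores exists at any time."""
    running_max = -math.inf
    denom = 0.0                      # running softmax denominator
    acc = [0.0] * len(values[0])     # running weighted sum of values
    for start in range(0, len(keys), tile_size):
        tile_k = keys[start:start + tile_size]
        tile_v = values[start:start + tile_size]
        scores = [scale * dot(q, k) for k in tile_k]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max.
        # On the first tile this is exp(-inf) == 0.0, which is harmless
        # because denom and acc are still zero.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, tile_v):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
scale = len(q) ** -0.5

expected = naive_attention(q, keys, values, scale)
streamed = online_attention(q, keys, values, scale, tile_size=2)
```

`streamed` matches `expected` up to floating-point rounding for any tile size, which is the sense in which online softmax is exact rather than an approximation.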
Q4. How and when is the backend chosen?
Model answer
At startup, by get_attn_backend (selector.py:52), from the platform default
(platforms/cuda.py etc.), dtype, head size, and model features, with VLLM_ATTENTION_BACKEND
as an override. It's fixed for the run because CUDA-graph capture and the metadata builder depend
on it (Phase 5). MLA models force an MLA backend due to their different KV layout.
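The selection logic can be pictured as a small decision function that runs once. The sketch below is a deliberate simplification: `choose_backend` and the backend name strings are hypothetical, the real selector also weighs dtype and head size, and the precedence shown here (override first, then MLA, then platform default) is an assumption for illustration rather than a statement of vLLM's exact rules. Only the environment variable name comes from the answer above.

```python
import os


def choose_backend(platform_default, use_mla, env=None):
    """Pick a backend name once, at startup (toy version)."""
    env = os.environ if env is None else env
    override = env.get("VLLM_ATTENTION_BACKEND")
    if override:
        # An explicit user choice wins in this sketch.
        return override
    if use_mla:
        # MLA models need a backend that understands their KV layout.
        return "toy-mla-backend"
    return platform_default


# The result is computed once and then reused for the whole run.
default_choice = choose_backend("toy-default-backend", use_mla=False, env={})
mla_choice = choose_backend("toy-default-backend", use_mla=True, env={})
forced_choice = choose_backend(
    "toy-default-backend",
    use_mla=False,
    env={"VLLM_ATTENTION_BACKEND": "toy-override-backend"},
)
```

Passing `env` explicitly keeps the function testable without touching the real process environment.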
Q5. What is MLA and why does it need its own backend?
Model answer
Multi-head Latent Attention (DeepSeek) compresses K/V into a shared low-rank latent vector instead of storing full per-head K/V, which substantially shrinks the KV cache. Because the cached representation and the attention math differ, it needs a dedicated kernel/backend (FlashMLA) and a different KV cache layout. It is an example of the model's attention design dictating the kernel.
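A rough back-of-the-envelope comparison shows where the savings come from. The sizes below are illustrative and not taken from any particular model, the helper names are made up, and the latent count ignores any extra per-token components a real design may also cache.

```python
def kv_elements_per_token_full(num_kv_heads, head_dim):
    """Standard cache: one K and one V vector per head, per layer."""
    return 2 * num_kv_heads * head_dim


def kv_elements_per_token_latent(latent_dim):
    """Latent cache: one shared compressed vector per token, per layer."""
    return latent_dim


# Illustrative sizes only.
full = kv_elements_per_token_full(num_kv_heads=32, head_dim=128)  # 8192
latent = kv_elements_per_token_latent(latent_dim=512)             # 512
ratio = full / latent                                             # 16.0
```

Because each cached entry is a latent vector rather than per-head K and V, a kernel written for the standard layout cannot read this cache directly, which is why the backend has to change along with the model.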
Rapid-fire
- Model-facing class? Attention (attention.py:177).
- Three backend roles? Backend (names it), Impl (runs it), MetadataBuilder (feeds it).
- Override env var? VLLM_ATTENTION_BACKEND.
- Read map / write map? block table / slot mapping.
- The trick that makes attention O(N) memory? Online softmax.