Phase 04 — Interview Questions: Attention Backends

Q1. Why does vLLM have a pluggable attention-backend system?

Model answer

Attention is one math op, but the fastest (or only available) kernel depends on hardware, dtype, head size, and model features (MLA, sliding window). A pluggable system lets vLLM pick the best kernel per setup (FlashAttention, FlashInfer, Triton, FlashMLA, TRTLLM-GEN) and adopt new ones without touching model code — the model talks only to the Attention layer (attention.py:177), which delegates to the chosen AttentionImpl.
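The delegation described above can be sketched with a toy registry. This is a minimal illustration, not vLLM's code: `ToyAttention`, `ToyImpl`, `TOY_BACKENDS`, and both impl classes are hypothetical names, and the "kernels" are plain Python single-query attention.

```python
import math
from abc import ABC, abstractmethod


def softmax_attention(q, keys, values):
    """Single-query attention: softmax(q.k / sqrt(d)) weighted sum of values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


class ToyImpl(ABC):
    """Hypothetical stand-in for a backend's Impl: it runs the kernel."""

    @abstractmethod
    def forward(self, q, keys, values):
        ...


class ReferenceImpl(ToyImpl):
    def forward(self, q, keys, values):
        return softmax_attention(q, keys, values)


class ReversedOrderImpl(ToyImpl):
    """A second 'kernel' that visits K/V in a different order.

    The math is identical, so the result matches up to float rounding.
    """

    def forward(self, q, keys, values):
        return softmax_attention(q, keys[::-1], values[::-1])


# Toy registry: adding a backend means adding an entry here, nothing else.
TOY_BACKENDS = {"reference": ReferenceImpl, "reversed": ReversedOrderImpl}


class ToyAttention:
    """Model-facing layer: the model calls this and never sees the impl."""

    def __init__(self, backend_name):
        self.impl = TOY_BACKENDS[backend_name]()

    def forward(self, q, keys, values):
        return self.impl.forward(q, keys, values)
```

Swapping `"reference"` for `"reversed"` changes which impl runs but not the code that calls `ToyAttention.forward`, which is the property the answer is after.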

Q2. What does a paged attention kernel need that a dense one doesn't?

Model answer

The block table (logical→physical block, to gather scattered prior KV), the slot mapping (where to write this step's new K/V), and per-request sequence lengths (for varlen batching). These are built each step by the AttentionMetadataBuilder (flash_attn.py:276) from the scheduler's output — the bridge from Phases 2/3 to the kernel.
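A small sketch of the read map and write map, assuming a flat cache indexed by slot and a block size of 4. The helper names (`slot_for`, `write_new_kv`, `gather_kv`) and the sizes are illustrative and are not taken from vLLM.

```python
BLOCK_SIZE = 4  # tokens per physical block (illustrative value)


def slot_for(block_table, position):
    """Physical slot of a token: find its logical block, then the offset in it."""
    physical_block = block_table[position // BLOCK_SIZE]
    return physical_block * BLOCK_SIZE + position % BLOCK_SIZE


def write_new_kv(cache, slot_mapping, new_kv):
    """Write path: slot_mapping[i] is where the i-th new token's K/V goes."""
    for slot, kv in zip(slot_mapping, new_kv):
        cache[slot] = kv


def gather_kv(cache, block_table, seq_len):
    """Read path: walk the block table to collect KV in logical order."""
    return [cache[slot_for(block_table, pos)] for pos in range(seq_len)]


# Four physical blocks; this request owns blocks 3 and 1, in that order.
cache = [None] * (4 * BLOCK_SIZE)
block_table = [3, 1]
tokens = [f"kv{i}" for i in range(6)]  # placeholder K/V entries

slot_mapping = [slot_for(block_table, pos) for pos in range(len(tokens))]
write_new_kv(cache, slot_mapping, tokens)
gathered = gather_kv(cache, block_table, seq_len=len(tokens))
```

Here `slot_mapping` comes out as `[12, 13, 14, 15, 4, 5]`: the first four tokens land in physical block 3 and the next two in block 1, yet `gather_kv` returns them in their original order. The sequence length tells the reader where to stop inside the last, partially filled block.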

Q3. Explain online softmax and why FlashAttention uses it.

Model answer

Naive attention materializes the full N×N score matrix, which costs O(N²) memory. Online softmax streams K/V in tiles and keeps three running quantities: the max score seen so far, the softmax denominator, and the output accumulator, rescaling the last two whenever the max increases. The result is exact softmax-weighted attention without ever storing the score matrix, so memory is O(N) and each tile's working set stays in fast SRAM. That's the "Flash" in FlashAttention, and it's what makes long-context attention feasible. (You implement it in lab-01.)
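A minimal single-query sketch of the idea in plain Python, with precomputed scores standing in for q·k. The function names and the `tile` parameter are illustrative; a real kernel does this per query block on the GPU, and this is not the lab-01 solution.

```python
import math


def naive_attention(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def online_attention(scores, values, tile=2):
    """Same result, but sees only `tile` scores/values at a time."""
    running_max = -math.inf
    denom = 0.0                      # running softmax denominator
    acc = [0.0] * len(values[0])     # running un-normalized output

    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]

        new_max = max(running_max, max(s_tile))
        # Rescale what was accumulated under the old max.
        # On the first tile this is exp(-inf) == 0.0, which is harmless
        # because denom and acc are still zero.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]

        running_max = new_max

    return [a / denom for a in acc]
```

For any tile size, `online_attention` should agree with `naive_attention` up to floating-point rounding. Its state is one max, one denominator, and one output vector, independent of how many tiles have been processed.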

Q4. How and when is the backend chosen?

Model answer

At startup, by get_attn_backend (selector.py:52), from the platform default (platforms/cuda.py etc.), dtype, head size, and model features, with VLLM_ATTENTION_BACKEND as an override. It's fixed for the run because CUDA-graph capture and the metadata builder depend on it (Phase 5). MLA models force an MLA backend due to their different KV layout.
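The precedence described above can be sketched as a toy function. Everything except the environment variable name is an assumption: `choose_backend` is hypothetical, the backend strings are the display names used on this page and may not be the values the real variable accepts, and rejecting an incompatible override for an MLA model is this sketch's choice, not a statement about what vLLM does.

```python
# Toy set of backends that understand the MLA KV layout (illustrative).
MLA_CAPABLE = {"FlashMLA"}


def choose_backend(platform_default, use_mla, environ):
    """Pick one backend name at startup from an override and a default.

    platform_default: name the platform would pick on its own (caller-supplied).
    use_mla: whether the model uses Multi-head Latent Attention.
    environ: a mapping such as os.environ.
    """
    override = environ.get("VLLM_ATTENTION_BACKEND")

    if use_mla:
        # Assumption of this sketch: an MLA model only accepts MLA backends.
        if override is not None and override not in MLA_CAPABLE:
            raise ValueError(f"{override!r} cannot serve an MLA model")
        return override or "FlashMLA"

    # Non-MLA model: the override wins, otherwise use the platform default.
    return override or platform_default


# Resolved once; the rest of the run would reuse this value.
backend = choose_backend("FlashAttention", use_mla=False, environ={})
```

Passing `environ` explicitly keeps the sketch testable; in a real process the mapping would be `os.environ`.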

Q5. What is MLA and why does it need its own backend?

Model answer

Multi-head Latent Attention (DeepSeek) compresses K/V into a shared low-rank latent vector instead of storing full per-head K/V, so the per-token cache scales with the latent dimension rather than with number of heads × head size. Because the cached representation and the attention math differ, it needs a dedicated kernel/backend (FlashMLA) and a different KV cache layout. It is an example of the model's attention design dictating the kernel.
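A back-of-the-envelope sketch of the cache saving. The layer count, head count, head size, and latent dimension below are made-up round numbers, not the configuration of any specific model, and the sketch ignores any extra per-token components a real MLA cache may store.

```python
def standard_elems_per_layer(num_kv_heads, head_dim):
    """Standard attention caches one K and one V vector per KV head."""
    return 2 * num_kv_heads * head_dim


def latent_elems_per_layer(latent_dim):
    """MLA-style caching stores one shared latent vector per token."""
    return latent_dim


def cache_bytes_per_token(num_layers, elems_per_layer, bytes_per_elem=2):
    """Per-token KV cache size across all layers (2 bytes = fp16/bf16)."""
    return num_layers * elems_per_layer * bytes_per_elem


# Hypothetical model shape, chosen only to make the arithmetic easy.
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, LATENT_DIM = 32, 32, 128, 512

standard = cache_bytes_per_token(
    NUM_LAYERS, standard_elems_per_layer(NUM_KV_HEADS, HEAD_DIM))
latent = cache_bytes_per_token(
    NUM_LAYERS, latent_elems_per_layer(LATENT_DIM))
ratio = standard / latent
```

With these numbers the standard cache is 524288 bytes per token and the latent cache is 32768 bytes per token, a 16× reduction. The kernel then has to work from the latent vector rather than from per-head K/V, which is why a generic paged-attention kernel cannot be reused as is.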

Rapid-fire

  • Model-facing class? Attention (attention.py:177).
  • Three backend roles? Backend (names it), Impl (runs it), MetadataBuilder (feeds it).
  • Override env var? VLLM_ATTENTION_BACKEND.
  • Read map / write map? block table / slot mapping.
  • The trick that makes attention O(N) memory? Online softmax.