Phase 04 — Deep Dive: the attention backend system

Paths relative to upstream/ at v0.22.1 @ 0decac0. The attention stack lives across:

vllm/model_executor/layers/attention/attention.py   the Attention nn.Module (model-facing)
vllm/v1/attention/backend.py                         AttentionBackend / AttentionImpl /
                                                      AttentionMetadataBuilder base classes
vllm/v1/attention/selector.py                        get_attn_backend (the picker)
vllm/v1/attention/backends/flash_attn.py             a complete backend, end to end
vllm/v1/attention/backends/{flashinfer,triton_attn}.py, mla/   other families (mla/ is a directory)
vllm/v1/attention/backends/registry.py               name -> backend mapping

1. The model-facing Attention layer
2. The base classes: backend.py
3. A complete backend: FlashAttention
4. The selector: who picks the backend
5. MLA — when the KV layout itself changes
Reading checklist

1. The model-facing `Attention` layer

vllm/model_executor/layers/attention/attention.py:177 defines class Attention(nn.Module, AttentionLayerBase). This is what LlamaAttention.forward called in Phase 0 (self.attn(q, k, v)). Its __init__ (:189) resolves the backend (via the selector) and instantiates an AttentionImpl; its forward (:437) hands q, k, v to that impl. The model talks only to this class and never knows which kernel runs. That decoupling is the whole point: swap the kernel, and the model is untouched.
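To make the decoupling concrete, here is a minimal pure-Python sketch. Every name in it (ToyAttention, NaiveImpl, StreamingImpl, toy_model_forward) is hypothetical and none of it is vLLM code; it only illustrates a layer that resolves an impl once in __init__ and delegates to it in the forward call, so the caller cannot tell which "kernel" ran.

```python
import math


def _softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class NaiveImpl:
    """Hypothetical 'kernel' A: materialize all weights, then mix the values."""

    def forward(self, q, keys, values):
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = _softmax(scores)
        dim = len(values[0])
        return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


class StreamingImpl:
    """Hypothetical 'kernel' B: single pass with a running max (online softmax)."""

    def forward(self, q, keys, values):
        running_max, denom = -math.inf, 0.0
        acc = [0.0] * len(values[0])
        for k, v in zip(keys, values):
            s = sum(a * b for a, b in zip(q, k))
            new_max = max(running_max, s)
            scale = math.exp(running_max - new_max)  # rescale earlier partial sums
            p = math.exp(s - new_max)
            denom = denom * scale + p
            acc = [a * scale + p * x for a, x in zip(acc, v)]
            running_max = new_max
        return [a / denom for a in acc]


class ToyAttention:
    """Model-facing layer: picks an impl once, then only delegates."""

    _IMPLS = {"naive": NaiveImpl, "streaming": StreamingImpl}

    def __init__(self, backend_name):
        self.impl = self._IMPLS[backend_name]()

    def __call__(self, q, keys, values):
        return self.impl.forward(q, keys, values)


def toy_model_forward(attn, q, keys, values):
    # The "model" sees only the layer, never the impl behind it.
    return attn(q, keys, values)


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out_a = toy_model_forward(ToyAttention("naive"), q, keys, values)
out_b = toy_model_forward(ToyAttention("streaming"), q, keys, values)
```

Both calls go through the same toy_model_forward and agree up to floating-point rounding, which is the property the real Attention layer gives the model code.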

2. The base classes: `backend.py`

vllm/v1/attention/backend.py defines the contract every kernel family implements:

AttentionBackend — static methods naming the impl class, the metadata class, supported head sizes/dtypes, and the KV cache shape.
AttentionImpl — the forward(q, k, v, kv_cache, attn_metadata) -> out that runs the kernel (writes new K/V to the cache via slot_mapping, reads prior KV via the block table).
AttentionMetadataBuilder — build(...) turns the per-step scheduler info (sequence lengths, block tables, slot mapping) into the typed metadata the kernel wants.

This three-part split (Backend names it, Impl runs it, Builder feeds it) repeats across every backend file.
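The split can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the class names and the method names (get_name, get_impl_cls, get_builder_cls) are illustrative stand-ins, so check backend.py for the real signatures. The impl here is a placeholder that performs only the cache write and returns q unchanged.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ToyMetadata:
    """Typed per-step inputs for the kernel (illustrative fields)."""
    seq_lens: list
    block_table: list
    slot_mapping: list


class ToyMetadataBuilder:
    """Builder feeds it: scheduler info in, typed metadata out."""

    def build(self, seq_lens, block_table, slot_mapping):
        return ToyMetadata(list(seq_lens),
                           [list(row) for row in block_table],
                           list(slot_mapping))


class ToyImpl:
    """Impl runs it: a placeholder 'kernel' that only does the cache write."""

    def forward(self, q, k, v, kv_cache, attn_metadata):
        for slot, k_tok, v_tok in zip(attn_metadata.slot_mapping, k, v):
            kv_cache[slot] = (k_tok, v_tok)
        return q  # placeholder; a real impl would return attention(q, K, V)


class ToyBackendBase(ABC):
    """Backend names it: static methods only, no per-step state."""

    @staticmethod
    @abstractmethod
    def get_name(): ...

    @staticmethod
    @abstractmethod
    def get_impl_cls(): ...

    @staticmethod
    @abstractmethod
    def get_builder_cls(): ...


class ToyBackend(ToyBackendBase):
    @staticmethod
    def get_name():
        return "TOY"

    @staticmethod
    def get_impl_cls():
        return ToyImpl

    @staticmethod
    def get_builder_cls():
        return ToyMetadataBuilder


def run_step(backend, q, k, v, kv_cache, seq_lens, block_table, slot_mapping):
    # The caller touches only the backend class; it never names Impl or Builder.
    meta = backend.get_builder_cls()().build(seq_lens, block_table, slot_mapping)
    return backend.get_impl_cls()().forward(q, k, v, kv_cache, meta)


cache = {}
out = run_step(ToyBackend, ["q0"], ["k0"], ["v0"], cache, [1], [[3]], [12])
```

Note that run_step receives the backend class itself, not an instance: everything it needs is reachable through static methods, which is what lets a selector return a class.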

3. A complete backend: FlashAttention

vllm/v1/attention/backends/flash_attn.py:

class FlashAttentionBackend(AttentionBackend) (:68) — the registry entry; declares the impl, metadata, and supported configs.
class FlashAttentionMetadata (:223) — the per-step data the kernel needs (block table, seq lens, slot mapping, scheduling for varlen).
class FlashAttentionMetadataBuilder(AttentionMetadataBuilder[...]) (:276) — builds that metadata from the model runner's inputs each step. This is the bridge from Phases 2/3 to the kernel: the block tables you allocated and the scheduled token counts become kernel arguments here.
class FlashAttentionImpl(AttentionImpl) (:592) — forward calls the FlashAttention CUDA kernel (via vllm-flash-attn/flash-attn), passing the paged KV cache + metadata.
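As a rough illustration of what a metadata builder does, the sketch below turns per-request block tables and token counts into flat kernel arguments. It is a toy, not FlashAttentionMetadataBuilder: the function and dataclass names are hypothetical, and the field names (seq_lens, query_start_loc, slot_mapping) are chosen to resemble the ones the guide mentions. It assumes each slot is addressed as block_id * block_size + offset.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class ToyStepMetadata:
    seq_lens: list         # total tokens per request after this step
    query_start_loc: list  # prefix sums over this step's new tokens
    slot_mapping: list     # one physical cache slot per new token
    block_table: list      # per-request list of physical block ids


def build_toy_metadata(block_table, num_computed, num_scheduled, block_size):
    seq_lens = [c + s for c, s in zip(num_computed, num_scheduled)]
    query_start_loc = [0] + list(accumulate(num_scheduled))
    slot_mapping = []
    for row, computed, scheduled in zip(block_table, num_computed, num_scheduled):
        # New tokens occupy logical positions [computed, computed + scheduled).
        for pos in range(computed, computed + scheduled):
            block_id = row[pos // block_size]
            slot_mapping.append(block_id * block_size + pos % block_size)
    return ToyStepMetadata(seq_lens, query_start_loc, slot_mapping, block_table)


# Request 0 already has 3 tokens cached and gets 2 more; request 1 is a fresh
# 3-token prefill. Block ids are arbitrary physical blocks.
meta = build_toy_metadata(
    block_table=[[7, 2], [5]],
    num_computed=[3, 0],
    num_scheduled=[2, 3],
    block_size=4,
)
```

Request 0's second new token lands at logical position 4, which crosses into its second block (id 2), so its slot is 2 * 4 + 0. That block-boundary crossing is the case worth tracing by hand when reading the real builder.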

Read FlashAttentionImpl.forward and find where it (a) writes the new k, v into the KV cache using slot_mapping, and (b) calls the varlen flash-attn function with the block table. Those two calls are the guide's write map (slot_mapping) and read map (block table) in real code.
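The two maps can be modeled without any GPU code. The sketch below is an assumption-laden toy: the cache is a flat Python list, each entry stands in for one token's K/V, and write_kv and read_kv are hypothetical helpers, not functions from flash_attn.py. It shows that writes are addressed by flat slot while reads walk the block table in logical order.

```python
def write_kv(cache, slot_mapping, new_entries):
    """Write map: one flat physical slot per new token."""
    for slot, entry in zip(slot_mapping, new_entries):
        cache[slot] = entry


def read_kv(cache, block_row, seq_len, block_size):
    """Read map: walk logical positions, translating through the block table."""
    history = []
    for pos in range(seq_len):
        block_id = block_row[pos // block_size]
        history.append(cache[block_id * block_size + pos % block_size])
    return history


BLOCK_SIZE = 4
cache = [None] * (8 * BLOCK_SIZE)  # 8 physical blocks of 4 slots each
block_row = [7, 2]                 # this sequence owns physical blocks 7 and 2
tokens = ["kv0", "kv1", "kv2", "kv3", "kv4"]

# Slot for logical position p: block_row[p // BLOCK_SIZE] * BLOCK_SIZE + p % BLOCK_SIZE
slots = [block_row[p // BLOCK_SIZE] * BLOCK_SIZE + p % BLOCK_SIZE
         for p in range(len(tokens))]

write_kv(cache, slots, tokens)
history = read_kv(cache, block_row, len(tokens), BLOCK_SIZE)
```

The physical slots are not contiguous (the fifth token sits in block 2, far from the first four in block 7), yet the read returns the tokens in their original order.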

4. The selector: who picks the backend

vllm/v1/attention/selector.py:52 defines get_attn_backend(...). It considers the platform (current_platform, Phase 17), dtype, head size, whether the model uses MLA or a sliding window, and the VLLM_ATTENTION_BACKEND env override, then returns the backend class. _cached_get_attn_backend (:106) memoizes the result. The platform files (vllm/platforms/cuda.py, rocm.py, cpu.py) supply the per-hardware default, which is why the same model picks FlashAttention on an A100, a Triton or FlashInfer path elsewhere, and a CPU kernel on a laptop.

5. MLA — when the KV layout itself changes

vllm/v1/attention/backends/mla/ holds the MLA backends. MLA (Multi-head Latent Attention, from DeepSeek) compresses K/V into a low-rank latent vector, so the KV cache stores something different and needs its own kernel (FlashMLA). This is why "add a model" (Phase 14) sometimes means "wire up a different attention backend": the model's attention design dictates the KV layout, which in turn dictates the kernel.
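A back-of-the-envelope comparison shows why the layout change matters. The sizes below are illustrative assumptions chosen to resemble a DeepSeek-style configuration, and the extra rope_dim term assumes a small positional key component is cached alongside the latent; neither number comes from this guide, and both helper functions are hypothetical.

```python
def mha_cache_elems_per_token(num_kv_heads, head_dim):
    """Standard layout: one K and one V vector per KV head."""
    return 2 * num_kv_heads * head_dim


def mla_cache_elems_per_token(latent_dim, rope_dim):
    """Latent layout (assumed): one shared latent plus a small positional part."""
    return latent_dim + rope_dim


# Illustrative sizes only, per token per layer.
full = mha_cache_elems_per_token(num_kv_heads=128, head_dim=128)
latent = mla_cache_elems_per_token(latent_dim=512, rope_dim=64)
ratio = full / latent
```

With these assumed sizes the standard layout stores 32768 elements per token per layer against 576 for the latent layout. A cache entry that no longer has a per-head K/V shape is the reason a kernel written for the standard layout cannot be reused.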

Reading checklist

Attention.forward — what does the model pass, and what does it NOT know?
The three base classes in backend.py — Backend vs Impl vs MetadataBuilder.
In FlashAttentionMetadataBuilder.build — which Phase 2/3 outputs become kernel metadata?
In FlashAttentionImpl.forward — find the KV write (slot_mapping) and the paged read (block table).
get_attn_backend — name three factors that change the chosen backend.

Now build it: 02-mini-build.md, then the labs.

vLLM Mastery — From Zero to Maintainer