Phase 04 — Deep Dive: the attention backend system
Paths relative to upstream/ at v0.22.1 @ 0decac0. The attention stack lives across:
- vllm/model_executor/layers/attention/attention.py: the Attention nn.Module (model-facing)
- vllm/v1/attention/backend.py: AttentionBackend / AttentionImpl / AttentionMetadataBuilder base classes
- vllm/v1/attention/selector.py: get_attn_backend (the picker)
- vllm/v1/attention/backends/flash_attn.py: a complete backend, end to end
- vllm/v1/attention/backends/{flashinfer,triton_attn,mla/}.py: other families
- vllm/v1/attention/backends/registry.py: name -> backend mapping
Contents
- 1. The model-facing Attention layer
- 2. The base classes: backend.py
- 3. A complete backend: FlashAttention
- 4. The selector: who picks the backend
- 5. MLA — when the KV layout itself changes
- Reading checklist
1. The model-facing Attention layer
vllm/model_executor/layers/attention/attention.py:177 — class Attention(nn.Module, AttentionLayerBase). This is what LlamaAttention.forward called in Phase 0 (self.attn(q, k, v)). Its __init__ (:189) resolves the backend (via the selector) and instantiates an
AttentionImpl; its forward (:437) hands q,k,v to that impl. The model talks only to this
class — it never knows which kernel runs. That decoupling is the whole point: swap the kernel,
the model is untouched.
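A minimal sketch of that decoupling, using toy stand-ins rather than vLLM's real classes. `ToyAttention`, `NaiveImpl`, and `StableImpl` are hypothetical names invented for this example, and the "kernels" are plain-Python softmax attention over a single query vector. The point is only that the model-facing layer resolves an impl once and then delegates, so swapping the impl never touches the caller.

```python
import math


def _scores(q, keys):
    """Scaled dot-product scores of one query against every key."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def _mix(weights, values):
    """Weighted sum of the value vectors."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


class NaiveImpl:
    """Toy 'kernel' 1: textbook softmax."""

    def forward(self, q, keys, values):
        exps = [math.exp(s) for s in _scores(q, keys)]
        total = sum(exps)
        return _mix([e / total for e in exps], values)


class StableImpl:
    """Toy 'kernel' 2: same math, but subtracts the max score first."""

    def forward(self, q, keys, values):
        scores = _scores(q, keys)
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return _mix([e / total for e in exps], values)


class ToyAttention:
    """Model-facing layer (hypothetical): picks an impl once, then delegates."""

    def __init__(self, impl_cls):
        self.impl = impl_cls()

    def forward(self, q, keys, values):
        # The caller never learns which impl ran.
        return self.impl.forward(q, keys, values)
```

Calling `ToyAttention(NaiveImpl).forward(...)` and `ToyAttention(StableImpl).forward(...)` with the same inputs should agree up to floating-point error, which is the property a real backend swap relies on.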
2. The base classes: backend.py
vllm/v1/attention/backend.py defines the contract every kernel family implements:
- AttentionBackend: static methods naming the impl class, the metadata class, supported head sizes/dtypes, and the KV cache shape.
- AttentionImpl: the forward(q, k, v, kv_cache, attn_metadata) -> out that runs the kernel (writes new K/V to the cache via slot_mapping, reads prior KV via the block table).
- AttentionMetadataBuilder: build(...) turns the per-step scheduler info (sequence lengths, block tables, slot mapping) into the typed metadata the kernel wants.
This three-part split (Backend names it, Impl runs it, Builder feeds it) repeats across every backend file.
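The split can be sketched with three tiny classes and a driver that wires them together. Everything here is a toy: `EchoBackend`, `EchoImpl`, `EchoBuilder`, `EchoMetadata`, and `run_step` are hypothetical names, the static method names are illustrative rather than a copy of the real base-class signatures, and the "kernel" only stores K/V and echoes V back.

```python
from dataclasses import dataclass


@dataclass
class EchoMetadata:
    seq_lens: list[int]       # tokens per sequence after this step
    slot_mapping: list[int]   # flat cache slot for each new token


class EchoBuilder:
    """Builder feeds it: raw per-step info in, typed metadata out."""

    def build(self, seq_lens, slot_mapping):
        return EchoMetadata(list(seq_lens), list(slot_mapping))


class EchoImpl:
    """Impl runs it: a placeholder 'kernel' that writes K/V and returns V."""

    def forward(self, q, k, v, kv_cache, attn_metadata):
        for slot, key, val in zip(attn_metadata.slot_mapping, k, v):
            kv_cache[slot] = (key, val)
        return list(v)


class EchoBackend:
    """Backend names it: static methods pointing at the other two parts."""

    @staticmethod
    def get_name():
        return "ECHO"

    @staticmethod
    def get_impl_cls():
        return EchoImpl

    @staticmethod
    def get_builder_cls():
        return EchoBuilder


def run_step(backend, q, k, v, kv_cache, seq_lens, slot_mapping):
    """Hypothetical driver: only ever talks to the backend's three parts."""
    metadata = backend.get_builder_cls()().build(seq_lens, slot_mapping)
    return backend.get_impl_cls()().forward(q, k, v, kv_cache, metadata)
```

Because `run_step` depends only on the backend's declared parts, passing a different backend class changes the kernel and the metadata shape together without changing the driver.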
3. A complete backend: FlashAttention
vllm/v1/attention/backends/flash_attn.py:
- class FlashAttentionBackend(AttentionBackend) (:68): the registry entry; declares the impl, metadata, and supported configs.
- class FlashAttentionMetadata (:223): the per-step data the kernel needs (block table, seq lens, slot mapping, scheduling for varlen).
- class FlashAttentionMetadataBuilder(AttentionMetadataBuilder[...]) (:276): builds that metadata from the model runner's inputs each step. This is the bridge from Phases 2/3 to the kernel: the block tables you allocated and the scheduled token counts become kernel arguments here.
- class FlashAttentionImpl(AttentionImpl) (:592): forward calls the FlashAttention CUDA kernel (via vllm-flash-attn / flash-attn), passing the paged KV cache + metadata.
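To make the builder's job concrete, here is a rough sketch of how scheduled token counts and block tables could become kernel-style arguments. It is not the real `FlashAttentionMetadataBuilder.build`: `ToyStepMetadata`, `build_toy_metadata`, and all field and parameter names are assumptions made for this example, and plain lists stand in for tensors.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class ToyStepMetadata:
    query_start_loc: list[int]    # where each request's new tokens start in the flat batch
    seq_lens: list[int]           # total tokens per request after this step
    slot_mapping: list[int]       # flat cache slot for every new token
    block_table: list[list[int]]  # per-request list of physical block ids


def build_toy_metadata(num_computed, num_scheduled, block_table, block_size):
    """Turn per-request token counts + block tables into flat kernel inputs."""
    # Cumulative offsets: request i owns flat positions [loc[i], loc[i + 1]).
    query_start_loc = [0] + list(accumulate(num_scheduled))
    seq_lens = [done + new for done, new in zip(num_computed, num_scheduled)]

    slot_mapping = []
    for req, (done, new) in enumerate(zip(num_computed, num_scheduled)):
        for pos in range(done, done + new):
            # Logical position -> physical block -> flat slot.
            block = block_table[req][pos // block_size]
            slot_mapping.append(block * block_size + pos % block_size)

    return ToyStepMetadata(query_start_loc, seq_lens, slot_mapping, block_table)
```

With two requests, one prefilling 3 tokens into block 7 and one decoding its 6th token into block 9 (block size 4), the flat slot list comes out as `[28, 29, 30, 37]`.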
Read FlashAttentionImpl.forward and find where it (a) writes the new k,v into the KV cache using
slot_mapping, and (b) calls the varlen flash-attn function with the block table. Those two calls
are the read/write maps from the guide, live.
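The two maps can be sketched on a toy cache before reading the real code. This assumes a flat Python list as the cache and hypothetical helpers `write_kv` and `read_kv`; the actual implementation works on GPU tensors and fuses the read into the attention kernel.

```python
def write_kv(kv_cache, slot_mapping, new_k, new_v):
    """Write map: each new token's K/V goes to the flat slot named for it."""
    for slot, k, v in zip(slot_mapping, new_k, new_v):
        kv_cache[slot] = (k, v)


def read_kv(kv_cache, block_row, seq_len, block_size):
    """Read map: walk logical positions, resolving each through the block table."""
    out = []
    for pos in range(seq_len):
        block = block_row[pos // block_size]
        out.append(kv_cache[block * block_size + pos % block_size])
    return out


# Usage: a 3-token sequence stored in physical blocks 3 and 0 (block size 2).
BLOCK_SIZE = 2
cache = [None] * (4 * BLOCK_SIZE)
write_kv(cache, slot_mapping=[6, 7, 0],
         new_k=["k0", "k1", "k2"], new_v=["v0", "v1", "v2"])
tokens = read_kv(cache, block_row=[3, 0], seq_len=3, block_size=BLOCK_SIZE)
```

The sequence is physically scattered (slots 6, 7, 0) yet reads back in logical order, which is what lets the block allocator hand out non-contiguous blocks.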
4. The selector: who picks the backend
vllm/v1/attention/selector.py:52 — def get_attn_backend(...). It considers the platform
(current_platform, Phase 17), dtype, head size, whether the model uses MLA / sliding window,
and the VLLM_ATTENTION_BACKEND env override, then returns the backend class. _cached_get_attn_backend
(:106) memoizes it. The platform files (vllm/platforms/cuda.py, rocm.py, cpu.py) provide
the per-hardware default — which is why the same model picks FlashAttention on an A100, a Triton
or FlashInfer path elsewhere, and a CPU kernel on a laptop (Phase 17).
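The shape of such a picker can be sketched as below. The rules are invented for illustration and do not reproduce vLLM's real selection logic: `pick_backend`, `_cached_pick`, the backend names in `TOY_REGISTRY`, the `TOY_ATTENTION_BACKEND` variable, and the head-size constraint are all assumptions.

```python
import os
from functools import lru_cache

# Hypothetical backend names for this sketch only.
TOY_REGISTRY = {"FLASH_ATTN", "TRITON_ATTN", "MLA", "CPU_ATTN"}


@lru_cache(maxsize=None)
def _cached_pick(platform, head_size, use_mla, override):
    """Memoized decision; every input that matters is part of the cache key."""
    if override is not None:
        if override not in TOY_REGISTRY:
            raise ValueError(f"unknown backend override: {override}")
        return override
    if platform == "cpu":
        return "CPU_ATTN"
    if use_mla:
        return "MLA"
    if head_size % 8 != 0:  # made-up constraint standing in for 'unsupported head size'
        return "TRITON_ATTN"
    return "FLASH_ATTN"


def pick_backend(platform, head_size, use_mla=False):
    """Read the env override outside the cache so changes to it are seen."""
    override = os.environ.get("TOY_ATTENTION_BACKEND")
    return _cached_pick(platform, head_size, use_mla, override)
```

Reading the environment variable in the uncached wrapper is a deliberate choice in this sketch: if it were read inside the memoized function, a later change to the variable would be silently ignored for arguments already cached.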
5. MLA — when the KV layout itself changes
vllm/v1/attention/backends/mla/ holds the MLA backends. MLA (DeepSeek) compresses K/V into a
low-rank latent vector, so the KV cache stores something different and needs its own kernel
(FlashMLA). This is why "add a model" (Phase 14) sometimes means "wire up a different attention
backend" — the model's attention design dictates the KV layout dictates the kernel.
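A toy illustration of why the cache contents change, under heavy simplification. The sizes and weight matrices below are made up, `mla_cache_entry` and `expand_kv` are hypothetical helpers, and the sketch ignores details of the real design such as how positional encoding is handled. It only shows that the cache holds one small latent per token, from which K and V are reconstructed at read time.

```python
def matvec(matrix, vec):
    """Plain-Python matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


# Hypothetical toy sizes: hidden 4, latent 2, head dim 4.
HIDDEN, LATENT, HEAD = 4, 2, 4

W_DOWN = [[1, 0, 1, 0],          # LATENT x HIDDEN: compress hidden state
          [0, 1, 0, 1]]
W_UP_K = [[1, 0], [0, 1], [1, 1], [1, -1]]   # HEAD x LATENT: latent -> K
W_UP_V = [[2, 0], [0, 2], [1, 0], [0, 1]]    # HEAD x LATENT: latent -> V


def mla_cache_entry(hidden):
    """What gets cached per token: a single latent vector of length LATENT."""
    return matvec(W_DOWN, hidden)


def expand_kv(latent):
    """What the kernel must recover (or fold into its math) when reading."""
    return matvec(W_UP_K, latent), matvec(W_UP_V, latent)


latent = mla_cache_entry([1, 2, 3, 4])
k, v = expand_kv(latent)
```

Here the cached entry has `LATENT` = 2 numbers per token instead of the `2 * HEAD` = 8 a separate K and V would need, so a kernel written for the usual K/V layout cannot read this cache as-is.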
Reading checklist
- Attention.forward: what does the model pass, and what does it NOT know?
- The three base classes in backend.py: Backend vs Impl vs MetadataBuilder.
- In FlashAttentionMetadataBuilder.build, which Phase 2/3 outputs become kernel metadata?
- In FlashAttentionImpl.forward, find the KV write (slot_mapping) and the paged read (block table).
- get_attn_backend: name three factors that change the chosen backend.
Now build it: 02-mini-build.md, then the labs.