Phase 04 — Deep Dive: the attention backend system

Paths relative to upstream/ at v0.22.1 @ 0decac0. The attention stack lives across:

vllm/model_executor/layers/attention/attention.py   the Attention nn.Module (model-facing)
vllm/v1/attention/backend.py                         AttentionBackend / AttentionImpl /
                                                      AttentionMetadataBuilder base classes
vllm/v1/attention/selector.py                        get_attn_backend (the picker)
vllm/v1/attention/backends/flash_attn.py             a complete backend, end to end
vllm/v1/attention/backends/{flashinfer.py,triton_attn.py,mla/}   other families
vllm/v1/attention/backends/registry.py               name -> backend mapping

1. The model-facing Attention layer

vllm/model_executor/layers/attention/attention.py:177 defines class Attention(nn.Module, AttentionLayerBase). This is what LlamaAttention.forward called in Phase 0 (self.attn(q, k, v)). Its __init__ (:189) resolves the backend (via the selector) and instantiates an AttentionImpl; its forward (:437) hands q, k, v to that impl. The model talks only to this class and never knows which kernel runs. That decoupling is the whole point: swap the kernel and the model is untouched.
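To make the decoupling concrete, here is a minimal pure-Python sketch, not vLLM code. ToyAttention, NaiveImpl, and StreamingImpl are hypothetical names invented for illustration; the two impls compute the same single-query attention in different ways, and the model-facing layer only delegates.

```python
import math
from typing import List

Vector = List[float]


def _score(q: Vector, key: Vector) -> float:
    # Scaled dot product between one query and one key.
    return sum(a * b for a, b in zip(q, key)) / math.sqrt(len(q))


class NaiveImpl:
    """Hypothetical impl: materialize all scores, softmax, weighted sum."""

    def forward(self, q: Vector, k: List[Vector], v: List[Vector]) -> Vector:
        scores = [_score(q, key) for key in k]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [
            sum(e / total * val[d] for e, val in zip(exps, v))
            for d in range(len(v[0]))
        ]


class StreamingImpl:
    """Hypothetical impl: same math, one key at a time with a running max."""

    def forward(self, q: Vector, k: List[Vector], v: List[Vector]) -> Vector:
        running_max = float("-inf")
        denom = 0.0
        acc = [0.0] * len(v[0])
        for key, val in zip(k, v):
            s = _score(q, key)
            new_max = max(running_max, s)
            # Rescale what was accumulated so far to the new max.
            correction = math.exp(running_max - new_max)
            w = math.exp(s - new_max)
            denom = denom * correction + w
            acc = [a * correction + w * x for a, x in zip(acc, val)]
            running_max = new_max
        return [a / denom for a in acc]


class ToyAttention:
    """Model-facing layer: picks an impl once in __init__, then only delegates."""

    _IMPLS = {"naive": NaiveImpl, "streaming": StreamingImpl}

    def __init__(self, backend_name: str) -> None:
        self.impl = self._IMPLS[backend_name]()

    def forward(self, q: Vector, k: List[Vector], v: List[Vector]) -> Vector:
        # The caller never sees which impl runs.
        return self.impl.forward(q, k, v)
```

A caller writes ToyAttention("naive").forward(q, k, v) or ToyAttention("streaming").forward(q, k, v) and gets the same numbers up to floating-point error, which is the property the real layer relies on when it swaps kernels.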

2. The base classes: backend.py

vllm/v1/attention/backend.py defines the contract every kernel family implements:

  • AttentionBackend — static methods naming the impl class, the metadata class, supported head sizes/dtypes, and the KV cache shape.
  • AttentionImpl — the forward(q, k, v, kv_cache, attn_metadata) -> out that runs the kernel (writes new K/V to the cache via slot_mapping, reads prior KV via the block table).
  • AttentionMetadataBuilder.build(...) turns the per-step scheduler info (sequence lengths, block tables, slot mapping) into the typed metadata the kernel wants.

This three-part split (Backend names it, Impl runs it, Builder feeds it) repeats across every backend file.
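The split can be sketched as three small classes. This is a toy under stated assumptions: every name below (ToyBackend, ToyImpl, ToyMetadataBuilder, ToyMetadata, and the static method names) is hypothetical and has not been checked against backend.py, and the toy impl only performs the cache write, not the attention math.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyMetadata:
    seq_lens: List[int]      # tokens per sequence after this step
    slot_mapping: List[int]  # flat cache slot for each new token


class ToyMetadataBuilder:
    """Builder feeds it: packs per-step scheduler info into typed metadata."""

    def build(self, seq_lens: List[int], slot_mapping: List[int]) -> ToyMetadata:
        return ToyMetadata(list(seq_lens), list(slot_mapping))


class ToyImpl:
    """Impl runs it: here only the KV write; a real impl also runs the kernel."""

    def forward(self, k: List[str], v: List[str],
                kv_cache: List[Optional[Tuple[str, str]]],
                attn_metadata: ToyMetadata) -> None:
        for slot, pair in zip(attn_metadata.slot_mapping, zip(k, v)):
            kv_cache[slot] = pair


class ToyBackend:
    """Backend names it: static methods only, no per-step state."""

    @staticmethod
    def get_name() -> str:
        return "TOY"

    @staticmethod
    def get_impl_cls() -> type:
        return ToyImpl

    @staticmethod
    def get_builder_cls() -> type:
        return ToyMetadataBuilder

    @staticmethod
    def get_kv_cache_shape(num_blocks: int, block_size: int) -> Tuple[int, int]:
        return (num_blocks, block_size)


# One step, driven only through the backend's static methods.
num_blocks, block_size = ToyBackend.get_kv_cache_shape(2, 4)
kv_cache: List[Optional[Tuple[str, str]]] = [None] * (num_blocks * block_size)
meta = ToyBackend.get_builder_cls()().build(seq_lens=[2], slot_mapping=[4, 5])
ToyBackend.get_impl_cls()().forward(["k0", "k1"], ["v0", "v1"], kv_cache, meta)
```

The caller holds only the backend class; the impl and builder are reached through it, which is what lets a registry map a name to a whole kernel family.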

3. A complete backend: FlashAttention

vllm/v1/attention/backends/flash_attn.py:

  • class FlashAttentionBackend(AttentionBackend) (:68) — the registry entry; declares the impl, metadata, and supported configs.
  • class FlashAttentionMetadata (:223) — the per-step data the kernel needs (block table, seq lens, slot mapping, scheduling for varlen).
  • class FlashAttentionMetadataBuilder(AttentionMetadataBuilder[...]) (:276) — builds that metadata from the model runner's inputs each step. This is the bridge from Phases 2/3 to the kernel: the block tables you allocated and the scheduled token counts become kernel arguments here.
  • class FlashAttentionImpl(AttentionImpl) (:592) — forward calls the FlashAttention CUDA kernel (via vllm-flash-attn/flash-attn), passing the paged KV cache + metadata.
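As a rough picture of what a metadata builder does each step, the sketch below turns scheduled token counts and block tables into flat kernel arguments. It is a toy, not the FlashAttentionMetadataBuilder: ToyStepMetadata and build_toy_metadata are hypothetical names, the field names are illustrative, and the slot formula assumes a flat cache addressed as block_id * block_size + offset.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import List


@dataclass
class ToyStepMetadata:
    query_start_loc: List[int]    # cumulative offsets of each request's new tokens
    seq_lens: List[int]           # total tokens per request after this step
    block_table: List[List[int]]  # physical block ids per request
    slot_mapping: List[int]       # flat cache slot for every new token
    max_seq_len: int


def build_toy_metadata(num_computed: List[int], num_scheduled: List[int],
                       block_table: List[List[int]],
                       block_size: int) -> ToyStepMetadata:
    slot_mapping: List[int] = []
    for computed, scheduled, blocks in zip(num_computed, num_scheduled, block_table):
        # New tokens occupy logical positions [computed, computed + scheduled).
        for pos in range(computed, computed + scheduled):
            physical_block = blocks[pos // block_size]
            slot_mapping.append(physical_block * block_size + pos % block_size)
    seq_lens = [c + s for c, s in zip(num_computed, num_scheduled)]
    return ToyStepMetadata(
        query_start_loc=list(accumulate(num_scheduled, initial=0)),
        seq_lens=seq_lens,
        block_table=block_table,
        slot_mapping=slot_mapping,
        max_seq_len=max(seq_lens),
    )


# Request 0 is a 5-token prefill; request 1 decodes one token at position 6.
step = build_toy_metadata(
    num_computed=[0, 6],
    num_scheduled=[5, 1],
    block_table=[[2, 0], [1, 3]],
    block_size=4,
)
```

Mixing a prefill and a decode in one call is deliberate: the cumulative offsets are what let a variable-length kernel treat both as one flat batch of query tokens.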

Read FlashAttentionImpl.forward and find where it (a) writes the new k,v into the KV cache using slot_mapping, and (b) calls the varlen flash-attn function with the block table. Those two calls are the read/write maps from the guide, live.
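Before reading the real forward, it may help to see the two maps in isolation. The sketch below is a toy with hypothetical helper names (write_kv, read_kv) and a flat Python list standing in for the paged cache; in the real backend both steps happen on the GPU, and the read is performed inside the attention kernel rather than as a separate gather.

```python
from typing import List, Optional


def write_kv(kv_cache: List[Optional[str]], new_kv: List[str],
             slot_mapping: List[int]) -> None:
    # Write map: the i-th new token lands in the slot the scheduler chose for it.
    for slot, item in zip(slot_mapping, new_kv):
        kv_cache[slot] = item


def read_kv(kv_cache: List[Optional[str]], block_table: List[int],
            seq_len: int, block_size: int) -> List[Optional[str]]:
    # Read map: walk logical positions, translate each through the block table.
    gathered = []
    for pos in range(seq_len):
        physical_block = block_table[pos // block_size]
        gathered.append(kv_cache[physical_block * block_size + pos % block_size])
    return gathered


block_size = 2
cache: List[Optional[str]] = [None] * (4 * block_size)  # 4 physical blocks

# One sequence owns physical blocks 3 and 1, in that logical order.
block_table = [3, 1]
write_kv(cache, ["a", "b", "c"], slot_mapping=[6, 7, 2])
tokens_in_order = read_kv(cache, block_table, seq_len=3, block_size=block_size)
```

Physically the cache holds "c" before "a" and "b", yet the read returns them in logical order, which is all the attention math needs.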

4. The selector: who picks the backend

vllm/v1/attention/selector.py:52 defines get_attn_backend(...). It considers the platform (current_platform, Phase 17), dtype, head size, whether the model uses MLA or a sliding window, and the VLLM_ATTENTION_BACKEND env override, then returns the backend class. _cached_get_attn_backend (:106) memoizes the result. The platform files (vllm/platforms/cuda.py, rocm.py, cpu.py) provide the per-hardware default, which is why the same model picks FlashAttention on an A100, a Triton or FlashInfer path elsewhere, and a CPU kernel on a laptop.

5. MLA — when the KV layout itself changes

vllm/v1/attention/backends/mla/ holds the MLA backends. MLA (DeepSeek) compresses K/V into a low-rank latent vector, so the KV cache stores something different and needs its own kernel (for example FlashMLA). This is why "add a model" (Phase 14) sometimes means "wire up a different attention backend": the model's attention design dictates the KV layout, which in turn dictates the kernel.
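A small sketch of why the cache contents change. The dimensions and matrices below are made up for illustration and are not any real model's configuration; the function names are hypothetical. The point is only that the cache stores the compressed latent, and the key is recovered from it by an up-projection.

```python
from typing import List

Matrix = List[List[int]]


def matvec(weights: Matrix, x: List[int]) -> List[int]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def standard_elems_per_token(num_kv_heads: int, head_dim: int) -> int:
    # A conventional cache stores one K and one V vector per KV head.
    return 2 * num_kv_heads * head_dim


def latent_elems_per_token(latent_dim: int) -> int:
    # A latent cache stores one shared compressed vector per token.
    return latent_dim


# Illustrative sizes only (assumptions, not a real config).
standard = standard_elems_per_token(num_kv_heads=32, head_dim=128)
latent = latent_elems_per_token(latent_dim=512)

# Toy low-rank path: hidden (4) -> latent (2) -> key (4).
W_DOWN: Matrix = [[1, 0, 0, 0], [0, 1, 0, 0]]
W_UP_K: Matrix = [[1, 0], [0, 1], [1, 1], [1, -1]]


def compress(hidden: List[int]) -> List[int]:
    return matvec(W_DOWN, hidden)   # this is what gets cached


def expand_key(cached_latent: List[int]) -> List[int]:
    return matvec(W_UP_K, cached_latent)  # recovered at attention time
```

Because the cached tensor has a different shape and meaning, a kernel written for per-head K and V blocks cannot read it directly, which is the reason the text gives for MLA needing its own backend family.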

Reading checklist

  • Attention.forward — what does the model pass, and what does it NOT know?
  • The three base classes in backend.py — Backend vs Impl vs MetadataBuilder.
  • In FlashAttentionMetadataBuilder.build — which Phase 2/3 outputs become kernel metadata?
  • In FlashAttentionImpl.forward — find the KV write (slot_mapping) and the paged read (block table).
  • get_attn_backend — name three factors that change the chosen backend.

Now build it: 02-mini-build.md, then the labs.