Phase 04 — Cheatsheet: Attention Backends

The one-liner

Attention is one op; the backend is which kernel computes it. Model code is backend-agnostic; the kernel gets paged-KV metadata (block table + slot mapping + seq lens).

The four roles (`vllm/v1/attention/backend.py`)

Attention layer (model-facing, attention.py:177) → delegates to:
AttentionImpl.forward — runs the kernel (writes KV via slot_mapping, reads via block table)
AttentionBackend — names impl + metadata + supported configs
AttentionMetadataBuilder.build — SchedulerOutput → kernel metadata (the Phase 2/3 → kernel bridge)
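As a rough illustration of how the four roles fit together, here is a toy sketch. Every class and function name below (`ToyMetadata`, `ToyBuilder`, `ToyImpl`, `ToyBackend`, `ToyAttention`, `slot_of`) is hypothetical and borrows only the division of labor from the list above, not vLLM's actual signatures. The "kernel" simply returns the KV entries it would attend over, so the write-by-slot and read-by-block-table paths are visible.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative value)


@dataclass
class ToyMetadata:
    slot_mapping: list  # flat cache slot for each new token's KV
    block_table: list   # physical block ids of the sequence, in logical order
    seq_len: int        # sequence length after this step


def slot_of(pos, block_table):
    """Flat cache slot holding the token at logical position `pos`."""
    return block_table[pos // BLOCK_SIZE] * BLOCK_SIZE + pos % BLOCK_SIZE


class ToyBuilder:
    """Builder role: scheduler decision -> kernel metadata."""

    def build(self, block_table, num_computed, num_new):
        new_positions = range(num_computed, num_computed + num_new)
        slots = [slot_of(p, block_table) for p in new_positions]
        return ToyMetadata(slots, block_table, num_computed + num_new)


class ToyImpl:
    """Impl role: write new KV by slot, then read the sequence by block table."""

    def forward(self, new_kv, kv_cache, meta):
        for slot, kv in zip(meta.slot_mapping, new_kv):
            kv_cache[slot] = kv
        return [kv_cache[slot_of(p, meta.block_table)] for p in range(meta.seq_len)]


class ToyBackend:
    """Backend role: names the impl and builder that belong together."""
    name = "TOY"
    impl_cls = ToyImpl
    builder_cls = ToyBuilder


class ToyAttention:
    """Layer role: model-facing, owns the cache, delegates to the impl."""

    def __init__(self, backend, num_slots):
        self.impl = backend.impl_cls()
        self.kv_cache = [None] * num_slots

    def forward(self, new_kv, meta):
        return self.impl.forward(new_kv, self.kv_cache, meta)


layer = ToyAttention(ToyBackend, num_slots=32)
builder = ToyBackend.builder_cls()

# Prefill 6 tokens of a sequence stored in physical blocks 5 and 2.
meta = builder.build(block_table=[5, 2], num_computed=0, num_new=6)
seen = layer.forward(["kv0", "kv1", "kv2", "kv3", "kv4", "kv5"], meta)

# Decode one more token: only its KV is written, all 7 are read back.
meta2 = builder.build(block_table=[5, 2], num_computed=6, num_new=1)
seen2 = layer.forward(["kv6"], meta2)
```

The model-facing layer never touches block ids or slots. Swapping `ToyBackend` for another backend would change the impl and builder but leave the layer's `forward` call untouched, which is the sense in which model code is backend-agnostic.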

The kernels

| backend | best for |
| --- | --- |
| FlashAttention | general default; online softmax, O(N) memory |
| FlashInfer | serving, paged KV, high concurrency / many decodes |
| Triton | portable fallback |
| FlashMLA | MLA models (DeepSeek): low-rank latent KV |
| TRTLLM-GEN | NVIDIA TensorRT-LLM generated, GPU/precision-tuned |

Online softmax (why "Flash")

Per tile: keep a running max, rescale the earlier partial sums whenever the max grows, then accumulate → exact softmax in O(N) memory, with no N×N score matrix ever materialized.
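The running max / rescale / accumulate recipe can be sketched for a single query row in plain Python. This is a minimal illustration and not the FlashAttention kernel: the function names and the `tile` parameter are made up for the example, the 1/sqrt(d) scaling is omitted, and a naive version is included only to check that the streamed result matches.

```python
import math


def attention_row_online(q, keys, values, tile=2):
    """softmax(q . K^T) V for one query row, streaming over K/V in tiles."""
    m = -math.inf                    # running max of scores seen so far
    denom = 0.0                      # running sum of exp(score - m)
    acc = [0.0] * len(values[0])     # running sum of exp(score - m) * value
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescale old partial sums to the new max
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - m_new)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]


def attention_row_naive(q, keys, values):
    """Reference: materialize every score for the row, then normalize."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    mx = max(scores)
    weights = [math.exp(s - mx) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [4.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

online = attention_row_online(q, keys, values, tile=2)
naive = attention_row_naive(q, keys, values)
```

The streamed version holds only `m`, `denom`, and `acc` between tiles, so its working state does not grow with the number of keys; the result agrees with the naive version up to floating-point rounding.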

Selection

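get_attn_backend (selector.py:52) picks the backend from platform default + dtype + head size + features; override with the environment variable VLLM_ATTENTION_BACKEND=FLASH_ATTN|FLASHINFER|TRITON_ATTN|.... The choice is fixed for the whole run (CUDA graphs are captured with the chosen kernels).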
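Because the backend is fixed for the run, the override has to be in the environment before the engine starts. A small sketch of one way to arrange that from Python: `env_with_backend` is a hypothetical helper, not part of vLLM, and whether your installed version still honors this variable is something to check against its documentation.

```python
import os


def env_with_backend(backend, base=None):
    """Return a copy of the environment with VLLM_ATTENTION_BACKEND set.

    Hypothetical helper for illustration; it only builds the mapping.
    """
    env = dict(os.environ if base is None else base)
    env["VLLM_ATTENTION_BACKEND"] = backend
    return env


env = env_with_backend("FLASHINFER")
# Hand `env` to subprocess.run(..., env=env) when launching the server,
# so the choice is already in place when backend selection runs.
```

Building a copy rather than mutating `os.environ` keeps the override scoped to the child process that is about to be launched.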

Key upstream

model_executor/layers/attention/attention.py:177 Attention · :437 forward
v1/attention/backend.py base classes · v1/attention/selector.py:52 selector
v1/attention/backends/flash_attn.py :68 Backend :223 Metadata :276 Builder :592 Impl
v1/attention/backends/mla/ MLA · registry.py name→backend

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

vLLM Mastery — From Zero to Maintainer