Phase 04 — Cheatsheet: Attention Backends
Contents
- The one-liner
- The four roles (vllm/v1/attention/backend.py)
- The kernels
- Online softmax (why "Flash")
- Selection
- Key upstream
The one-liner
Attention is one op; the backend is which kernel computes it. Model code is backend-agnostic; the kernel gets paged-KV metadata (block table + slot mapping + seq lens).
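To make the paged-KV metadata concrete, here is a minimal sketch of how a block table turns a token position into a flat cache slot. The helper name `slot_for`, the block size of 4, and the physical block ids are illustrative assumptions, not vLLM code.

```python
# Toy illustration of paged-KV addressing; all names and sizes are assumptions.
BLOCK_SIZE = 4  # tokens per KV block (hypothetical value)


def slot_for(block_table, pos, block_size=BLOCK_SIZE):
    """Map a token position within one sequence to a flat slot in the KV cache."""
    physical_block = block_table[pos // block_size]  # logical block -> physical block
    return physical_block * block_size + pos % block_size


# One sequence of 6 tokens whose logical blocks 0 and 1 live in physical blocks 7 and 2.
block_table = [7, 2]
seq_len = 6
slot_mapping = [slot_for(block_table, p) for p in range(seq_len)]
print(slot_mapping)  # → [28, 29, 30, 31, 8, 9]
```

The sequence is contiguous from the model's point of view, while its KV entries sit in whatever physical blocks were free. The kernel needs only the block table, the slot mapping, and the sequence length to find them.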
The four roles (vllm/v1/attention/backend.py)
Attention layer (model-facing, attention.py:177) → delegates to:
- AttentionImpl.forward: runs the kernel (writes KV via slot_mapping, reads via block table)
- AttentionBackend: names impl + metadata + supported configs
- AttentionMetadataBuilder.build: SchedulerOutput → kernel metadata (the Phase 2/3 → kernel bridge)
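The division of labour above can be sketched with toy stand-ins. Every `Toy*` class, the dict-shaped scheduler output, and the block size are hypothetical. The real base classes in backend.py have richer signatures, and this sketch only moves KV entries around rather than computing attention.

```python
from dataclasses import dataclass

BLOCK_SIZE = 2  # hypothetical tokens per KV block


@dataclass
class ToyMetadata:
    block_table: list   # logical block index -> physical block id
    slot_mapping: list  # flat cache slot for each new token
    seq_len: int        # tokens in the sequence after this step


class ToyBuilder:
    """Stands in for AttentionMetadataBuilder: scheduler output -> metadata."""

    def build(self, sched):
        return ToyMetadata(sched["block_table"], sched["slot_mapping"], sched["seq_len"])


class ToyImpl:
    """Stands in for AttentionImpl: the only role that touches the KV cache."""

    def forward(self, new_kv, kv_cache, md):
        for slot, kv in zip(md.slot_mapping, new_kv):  # write via slot_mapping
            kv_cache[slot] = kv
        # Read via block table: walk the sequence's blocks in logical order.
        return [
            kv_cache[md.block_table[p // BLOCK_SIZE] * BLOCK_SIZE + p % BLOCK_SIZE]
            for p in range(md.seq_len)
        ]


class ToyBackend:
    """Stands in for AttentionBackend: names the impl and builder to use."""

    impl_cls, builder_cls = ToyImpl, ToyBuilder


class ToyAttentionLayer:
    """Stands in for the model-facing Attention layer: it only delegates."""

    def __init__(self, backend):
        self.impl = backend.impl_cls()

    def forward(self, new_kv, kv_cache, md):
        return self.impl.forward(new_kv, kv_cache, md)


sched = {"block_table": [3, 1], "slot_mapping": [6, 7, 2], "seq_len": 3}
md = ToyBackend.builder_cls().build(sched)
kv_cache = [None] * 8
out = ToyAttentionLayer(ToyBackend).forward(["kv0", "kv1", "kv2"], kv_cache, md)
print(out)  # → ['kv0', 'kv1', 'kv2']
```

The layer never sees block ids or slots. Swapping `ToyBackend` for another backend changes the kernel without touching model code, which is the point of the split.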
The kernels
| backend | best for |
|---|---|
| FlashAttention | general default; online softmax, O(N) memory |
| FlashInfer | serving, paged KV, high concurrency / many decodes |
| Triton | portable fallback |
| FlashMLA | MLA models (DeepSeek) — low-rank latent KV |
| TRTLLM-GEN | NVIDIA TensorRT-LLM generated, GPU/precision-tuned |
Online softmax (why "Flash")
running max + rescale + accumulate per tile → exact softmax in O(N) memory, no N×N matrix.
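The running max, rescale, and accumulate steps can be sketched for a single query row in plain Python. This is a scalar-valued toy that assumes a tile size of 2. It is not a kernel, and the function names are made up for illustration.

```python
import math


def online_softmax_attention(scores, values, tile=2):
    """Softmax(scores)-weighted sum of values, consuming one tile at a time."""
    m = float("-inf")  # running max of the scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))
        scale = math.exp(m - m_new)  # rescale old state to the new max
        l = l * scale + sum(math.exp(s - m_new) for s in s_tile)
        acc = acc * scale + sum(math.exp(s - m_new) * v for s, v in zip(s_tile, v_tile))
        m = m_new
    return acc / l


def reference(scores, values):
    """Two-pass softmax that needs every score at once, for comparison."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)


scores = [1.0, 3.0, 2.0, 5.0, 0.5]
values = [10.0, 20.0, 30.0, 40.0, 50.0]
tiled = online_softmax_attention(scores, values)
exact = reference(scores, values)
print(abs(tiled - exact) < 1e-9)
```

Only `m`, `l`, and `acc` persist between tiles, so the full row of scores never has to be materialised. The result matches the two-pass version up to floating-point rounding.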
Selection
get_attn_backend (selector.py:52) ← platform default + dtype + head size + features; override
with VLLM_ATTENTION_BACKEND=FLASH_ATTN|FLASHINFER|TRITON_ATTN|.... Fixed for the run (CUDA graphs).
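The override-beats-default rule can be sketched as a toy function. `pick_backend` and its arguments are hypothetical. The real get_attn_backend also weighs dtype, head size, and features, which this sketch collapses into a `supported` set.

```python
import os


def pick_backend(platform_default, supported, env=None):
    """Toy selection: an explicit override wins, otherwise the platform default."""
    env = os.environ if env is None else env
    name = env.get("VLLM_ATTENTION_BACKEND")
    if name is None:
        return platform_default
    if name not in supported:
        raise ValueError(f"unsupported backend override: {name}")
    return name


supported = {"FLASH_ATTN", "FLASHINFER", "TRITON_ATTN"}
# Env dicts are passed explicitly so the example does not depend on the real environment.
no_override = pick_backend("FLASH_ATTN", supported, env={})
with_override = pick_backend(
    "FLASH_ATTN", supported, env={"VLLM_ATTENTION_BACKEND": "FLASHINFER"}
)
print(no_override, with_override)  # → FLASH_ATTN FLASHINFER
```

In practice the variable has to be set before the engine starts, since the choice is made once and then held for the run.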
Key upstream
- model_executor/layers/attention/attention.py:177 Attention · :437 forward
- v1/attention/backend.py base classes · v1/attention/selector.py:52 selector
- v1/attention/backends/flash_attn.py:68 Backend · :223 Metadata · :276 Builder · :592 Impl
- v1/attention/backends/mla/ MLA · registry.py name→backend
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md