Phase 04 — Cheatsheet: Attention Backends

The one-liner

Attention is one op; the backend is which kernel computes it. Model code is backend-agnostic; the kernel gets paged-KV metadata (block table + slot mapping + seq lens).

The four roles (vllm/v1/attention/backend.py)

  • Attention layer (model-facing, attention.py:177) → delegates to:
  • AttentionImpl.forward — runs the kernel (writes KV via slot_mapping, reads via block table)
  • AttentionBackend — names impl + metadata + supported configs
  • AttentionMetadataBuilder.build — SchedulerOutput → kernel metadata (the Phase 2/3 → kernel bridge)
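The block table / slot mapping relationship can be sketched in a few lines. This is a toy illustration in plain Python, not vLLM's builder code: the function names, the block size of 4, and the block ids are assumptions made up for the example. It assumes the usual paged-KV layout, where a token at position `pos` lives in physical block `block_table[pos // block_size]` at offset `pos % block_size`.

```python
def slot_for_position(block_table, block_size, pos):
    """Map a token position within one sequence to a flat KV-cache slot.

    block_table[i] is the physical block id backing logical block i.
    (Hypothetical helper, for illustration only.)
    """
    logical_block, offset = divmod(pos, block_size)
    return block_table[logical_block] * block_size + offset


def build_slot_mapping(block_table, block_size, start, num_new_tokens):
    """Slots the kernel would write the new tokens' K/V into."""
    return [
        slot_for_position(block_table, block_size, pos)
        for pos in range(start, start + num_new_tokens)
    ]


# One sequence whose logical blocks 0, 1, 2 live in physical blocks 7, 2, 9.
block_table = [7, 2, 9]
block_size = 4

# Three new tokens at positions 3, 4, 5 (crossing a block boundary).
slots = build_slot_mapping(block_table, block_size, start=3, num_new_tokens=3)
print(slots)  # [31, 8, 9]
```

Writes go through the slot mapping (one flat index per new token); reads walk the block table up to the sequence length, which is why the kernel needs all three pieces of metadata.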

The kernels

| backend | best for |
| --- | --- |
| FlashAttention | general default; online softmax, O(N) memory |
| FlashInfer | serving, paged KV, high concurrency / many decodes |
| Triton | portable fallback |
| FlashMLA | MLA models (DeepSeek); low-rank latent KV |
| TRTLLM-GEN | NVIDIA TensorRT-LLM generated, GPU/precision-tuned |

Online softmax (why "Flash")

running max + rescale + accumulate per tile → exact softmax in O(N) memory, no N×N matrix.
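A minimal sketch of that recurrence for a single query, in plain Python so it runs anywhere. It is not a real kernel: the tile size, the function names, and the omission of the 1/sqrt(d) scale are simplifications assumed for the example. It processes keys and values one tile at a time and is checked against a reference that materializes all scores.

```python
import math


def attention_reference(q, keys, values):
    """Standard softmax attention for one query: materializes every score."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


def attention_online(q, keys, values, tile=2):
    """Same result, but only one tile of scores exists at a time."""
    running_max = float("-inf")
    denom = 0.0                      # running softmax denominator
    acc = [0.0] * len(values[0])     # running weighted sum of values
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]

        # New max, then rescale everything accumulated under the old max.
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # 0.0 on the first tile
        denom *= scale
        acc = [a * scale for a in acc]

        # Accumulate this tile under the new max.
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

ref = attention_reference(q, keys, values)
out = attention_online(q, keys, values, tile=2)
assert all(abs(a - b) < 1e-9 for a, b in zip(ref, out))
```

The state carried between tiles is just the running max, the denominator, and one output-sized accumulator, which is where the O(N) memory bound comes from.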

Selection

get_attn_backend (selector.py:52) ← platform default + dtype + head size + features; override with VLLM_ATTENTION_BACKEND=FLASH_ATTN|FLASHINFER|TRITON_ATTN|... The backend is fixed for the whole run, since CUDA graphs capture the chosen kernels.
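One way to apply the override from Python, sketched under assumptions: `force_attention_backend` is a hypothetical helper (not part of vLLM), and `KNOWN_BACKENDS` holds only the three names spelled out above, not the full set the registry accepts. It also assumes the variable has to be set before the engine is constructed, since the backend is chosen once; whether a given vLLM version still honors the variable should be checked against its docs.

```python
import os

# Only the names listed in this cheatsheet; the real registry has more.
KNOWN_BACKENDS = ("FLASH_ATTN", "FLASHINFER", "TRITON_ATTN")


def force_attention_backend(name, env=os.environ):
    """Set VLLM_ATTENTION_BACKEND, rejecting names not listed above.

    Hypothetical convenience wrapper; call it before creating the engine.
    """
    if name not in KNOWN_BACKENDS:
        raise ValueError(
            f"unknown backend {name!r}; expected one of {KNOWN_BACKENDS}"
        )
    env["VLLM_ATTENTION_BACKEND"] = name
    return name


# Demonstrated on a plain dict so the example has no side effects.
demo_env = {}
force_attention_backend("FLASHINFER", env=demo_env)
print(demo_env)  # {'VLLM_ATTENTION_BACKEND': 'FLASHINFER'}
```

The shell equivalent is to prefix the launch command with `VLLM_ATTENTION_BACKEND=FLASHINFER`.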

Key upstream

  • model_executor/layers/attention/attention.py:177 Attention · :437 forward
  • v1/attention/backend.py base classes · v1/attention/selector.py:52 selector
  • v1/attention/backends/flash_attn.py :68 Backend :223 Metadata :276 Builder :592 Impl
  • v1/attention/backends/mla/ MLA · registry.py name→backend

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md