Lab 04-02 — Backend Selection Matrix [GPU-OPT]
vLLM doesn't have an attention kernel; it has a stable of them — FlashAttention, FlashInfer, Triton, FlashMLA, TRTLLM-GEN, per-platform CPU/ROCm/TPU variants — and a selector that picks one at startup based on your GPU, dtype, model architecture, and features. That choice is invisible when it's right and bewildering when it's wrong, and "wrong" here means anything from a 20% throughput gap to a crash on an exotic head size. In this lab you run the selector, override it, read its source, and build the (GPU, dtype, model) → backend table that lets you answer — from memory, in an incident — "which kernel is this deployment actually running, and what else could it run?"
No GPU? Don't panic. The captured output below is the experiment; the selection logic (selector.py:52) is the lesson, and it reads the same on a laptop.
Contents
- Why this lab exists
- Background: why so many backends
- Requirements
- Steps
- Captured output (real run, L4, vLLM 0.22.1, trimmed)
- Build the matrix (your deliverable)
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Every component you've studied so far had one implementation. Attention is where vLLM
becomes a dispatcher, and dispatchers are where production surprises live: the same
model, same config, same vLLM version runs different kernels on an A100 vs an H100 vs
an RTX 4090 — different performance, different numerics in the last ulp, occasionally
different bugs. When a user reports "works on my machine, garbage on the cluster," the
backend matrix is the first thing a maintainer checks, and VLLM_ATTENTION_BACKEND is the
first bisection tool they reach for. This lab is that reflex, installed.
It's also your map for the rest of the phase: the deep-dive walks the backends' implementations; this lab establishes which of them you're ever actually running and what forces the exceptions (MLA models, head sizes, dtypes, platforms).
Background: why so many backends
Because "attention" is several workloads wearing one name, and the optimal kernel differs per (shape × hardware × feature):
- FlashAttention (FA2/FA3) — the battle-tested default for standard transformers on NVIDIA; hand-tuned prefill and decode paths, broad feature support. FA3 exploits Hopper-specific hardware (TMA, warpgroup MMA), which is why the GPU generation enters the selector.
- FlashInfer — plan-based kernels with strengths vLLM's defaults lack in places: cascade/shared-prefix attention (lab-04's merge!), aggressive split-k, customizable masking. Often the win for high-concurrency or shared-prefix workloads — measure, don't assume (Phase 18).
- Triton backend — portable, readable, JIT-compiled; the fallback when the hand-written kernels lack your head size/feature combo, and the reference implementation you can actually modify (it's the closest production cousin of your lab-01 code).
- FlashMLA / TRTLLM-GEN — DeepSeek-style MLA models compress KV into a low-rank latent; the cache layout itself is different, so standard kernels can't read it at all. Architecture doesn't just prefer a backend — it can force one.
- Platform backends (CPU, ROCm, TPU — Phase 17) — different ISAs entirely.
The selector (get_attn_backend, upstream/vllm/v1/attention/selector.py:52) resolves:
explicit override → platform default chain → capability checks (dtype, head size,
sliding window, MLA) → fallback. Selection happens once, at startup — the backend's
metadata builder and CUDA-graph shapes (Phase 5) are baked for the engine's lifetime.
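The resolution order above can be sketched as a toy model. This is not vLLM's code: the capability table, the platform chains, the `Request` type, and the placeholder names `FLASH_ATTN` and `CPU_ATTN` are all invented for illustration, and real support matrices live in the backends themselves. Only the shape of the chain (override, then platform defaults filtered by capability, with the last entry acting as the fallback) is taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical bundle of the inputs the selector looks at."""
    platform: str                 # "cuda", "cpu", ...
    dtype: str                    # "bf16", "fp16", "fp32"
    head_size: int
    use_mla: bool = False
    override: Optional[str] = None


# Illustrative capability table (assumed values, NOT vLLM's real matrix).
# head_sizes=None means "any head size".
CAPS = {
    "FLASHMLA":    {"dtypes": {"bf16", "fp16"}, "head_sizes": None, "mla": True},
    "FLASH_ATTN":  {"dtypes": {"bf16", "fp16"}, "head_sizes": {64, 128, 256}, "mla": False},
    "TRITON_ATTN": {"dtypes": {"bf16", "fp16", "fp32"}, "head_sizes": None, "mla": False},
    "CPU_ATTN":    {"dtypes": {"bf16", "fp32"}, "head_sizes": None, "mla": False},
}

# Illustrative per-platform preference order; the last entry is the fallback.
PLATFORM_CHAIN = {
    "cuda": ["FLASHMLA", "FLASH_ATTN", "TRITON_ATTN"],
    "cpu": ["CPU_ATTN"],
}


def supports(name: str, req: Request) -> bool:
    """Capability check: dtype, head size, and MLA layout must all match."""
    caps = CAPS[name]
    if req.dtype not in caps["dtypes"]:
        return False
    if caps["head_sizes"] is not None and req.head_size not in caps["head_sizes"]:
        return False
    return caps["mla"] == req.use_mla


def select(req: Request) -> str:
    """Explicit override first, then the platform chain filtered by capability."""
    if req.override is not None:
        if req.override not in CAPS or not supports(req.override, req):
            # An invalid override is an error, not a silent fallback.
            raise ValueError(f"{req.override} cannot serve {req}")
        return req.override
    for name in PLATFORM_CHAIN[req.platform]:
        if supports(name, req):
            return name
    raise ValueError(f"no backend on {req.platform} can serve {req}")
```

In this toy, `select(Request("cuda", "bf16", 96))` skips the hand-written entry and lands on the Triton fallback, while an MLA request can only ever match the MLA entry. That mirrors the "force" versus "filter" distinction drawn above.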
Requirements
uv pip install -e ".[vllm]"
Steps
- Let vLLM pick (read the startup line naming the backend):
python -c "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"
- Force alternatives and confirm the engine obeys:
VLLM_ATTENTION_BACKEND=FLASHINFER python -c "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"
VLLM_ATTENTION_BACKEND=TRITON_ATTN python -c "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"
Also try forcing something invalid for your setup (e.g. FLASHMLA on a non-MLA
model) and read the error — the selector's failure messages are part of its interface,
and you want to have seen them before an incident shows them to you.
- Read the source next to the log:
selector.py:52 (get_attn_backend) and the platform default chain in upstream/vllm/platforms/cuda.py. For your GPU + dtype + two or three models, predict the choice before running; the lab is passed when your predictions stop missing.
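The let-it-pick and force-and-confirm steps above can be scripted so each backend gets a clean, comparable run. The sketch below reuses the exact one-liner from the steps and the `VLLM_ATTENTION_BACKEND` variable; the helper names `run_spec` and `run_probe` are made up for this example, and it assumes vLLM is installed in the interpreter that runs it.

```python
import os
import subprocess
import sys

# Same probe as the manual steps: constructing the engine prints the backend line.
PROBE = "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"


def run_spec(backend=None, base_env=None):
    """Build (command, environment) for one probe run.

    backend=None means "let vLLM pick": any inherited override is removed so a
    stale shell variable cannot skew the default run.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.pop("VLLM_ATTENTION_BACKEND", None)
    if backend is not None:
        env["VLLM_ATTENTION_BACKEND"] = backend
    return [sys.executable, "-c", PROBE], env


def run_probe(backend=None, timeout_s=600):
    """Run the probe in a fresh process; return (exit code, combined log text)."""
    cmd, env = run_spec(backend)
    proc = subprocess.run(cmd, env=env, capture_output=True, text=True, timeout=timeout_s)
    return proc.returncode, proc.stdout + proc.stderr


# Usage (requires vLLM and suitable hardware; not executed here):
#   for backend in (None, "FLASHINFER", "TRITON_ATTN"):
#       code, log = run_probe(backend)
#       print(backend, code, log[-500:])
```

A fresh process per run matters because selection happens once at startup; changing the variable inside a process that already built an engine would not be a fair test.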
Captured output (real run, L4, vLLM 0.22.1, trimmed)
# default:
INFO ... Using Flash Attention backend.
# VLLM_ATTENTION_BACKEND=FLASHINFER:
INFO ... Using FlashInfer backend.
# VLLM_ATTENTION_BACKEND=TRITON_ATTN:
INFO ... Using Triton backend.
# a DeepSeek (MLA) model, default:
INFO ... Using FlashMLA backend. # MLA models force an MLA backend (different KV layout)
One line, easily scrolled past — but it names the code that will execute the hottest loop of the deployment several thousand times per second. Operators should log-grep for it on every rollout; version upgrades do change defaults, silently (selection logic and backend names both drift across releases — anchor on the mechanism, not the strings).
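The log-grep habit can be sketched in a few lines. The pattern below is fitted only to the captured lines shown above ("Using ... backend."); as already noted, those strings drift across releases, so treat the regex as an assumption to re-check per version. The function name `backends_in_log` is invented for this example.

```python
import re

# Matches the captured format "Using <name> backend"; an assumption, since the
# exact wording is not stable across vLLM releases.
BACKEND_LINE = re.compile(r"Using (?P<name>.+?) backend\b")


def backends_in_log(text):
    """Return every backend name announced in a startup log, in order."""
    names = []
    for line in text.splitlines():
        match = BACKEND_LINE.search(line)
        if match:
            names.append(match.group("name"))
    return names


captured = """\
INFO ... Using Flash Attention backend.
INFO ... Using FlashMLA backend. # MLA models force an MLA backend (different KV layout)
"""
print(backends_in_log(captured))
```

On a rollout, comparing this list against the previous deployment's list is enough to catch a silently changed default.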
Build the matrix (your deliverable)
| GPU | dtype | model feature | chosen backend | why |
|---|---|---|---|---|
| A100/L4 | bf16 | standard | FlashAttention | hand-tuned default for Ampere+ |
| H100 | bf16 | standard | FlashAttention (FA3 path) | Hopper-specific kernels |
| any | any | MLA (DeepSeek) | FlashMLA | latent KV layout — standard kernels can't read it |
| any | any | override set | (the override) | VLLM_ATTENTION_BACKEND wins over the default chain; an invalid override errors out instead of falling back |
| any | any | unsupported head size | Triton fallback | JIT covers shapes hand-written kernels skip |
| CPU | fp32 | standard | the CPU backend | no CUDA; platform chain (Phase 17) |
Extend it with what your hardware shows — the table above is the skeleton; the rows you add from your own runs are the ones you'll remember.
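One way to keep your own rows honest is to record each run as data and render the table from it. This is a minimal sketch: the `Row` type and `to_markdown` helper are hypothetical conveniences, and the example row uses the L4 default seen in the captured run.

```python
from dataclasses import astuple, dataclass

HEADER = ("GPU", "dtype", "model feature", "chosen backend", "why")


@dataclass(frozen=True)
class Row:
    """One observed (GPU, dtype, model feature) -> backend result."""
    gpu: str
    dtype: str
    feature: str
    backend: str
    why: str


def to_markdown(rows):
    """Render rows in the same pipe-table layout as the skeleton above."""
    lines = ["| " + " | ".join(HEADER) + " |", "|" + "---|" * len(HEADER)]
    # dict.fromkeys drops exact duplicates while keeping first-seen order.
    for row in dict.fromkeys(rows):
        lines.append("| " + " | ".join(astuple(row)) + " |")
    return "\n".join(lines)


observed = [
    Row("L4", "bf16", "standard", "FlashAttention", "default on the captured run"),
    Row("L4", "bf16", "standard", "FlashAttention", "default on the captured run"),
]
print(to_markdown(observed))
```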
Hitchhiker's notes
- The override is a bisection tool, not a tuning knob. Mystery garbage output? Flip to TRITON_ATTN: if the garbage persists, it's not the kernel (look at sampling, weights, tokenizer); if it disappears, you've isolated a kernel bug and your issue report writes itself ("FA path wrong for head_size=96 + sliding window; Triton correct"). This two-run dance is the single highest-value habit this lab teaches.
- Backends differ in the last ulp, legitimately. Different tiling = different reduction order = bitwise-different logits (Phase 3 lab-02's softening, kernel edition). Greedy outputs can diverge after enough tokens with no bug anywhere. Don't file that issue; do mention it when comparing backends in evals.
- Why startup-time selection rather than per-request? The backend brings its own metadata builder (the FlashAttentionMetadata of lab-03) and its kernels are baked into CUDA-graph captures (Phase 5); swapping per request would mean re-capturing graphs and rebuilding paged-cache layouts mid-flight. Selection is configuration, not scheduling.
- Capability gaps are normal, not shameful: a brand-new model with head_dim 96, or fp8 KV + sliding window, may be outside the fast path's support matrix and silently fall back to Triton (correct but slower). When throughput regresses after a model swap, check the backend line first; the model may have changed your kernel.
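The first two notes can be made concrete with a small sketch. It shows the two-run decision as a function, a helper that finds where two greedy token sequences part ways, and a plain floating-point example of why reduction order alone changes results. The function names and the token lists are hypothetical, and whether an output counts as "garbage" remains a human judgment the code does not attempt.

```python
def bisect_verdict(default_is_garbage, triton_is_garbage):
    """Encode the two-run dance: same prompt, default backend vs TRITON_ATTN."""
    if not default_is_garbage:
        return "nothing to bisect"
    if triton_is_garbage:
        return "not the kernel: check sampling, weights, tokenizer"
    return "kernel suspect: report the failing backend and config"


def first_divergence(tokens_a, tokens_b):
    """Index of the first differing token, or None if the sequences are identical."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) == len(tokens_b):
        return None
    return min(len(tokens_a), len(tokens_b))  # one sequence is a prefix of the other


# Same three numbers, two summation orders: floating-point addition is not
# associative, which is the root of "different tiling, different logits".
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)

# Hypothetical greedy outputs from two backends on one prompt.
print(first_divergence([11, 42, 7, 99], [11, 42, 8, 13]))
```

A late first-divergence index with sensible text on both sides is the legitimate last-ulp case; garbage from the first few tokens on one backend only is the case worth an issue.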
Reflect
- Your p99-latency-sensitive service runs long-context decode on H100s. Name two backend experiments worth running before touching any other knob, and what you'd measure. (FlashInfer split-k vs FA3 at your concurrency, ITL distributions — lab-04 explains why long decode is where they differ; Phase 18 gives the harness.)
- Why does an MLA model force the backend while sliding-window merely filters candidates? (MLA changes the cache's data layout — incompatible storage; sliding window is a mask variation several backends implement — a feature flag, not a format.)
- The selector consults the platform (cuda.py, rocm.py, cpu.py…) before capability checks. Sketch how a new accelerator vendor slots in without touching the selector: that's Phase 17's plugin architecture, and the reason the chain is shaped this way.
References
- upstream/vllm/v1/attention/selector.py:52 (get_attn_backend, the dispatcher).
- upstream/vllm/platforms/cuda.py (the NVIDIA default chain the selector consults).
- upstream/vllm/v1/attention/backends/ (the stable itself; skim each file's class docstring and you've got the cast list for the deep-dive).
- vLLM docs, Engine Arguments / environment variables (VLLM_ATTENTION_BACKEND and friends): https://docs.vllm.ai/en/latest/serving/engine_args.html
- Ye et al., FlashInfer (2025), what the alternative brings: https://arxiv.org/abs/2501.01005
- Dao, FlashAttention-2/3, what the default brings: https://arxiv.org/abs/2307.08691, https://arxiv.org/abs/2407.08608