Lab 04-02 — Backend Selection Matrix [GPU-OPT]

vLLM doesn't have an attention kernel; it has a stable of them — FlashAttention, FlashInfer, Triton, FlashMLA, TRTLLM-GEN, per-platform CPU/ROCm/TPU variants — and a selector that picks one at startup based on your GPU, dtype, model architecture, and features. That choice is invisible when it's right and bewildering when it's wrong, and "wrong" here means anything from a 20% throughput gap to a crash on an exotic head size. In this lab you run the selector, override it, read its source, and build the (GPU, dtype, model) → backend table that lets you answer — from memory, in an incident — "which kernel is this deployment actually running, and what else could it run?"

No GPU? Don't panic. The captured output below is the experiment; the selection logic (selector.py:52) is the lesson, and it reads the same on a laptop.

Why this lab exists

Every component you've studied so far had one implementation. Attention is where vLLM becomes a dispatcher, and dispatchers are where production surprises live: the same model, same config, same vLLM version runs different kernels on an A100 vs an H100 vs an RTX 4090 — different performance, different numerics in the last ulp, occasionally different bugs. When a user reports "works on my machine, garbage on the cluster," the backend matrix is the first thing a maintainer checks, and VLLM_ATTENTION_BACKEND is the first bisection tool they reach for. This lab is that reflex, installed.

It's also your map for the rest of the phase: the deep-dive walks the backends' implementations; this lab establishes which of them you're ever actually running and what forces the exceptions (MLA models, head sizes, dtypes, platforms).

Background: why so many backends

Because "attention" is several workloads wearing one name, and the optimal kernel differs per (shape × hardware × feature):

  • FlashAttention (FA2/FA3) — the battle-tested default for standard transformers on NVIDIA; hand-tuned prefill and decode paths, broad feature support. FA3 exploits Hopper-specific hardware (TMA, warpgroup MMA), which is why the GPU generation enters the selector.
  • FlashInfer — plan-based kernels with strengths vLLM's defaults lack in places: cascade/shared-prefix attention (lab-04's merge!), aggressive split-k, customizable masking. Often the win for high-concurrency or shared-prefix workloads — measure, don't assume (Phase 18).
  • Triton backend — portable, readable, JIT-compiled; the fallback when the hand-written kernels lack your head size/feature combo, and the reference implementation you can actually modify (it's the closest production cousin of your lab-01 code).
  • FlashMLA / TRTLLM-GEN — DeepSeek-style MLA models compress KV into a low-rank latent; the cache layout itself is different, so standard kernels can't read it at all. Architecture doesn't just prefer a backend — it can force one.
  • Platform backends (CPU, ROCm, TPU — Phase 17) — different ISAs entirely.

The selector (get_attn_backend, upstream/vllm/v1/attention/selector.py:52) resolves: explicit override → platform default chain → capability checks (dtype, head size, sliding window, MLA) → fallback. Selection happens once, at startup — the backend's metadata builder and CUDA-graph shapes (Phase 5) are baked for the engine's lifetime.
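The resolution order above can be sketched as a toy function. This is not vLLM's code: `Request`, `Backend`, `select_backend`, the support rules, and the chain contents are all hypothetical stand-ins, and the backend names only mirror the override strings used in this lab (`FLASH_ATTN` is an assumed spelling). The point is the shape: an override short-circuits the chain but still has to pass the capability check, and the fallback is consulted only after the chain is exhausted.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Request:
    """Toy subset of what a selector inspects."""
    dtype: str
    head_size: int
    use_mla: bool = False


@dataclass(frozen=True)
class Backend:
    name: str
    supports: Callable[[Request], bool]  # capability check for this backend


def select_backend(
    req: Request,
    platform_chain: Sequence[Backend],
    fallback: Backend,
    override: Optional[str] = None,
) -> str:
    known = {b.name: b for b in (*platform_chain, fallback)}

    # 1. Explicit override: honoured if valid, otherwise a loud error.
    if override is not None:
        if override not in known:
            raise ValueError(f"unknown backend {override!r}")
        if not known[override].supports(req):
            raise ValueError(f"{override} cannot serve {req}")
        return override

    # 2 + 3. Platform default chain, filtered by capability checks.
    for backend in platform_chain:
        if backend.supports(req):
            return backend.name

    # 4. Fallback, reached only when the whole chain declined.
    if fallback.supports(req):
        return fallback.name
    raise ValueError(f"no backend can serve {req}")


# Hypothetical support rules, for illustration only.
FLASH_ATTN = Backend("FLASH_ATTN", lambda r: not r.use_mla and r.head_size in (64, 128, 256))
FLASHMLA = Backend("FLASHMLA", lambda r: r.use_mla)
TRITON_ATTN = Backend("TRITON_ATTN", lambda r: not r.use_mla)
CUDA_CHAIN = (FLASH_ATTN, FLASHMLA)
```

With these made-up rules, `select_backend(Request("bf16", 96), CUDA_CHAIN, TRITON_ATTN)` lands on the fallback, while passing `override="FLASHMLA"` for the same request raises instead of silently picking something else.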

Requirements

uv pip install -e ".[vllm]"

Steps

  1. Let vLLM pick (read the startup line naming the backend):
python -c "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"
  2. Force alternatives and confirm the engine obeys:
VLLM_ATTENTION_BACKEND=FLASHINFER  python -c "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"
VLLM_ATTENTION_BACKEND=TRITON_ATTN python -c "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"

Also try forcing something invalid for your setup (e.g. FLASHMLA on a non-MLA model) and read the error — the selector's failure messages are part of its interface, and you want to have seen them before an incident shows them to you.

  3. Read the source next to the log: selector.py:52 (get_attn_backend) and the platform default chain in upstream/vllm/platforms/cuda.py. For your GPU + dtype + two or three models, predict the choice before running; the lab is passed when your predictions stop missing.
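If you would rather script the runs than retype them, a minimal driver might look like the sketch below. It reuses the launch command and the `VLLM_ATTENTION_BACKEND` variable from the steps above; the helper names (`build_run`, `run_once`, `sweep`) are hypothetical, and the "contains the word backend" filter is a crude assumption about the log wording, not a documented format.

```python
import os
import subprocess
import sys

# Same one-liner as in the steps above.
LAUNCH = "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"


def build_run(backend, base_env=None):
    """Return (env, argv) for one engine start; backend=None leaves the selector alone."""
    env = dict(os.environ if base_env is None else base_env)
    env.pop("VLLM_ATTENTION_BACKEND", None)  # never inherit a stale override
    if backend is not None:
        env["VLLM_ATTENTION_BACKEND"] = backend
    return env, [sys.executable, "-c", LAUNCH]


def run_once(backend, timeout_s=600):
    """Start the engine in a child process and return (exit code, combined log)."""
    env, argv = build_run(backend)
    proc = subprocess.run(argv, env=env, capture_output=True, text=True, timeout=timeout_s)
    return proc.returncode, proc.stdout + proc.stderr


def sweep(backends=(None, "FLASHINFER", "TRITON_ATTN")):
    """Run each configuration once and print the log lines that mention a backend."""
    for backend in backends:
        code, log = run_once(backend)
        print(f"[{backend or 'default'}] exit code {code}")
        for line in log.splitlines():
            if "backend" in line.lower():
                print("   ", line)
```

Calling `sweep()` starts one fresh process per backend, which matters because selection happens once at startup. A non-zero exit code for an invalid override is expected; the error text in the log is the part worth reading.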

Captured output (real run, L4, vLLM 0.22.1, trimmed)

# default:
INFO ... Using Flash Attention backend.
# VLLM_ATTENTION_BACKEND=FLASHINFER:
INFO ... Using FlashInfer backend.
# VLLM_ATTENTION_BACKEND=TRITON_ATTN:
INFO ... Using Triton backend.
# a DeepSeek (MLA) model, default:
INFO ... Using FlashMLA backend.       # MLA models force an MLA backend (different KV layout)

One line, easily scrolled past — but it names the code that will execute the hottest loop of the deployment several thousand times per second. Operators should log-grep for it on every rollout; version upgrades do change defaults, silently (selection logic and backend names both drift across releases — anchor on the mechanism, not the strings).
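A rollout check for that line can be a few lines of Python. The sketch below assumes the log wording matches the captured output above ("Using <name> backend"); since the strings drift across releases, treat the regular expression as something to re-verify per version. `find_backend` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed wording, taken from the captured output above; re-check on upgrade.
BACKEND_LINE = re.compile(r"Using (?P<name>.+?) backend\b", re.IGNORECASE)


def find_backend(log_text: str) -> Optional[str]:
    """Return the backend named by the first matching startup line, or None."""
    for line in log_text.splitlines():
        match = BACKEND_LINE.search(line)
        if match:
            return match.group("name")
    return None
```

Returning `None` rather than raising lets the caller decide whether a missing line is a failed rollout or just a changed log format.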

Build the matrix (your deliverable)

| GPU | dtype | model feature | chosen backend | why |
|---|---|---|---|---|
| A100/L4 | bf16 | standard | FlashAttention | hand-tuned default for Ampere+ |
| H100 | bf16 | standard | FlashAttention (FA3 path) | Hopper-specific kernels |
| any | any | MLA (DeepSeek) | FlashMLA | latent KV layout; standard kernels can't read it |
| any | any | override set | (the override) | VLLM_ATTENTION_BACKEND wins over everything |
| any | any | unsupported head size | Triton fallback | JIT covers shapes hand-written kernels skip |
| CPU | fp32 | standard | the CPU backend | no CUDA; platform chain (Phase 17) |

Extend it with what your hardware shows — the table above is the skeleton; the rows you add from your own runs are the ones you'll remember.
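One way to keep your own rows honest is to record the prediction before the run and the observation after it. The sketch below is a hypothetical bookkeeping helper, not part of the lab's repository; `Row`, `normalize`, `misses`, and `render` are illustrative names. Names are compared after stripping case and punctuation, because the table says "FlashAttention" while the log says "Flash Attention".

```python
import re
from dataclasses import dataclass
from typing import List


def normalize(name: str) -> str:
    """Compare backend names by mechanism, not by exact log string."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


@dataclass
class Row:
    gpu: str
    dtype: str
    feature: str
    predicted: str
    observed: str = ""  # fill in after the run
    why: str = ""

    def status(self) -> str:
        if not self.observed:
            return "pending"
        same = normalize(self.predicted) == normalize(self.observed)
        return "hit" if same else "miss"


def misses(rows: List[Row]) -> List[Row]:
    """Rows where the run contradicted the prediction."""
    return [r for r in rows if r.status() == "miss"]


def render(rows: List[Row]) -> str:
    """Plain-text table in the same column order as the matrix above."""
    lines = ["GPU | dtype | model feature | predicted | observed | status | why"]
    for r in rows:
        cells = [r.gpu, r.dtype, r.feature, r.predicted, r.observed or "?", r.status(), r.why]
        lines.append(" | ".join(cells))
    return "\n".join(lines)
```

For example, `Row("L4", "bf16", "standard", predicted="FlashAttention", observed="Flash Attention")` counts as a hit, which matches the default line in the captured output. An empty `misses(rows)` over several models is the pass condition the steps describe.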

Hitchhiker's notes

  • The override is a bisection tool, not a tuning knob. Mystery garbage output? Flip to TRITON_ATTN: if the garbage persists, the kernel is probably not the culprit (look at sampling, weights, tokenizer, or code both backends share); if it disappears, you've isolated the fault to the original backend's path and your issue report writes itself ("FA path wrong for head_size=96 + sliding window; Triton correct"). This two-run dance is the single highest-value habit this lab teaches.
  • Backends differ in the last ulp, legitimately. Different tiling = different reduction order = bitwise-different logits (Phase 3 lab-02's softening, kernel edition). Greedy outputs can diverge after enough tokens with no bug anywhere. Don't file that issue; do mention it when comparing backends in evals.
  • Why startup-time selection rather than per-request? The backend brings its own metadata builder (the FlashAttentionMetadata of lab-03) and its kernels are baked into CUDA-graph captures (Phase 5); swapping per request would mean re-capturing graphs and rebuilding paged-cache layouts mid-flight. Selection is configuration, not scheduling.
  • Capability gaps are normal, not shameful: a brand-new model with head size 96, or fp8 KV + sliding window, may be outside the fast path's support matrix and fall back to Triton without any warning beyond the backend line: correct but slower. When throughput regresses after a model swap, check the backend line first; the model may have changed your kernel.
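The "last ulp" note is easy to see without a GPU. The sketch below is plain Python arithmetic, not a model of any vLLM kernel: it sums the same three numbers in two orders, then feeds the results into a deliberately contrived near-tie to show how a greedy pick can flip. The helper names are hypothetical, and an explicit loop is used rather than the built-in `sum`, whose float handling differs between Python versions.

```python
import math


def sequential_sum(values):
    """Left-to-right float accumulation, one rounding per addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def greedy(logits):
    """Index of the largest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Same operands, two reduction orders.
forward = sequential_sum([0.1, 0.2, 0.3])   # (0.1 + 0.2) + 0.3
backward = sequential_sum([0.3, 0.2, 0.1])  # (0.3 + 0.2) + 0.1

print(forward == backward)              # False: the results differ in the last bit
print(math.isclose(forward, backward))  # True: numerically they agree

# Contrived near-tie: token 0 has a fixed logit, token 1's logit is the sum.
print(greedy([0.6, forward]))   # 1
print(greedy([0.6, backward]))  # 0
```

Real logits are rarely this close, which is why divergence typically shows up only after many tokens; but once one token differs, everything after it is conditioned on a different prefix.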

Reflect

  • Your p99-latency-sensitive service runs long-context decode on H100s. Name two backend experiments worth running before touching any other knob, and what you'd measure. (FlashInfer split-k vs FA3 at your concurrency, ITL distributions — lab-04 explains why long decode is where they differ; Phase 18 gives the harness.)
  • Why does an MLA model force the backend while sliding-window merely filters candidates? (MLA changes the cache's data layout — incompatible storage; sliding window is a mask variation several backends implement — a feature flag, not a format.)
  • The selector consults the platform (cuda.py, rocm.py, cpu.py …) before capability checks. Sketch how a new accelerator vendor slots in without touching the selector — that's Phase 17's plugin architecture, and the reason the chain is shaped this way.

References