Phase 04 — Exercises: Attention Backends

Warm-up (explain)

1. Attention is one operation, so why does vLLM have many attention backends?
2. What three pieces of metadata does a paged attention kernel need, and what is each for?
3. What is online softmax and what problem does it solve?
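
To check your answer to the online softmax question numerically, a small NumPy sketch like the one below may help. It is not vLLM code: the function names, the block size, and the single-query layout are assumptions made for illustration. It streams over the scores in blocks while keeping a running max and a running normalizer, then compares the result with a softmax computed over all scores at once.

```python
import numpy as np

def naive_softmax_weighted_sum(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ values

def online_softmax_weighted_sum(scores, values, block_size=4):
    """Same result, but visiting the scores one block at a time (hypothetical helper)."""
    m = -np.inf                          # running max of the scores seen so far
    l = 0.0                              # running sum of exp(score - m)
    acc = np.zeros(values.shape[1])      # running unnormalized weighted sum
    for start in range(0, len(scores), block_size):
        s = scores[start:start + block_size]
        v = values[start:start + block_size]
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)        # rescales old partial results; 0.0 on the first block
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
scores = rng.normal(size=37) * 50 + 1000   # large values that would overflow a bare exp()
values = rng.normal(size=(37, 8))
print(np.allclose(online_softmax_weighted_sum(scores, values),
                  naive_softmax_weighted_sum(scores, values)))
```

Changing `block_size` should not change the result, which is a useful property to keep in mind when you reach the paged lab exercises.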

4. In Attention.forward (attention.py:437), what does the model pass, and what does it not know about the kernel?
5. Name the three base classes in backend.py and the job of each (Backend / Impl / MetadataBuilder).
6. In FlashAttentionImpl.forward (flash_attn.py:592), find the KV write (slot_mapping) and the paged read (block table). How do they map to Phase 2?
7. List three inputs get_attn_backend (selector.py:52) uses to pick a backend.

8. In lab-01, explain why scattering the logical blocks to arbitrary physical ids doesn't change the output. What does that prove about the kernel's contract?
9. Extend paged_online_attention to multiple query heads (loop or vectorize). Verify against a multi-head dense reference.
10. Add a causal mask variant (a prefill query at position p attends only to tokens at positions ≤ p).
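
For the multi-head and causal exercises above, you need a dense reference to compare against. The sketch below is one possible reference in NumPy, not the lab's own code: the name `dense_mha_reference`, the `[tokens, heads, head_dim]` layout, and the equal number of query and KV heads (no grouped-query attention) are all assumptions. In causal mode it also assumes the queries are the last `num_q` tokens of the sequence.

```python
import numpy as np

def dense_mha_reference(q, k, v, causal=False):
    """Dense multi-head attention (hypothetical reference helper).

    q: [num_q, H, D]; k, v: [num_kv, H, D]; returns [num_q, H, D].
    """
    num_q, _, head_dim = q.shape
    num_kv = k.shape[0]
    # scores[h, i, j] = scaled dot product of query i and key j for head h
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    if causal:
        # Assumption: query i sits at absolute position num_kv - num_q + i.
        q_pos = np.arange(num_kv - num_q, num_kv)[:, None]
        k_pos = np.arange(num_kv)[None, :]
        scores = np.where(k_pos <= q_pos, scores, -np.inf)
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 2, 4))
k = rng.normal(size=(5, 2, 4))
v = rng.normal(size=(5, 2, 4))
out = dense_mha_reference(q, k, v, causal=True)
# With causal masking, the first query can only see token 0.
print(np.allclose(out[0], v[0]))
```

Comparing your paged implementation against this with `np.allclose` on a few random shapes (including a sequence length that is not a multiple of the block size) is usually enough to catch indexing mistakes.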

11. At high concurrency with many short decode requests, FlashInfer often beats FlashAttention. Hypothesize why, and design a benchmark (Phase 18) to confirm it for your workload.
12. You're bringing up a new model with a novel attention mechanism (e.g. a different KV compression scheme). What parts of the backend system must you implement, and what can you reuse?
13. A user reports correct output with VLLM_ATTENTION_BACKEND=TRITON_ATTN but garbage with the default. Outline your debugging path and what it implies about the default kernel.
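
For the FlashInfer vs. FlashAttention benchmark exercise above, a possible starting skeleton for the timing loop is sketched below. It is backend-agnostic and deliberately knows nothing about vLLM: `time_callable`, `compare_backends`, and the runner callables are hypothetical names, and you would supply one callable per backend that runs your actual workload.

```python
import statistics
import time

def time_callable(fn, warmup=3, iters=20):
    """Time a zero-argument callable; returns summary statistics in seconds."""
    for _ in range(warmup):
        fn()                      # warm-up runs are discarded (caches, lazy init, compilation)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()                      # for GPU work, fn must block until the kernels have finished
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "min_s": samples[0],
        "median_s": statistics.median(samples),
        "max_s": samples[-1],
    }

def compare_backends(runners, **kwargs):
    """runners: dict mapping a backend label to a zero-argument callable."""
    return {name: time_callable(fn, **kwargs) for name, fn in runners.items()}

# Stand-in workloads; replace with callables that drive each backend.
results = compare_backends(
    {"backend_a": lambda: sum(range(10_000)), "backend_b": lambda: sum(range(20_000))},
    warmup=1,
    iters=5,
)
for name, stats in results.items():
    print(name, stats)
```

Reporting the median rather than the mean keeps one slow outlier from dominating, and sweeping batch size and sequence length separately makes it easier to see where the two backends cross over.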

Questions 4–7 and 11–13 are interview-grade. Could you draw the layer→impl→kernel path and name the files? If not, re-read 01-deep-dive.md.