Phase 04 — Exercises: Attention Backends
Warm-up (explain)
- Attention is one operation — so why does vLLM have many attention backends?
- What three pieces of metadata does a paged attention kernel need, and what is each for?
- What is online softmax and what problem does it solve?
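
As a starting point for the online softmax question, here is a minimal pure-Python sketch, not vLLM's kernel. It assumes scalar values per token and a hypothetical `chunk_size` parameter standing in for one KV block. It compares a two-pass softmax-weighted sum with a one-pass version that keeps only a running max, a running denominator, and a running accumulator.

```python
import math

def softmax_weighted_sum(scores, values):
    """Two-pass reference: needs every score before it can normalize."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def online_softmax_weighted_sum(scores, values, chunk_size=2):
    """One pass over chunks with a running max, denominator, and accumulator."""
    running_max = float("-inf")
    denom = 0.0  # sum of exp(score - running_max) over tokens seen so far
    acc = 0.0    # sum of exp(score - running_max) * value over tokens seen so far
    for start in range(0, len(scores), chunk_size):
        chunk_scores = scores[start:start + chunk_size]
        chunk_values = values[start:start + chunk_size]
        new_max = max(running_max, max(chunk_scores))
        # Rescale the earlier partial sums to the new reference maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(chunk_scores, chunk_values):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom

# Large scores would overflow a naive exp(score); both versions subtract a max.
scores = [1000.0, 1001.0, 999.0, 1002.5, 998.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
reference = softmax_weighted_sum(scores, values)
streamed = online_softmax_weighted_sum(scores, values, chunk_size=2)
```

The streamed version never holds more than one chunk of scores at a time, which is the property a kernel reading the KV cache block by block relies on.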
Core (trace the code)
- In `Attention.forward` (attention.py:437), what does the model pass, and what does it not know about the kernel?
- Name the three base classes in `backend.py` and the job of each (Backend / Impl / MetadataBuilder).
- In `FlashAttentionImpl.forward` (flash_attn.py:592), find the KV write (slot_mapping) and the paged read (block table). How do they map to Phase 2?
- List three inputs `get_attn_backend` (selector.py:52) uses to pick a backend.
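
The KV write and paged read named in the questions above can be pictured with a toy cache. This is a sketch under stated assumptions, not the code in flash_attn.py: `BLOCK_SIZE`, `write_kv`, `read_kv`, `slots_for`, and `make_cache` are hypothetical names, the cache is a list of Python lists, and a flat slot index is assumed to be `block_id * BLOCK_SIZE + offset`.

```python
BLOCK_SIZE = 4  # tokens per physical block (assumed value for this toy)

def make_cache(num_blocks):
    """Toy KV cache: num_blocks physical blocks of BLOCK_SIZE empty slots."""
    return [[None] * BLOCK_SIZE for _ in range(num_blocks)]

def slots_for(block_table, positions):
    """Flat physical slot index for each logical token position."""
    return [block_table[p // BLOCK_SIZE] * BLOCK_SIZE + p % BLOCK_SIZE
            for p in positions]

def write_kv(cache, slot_mapping, new_entries):
    """Scatter each new token's KV entry into its flat physical slot."""
    for slot, entry in zip(slot_mapping, new_entries):
        block_id, offset = divmod(slot, BLOCK_SIZE)
        cache[block_id][offset] = entry

def read_kv(cache, block_table, seq_len):
    """Gather a sequence's KV entries in logical order via its block table."""
    out = []
    for pos in range(seq_len):
        logical_block, offset = divmod(pos, BLOCK_SIZE)
        out.append(cache[block_table[logical_block]][offset])
    return out

tokens = [f"kv{i}" for i in range(6)]   # six tokens -> two logical blocks
results = []
for block_table in ([0, 1], [7, 3]):    # contiguous vs. scattered physical ids
    cache = make_cache(8)
    write_kv(cache, slots_for(block_table, range(6)), tokens)
    results.append(read_kv(cache, block_table, seq_len=6))
```

Both block tables return the tokens in the same logical order, because the reader only ever reaches physical memory through the table.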
Build (your lab)
- In lab-01, explain why scattering the logical blocks to arbitrary physical ids doesn't change the output. What does that prove about the kernel's contract?
- Extend `paged_online_attention` to multiple query heads (loop or vectorize). Verify against a multi-head dense reference.
- Add a causal mask variant (a prefill query at position `p` attends only to tokens ≤ p).
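
For the verification step, a multi-head dense reference with an optional causal mask might look like the sketch below. It is plain Python over nested lists, assumes the same number of query and KV heads, and uses a hypothetical `q_positions` argument for the absolute position of each query. It does not reflect the signature of the lab's `paged_online_attention`.

```python
import math

def dense_attention(q, k, v, q_positions=None):
    """Dense multi-head attention reference.

    q: [num_heads][num_queries][head_dim]
    k, v: [num_heads][num_tokens][head_dim]
    q_positions: optional absolute position per query; when given, query i
    attends only to tokens at positions <= q_positions[i] (causal mask).
    """
    out = []
    for qh, kh, vh in zip(q, k, v):
        scale = 1.0 / math.sqrt(len(qh[0]))
        head_out = []
        for i, query in enumerate(qh):
            limit = len(kh) if q_positions is None else q_positions[i] + 1
            scores = [scale * sum(a * b for a, b in zip(query, key))
                      for key in kh[:limit]]
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            total = sum(weights)
            # zip stops at len(weights), so masked-out tokens never contribute.
            head_out.append([
                sum(w * val[d] for w, val in zip(weights, vh)) / total
                for d in range(len(vh[0]))
            ])
        out.append(head_out)
    return out

# Two heads, three tokens, head_dim 2; q reuses k purely for brevity.
k = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
     [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]]
v = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
     [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]]
q = k
causal = dense_attention(q, k, v, q_positions=[0, 1, 2])
full = dense_attention(q, k, v)
```

Two cheap sanity checks follow from the mask: the query at position 0 must return the first value vector unchanged, and the query at the last position must match the unmasked result.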
Design (staff-level)
- At high concurrency with many short decode requests, FlashInfer often beats FlashAttention. Hypothesize why, and design a benchmark (Phase 18) to confirm it for your workload.
- You're bringing up a new model with a novel attention mechanism (e.g. a different KV compression scheme). What parts of the backend system must you implement, and what can you reuse?
- A user reports correct output with `VLLM_ATTENTION_BACKEND=TRITON_ATTN` but garbage with the default. Outline your debugging path and what it implies about the default kernel.
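
For the benchmark question, a timing harness could start from something like the sketch below. It uses only the standard library; `bench`, `sweep`, and `make_workload` are hypothetical names, and the workload here is a CPU stand-in. A real run would replace it with one decode step per backend, and the callable would need to block until the GPU work finishes (for example by calling `torch.cuda.synchronize()`), otherwise the timings only measure launch overhead.

```python
import statistics
import time

def bench(fn, warmup=3, iters=20):
    """Time fn() after warmup and return latency stats in milliseconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p99_ms": samples[min(len(samples) - 1, int(0.99 * len(samples)))],
    }

def sweep(make_workload, batch_sizes, **bench_kwargs):
    """Run bench over a grid of concurrency levels.

    make_workload(batch_size) is a hypothetical factory returning a zero-arg
    callable that runs one decode step at that concurrency on one backend.
    """
    return {bs: bench(make_workload(bs), **bench_kwargs) for bs in batch_sizes}

calls = []

def make_workload(batch_size):
    """CPU stand-in for a decode step; records each invocation."""
    def step():
        calls.append(batch_size)
        sum(range(100 * batch_size))
    return step

results = sweep(make_workload, [1, 8], warmup=2, iters=5)
```

Running the same sweep once per backend, with short sequence lengths and a range of batch sizes, gives a per-concurrency comparison rather than a single headline number.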
Self-grading
Questions 4–7 (Core) and 11–13 (Design), counting from the top of the sheet, are interview-grade. Could you draw the layer→impl→kernel path and name the files? If not, re-read 01-deep-dive.md.