Phase 04 — Exercises: Attention Backends

Warm-up (explain)

  1. Attention is one operation — so why does vLLM have many attention backends?
  2. What three pieces of metadata does a paged attention kernel need, and what is each for?
  3. What is online softmax and what problem does it solve?
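
To check your answer to the online softmax question, the following is a minimal sketch, not the kernel's actual code. It assumes scalar values per key to keep the arithmetic visible (real kernels accumulate vectors), and the function names and `chunk_size` parameter are hypothetical. It shows that a single streaming pass with a running max, denominator, and accumulator matches the two-pass result.

```python
import math


def two_pass_softmax_weighted_sum(scores, values):
    """Reference: needs every score available before it can normalize."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total


def online_softmax_weighted_sum(scores, values, chunk_size=2):
    """Streaming version: visits the scores one chunk at a time."""
    running_max = float("-inf")
    denom = 0.0  # running sum of exp(score - running_max)
    acc = 0.0    # running sum of exp(score - running_max) * value
    for start in range(0, len(scores), chunk_size):
        chunk_scores = scores[start:start + chunk_size]
        chunk_values = values[start:start + chunk_size]
        new_max = max(running_max, max(chunk_scores))
        # Rescale everything accumulated under the old max to the new max.
        rescale = math.exp(running_max - new_max)
        denom *= rescale
        acc *= rescale
        for s, v in zip(chunk_scores, chunk_values):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom
```

Each chunk here plays the role of one KV block: the loop never holds more than `chunk_size` scores at once, which is the property a paged kernel relies on.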

Core (trace the code)

  4. In Attention.forward (attention.py:437), what does the model pass, and what does it not know about the kernel?
  5. Name the three base classes in backend.py and the job of each (Backend / Impl / MetadataBuilder).
  6. In FlashAttentionImpl.forward (flash_attn.py:592), find the KV write (slot_mapping) and the paged read (block table). How do they map to Phase 2?
  7. List three inputs get_attn_backend (selector.py:52) uses to pick a backend.
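
While tracing the KV write and the paged read, it can help to have a toy model of the two index structures in mind. The sketch below is not vLLM code: `write_kv`, `read_kv`, `slots_for`, and `BLOCK_SIZE` are hypothetical names, and it assumes a flat slot index equal to `physical_block * BLOCK_SIZE + offset`. It writes the same tokens under two different physical layouts and reads them back in logical order.

```python
BLOCK_SIZE = 4  # tokens per physical block (illustrative value)


def slots_for(block_table, seq_len):
    """Flat physical slot for each logical position (toy slot_mapping)."""
    return [
        block_table[pos // BLOCK_SIZE] * BLOCK_SIZE + pos % BLOCK_SIZE
        for pos in range(seq_len)
    ]


def write_kv(cache, slot_mapping, new_tokens):
    """Scatter each new token's KV entry into its flat physical slot."""
    for slot, token in zip(slot_mapping, new_tokens):
        cache[slot // BLOCK_SIZE][slot % BLOCK_SIZE] = token


def read_kv(cache, block_table, seq_len):
    """Gather a sequence's KV entries in logical order via its block table."""
    out = []
    for pos in range(seq_len):
        physical_block = block_table[pos // BLOCK_SIZE]
        out.append(cache[physical_block][pos % BLOCK_SIZE])
    return out


tokens = [f"kv{pos}" for pos in range(6)]
results = []
for block_table in ([0, 1], [5, 2]):  # two different physical placements
    cache = [[None] * BLOCK_SIZE for _ in range(8)]
    write_kv(cache, slots_for(block_table, len(tokens)), tokens)
    results.append(read_kv(cache, block_table, len(tokens)))
```

Both entries of `results` come back as the original token order, because the reader only ever addresses the cache through the block table.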

Build (your lab)

  8. In lab-01, explain why scattering the logical blocks to arbitrary physical ids doesn't change the output. What does that prove about the kernel's contract?
  9. Extend paged_online_attention to multiple query heads (loop or vectorize). Verify against a multi-head dense reference.
  10. Add a causal mask variant (a prefill query at position p attends only to tokens ≤ p).
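
For the multi-head and causal exercises you need a dense reference to compare against. The lab's own helpers are not shown on this page, so the following is a hedged sketch with hypothetical names (`dense_attention`, `dense_reference`) and a hypothetical layout: nested lists indexed by head, then token, then feature. It assumes one KV head per query head (no grouped-query sharing) and uses only the standard library.

```python
import math


def dense_attention(q, keys, values):
    """softmax(q . k / sqrt(d)) weighted sum of values for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


def dense_reference(queries, keys, values, query_positions=None):
    """Multi-head dense attention.

    queries[h][i], keys[h][t], values[h][t] are feature vectors.
    If query_positions is given, query i attends only to tokens
    t <= query_positions[i]; otherwise every query sees every token.
    """
    out = []
    for h in range(len(queries)):
        head_out = []
        for i, q in enumerate(queries[h]):
            if query_positions is None:
                limit = len(keys[h])
            else:
                limit = query_positions[i] + 1
            head_out.append(
                dense_attention(q, keys[h][:limit], values[h][:limit])
            )
        out.append(head_out)
    return out
```

Two quick sanity checks for your paged version: a causal query at position 0 must return exactly the first value vector, and the causal query at the last position must match the unmasked result.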

Design (staff-level)

  11. At high concurrency with many short decode requests, FlashInfer often beats FlashAttention. Hypothesize why, and design a benchmark (Phase 18) to confirm it for your workload.
  12. You're bringing up a new model with a novel attention mechanism (e.g. a different KV compression scheme). What parts of the backend system must you implement, and what can you reuse?
  13. A user reports correct output with VLLM_ATTENTION_BACKEND=TRITON_ATTN but garbage with the default. Outline your debugging path and what it implies about the default kernel.
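
For the benchmark design question, a starting point might look like the timing harness below. It is a generic sketch, not the Phase 18 tooling: `benchmark` is a hypothetical helper, and the workload you pass in (one backend, one batch shape) is up to you. It reports median and p95 latency after a warm-up, since tail latency often matters more than the mean at high concurrency.

```python
import math
import statistics
import time


def benchmark(fn, warmup=3, iters=20):
    """Time a zero-argument callable; return median and p95 latency in ms.

    For GPU work, fn must block until the device has finished (for example
    by calling torch.cuda.synchronize() inside it), otherwise only the
    kernel launch is timed.
    """
    for _ in range(warmup):
        fn()  # let caches, allocators, and any lazy compilation settle
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    # Nearest-rank p95 over the sorted samples.
    p95_index = min(len(samples) - 1, math.ceil(0.95 * len(samples)) - 1)
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }
```

Run it once per backend with the same batch of requests, and sweep the number of concurrent sequences and the context length so the comparison covers the regime your hypothesis is about.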

Self-grading

Exercises 4–7 (Core) and 11–13 (Design), numbered consecutively from the first Warm-up question, are interview-grade. Could you draw the layer→impl→kernel path and name the files? If not, re-read 01-deep-dive.md.