Glossary

Every term used in this course, defined once, in plain language. When a phase uses a term, it links here. If you ever feel lost, this is the place to land.

Ordering is roughly conceptual, grouped by theme, not alphabetical — read top to bottom the first time, then use Ctrl-F.

The model and the forward pass

  • Token — a chunk of text (often ~¾ of a word) that the model reads and writes. Text is turned into a list of integer token IDs by a tokenizer.
  • Embedding — the vector the model uses to represent a token internally.
  • Forward pass — running the model once over some tokens to produce, for the last position, the logits: raw scores that a softmax turns into a probability distribution over the next token.
  • Logits — the raw, pre-softmax scores over the whole vocabulary for "what comes next".
  • Autoregressive generation — generate one token, append it to the input, run the forward pass again, repeat. LLMs generate text one token at a time this way.
  • Decoder-only model — the architecture of GPT/Llama/Qwen: a stack of transformer blocks in which each token attends only to itself and earlier tokens (causal attention). Most current LLMs use it.
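The generate-append-repeat loop described above can be sketched in a few lines. This is a minimal illustration, not how any real model works: `toy_forward`, the four-word `VOCAB`, and the rule it uses to produce logits are all hypothetical stand-ins for a real forward pass.

```python
import math

VOCAB = ["<eos>", "the", "cat", "sat"]  # hypothetical 4-token vocabulary


def toy_forward(token_ids):
    """Stand-in for a real forward pass: return next-token logits.

    Toy rule (an assumption for this sketch): strongly prefer the token
    whose ID is one more than the last input token, wrapping to <eos>.
    """
    nxt = (token_ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == nxt else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate_greedy(prompt_ids, max_new_tokens=8, eos_id=0):
    """Autoregressive loop: forward pass, pick a token, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_forward(ids))
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        if next_id == eos_id:
            break
        ids.append(next_id)
    return ids


print([VOCAB[i] for i in generate_greedy([1])])
```

Note that this loop re-reads the whole `ids` list on every step. A real engine avoids recomputing the earlier tokens by keeping their K and V in the KV cache, which is the subject of the next section.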

Attention and the KV cache

  • Attention — the operation where each token "looks at" previous tokens and mixes in their information. For each token the model computes a Query (Q), compares it against the Key (K) of every earlier token to get attention weights, and uses those weights to take a weighted sum of the corresponding Values (V).
  • KV cache — because earlier tokens don't change, their K and V vectors can be computed once and cached. The KV cache is the stored K and V for every token generated so far. It is the single biggest consumer of GPU memory during serving. This course is largely the story of managing it well.
  • Prefill — the first forward pass over the whole prompt at once. Compute-bound (lots of tokens, one pass). Fills the KV cache for the prompt.
  • Decode — each subsequent single-token forward pass. Memory-bandwidth-bound (one token, must read all weights + the whole KV cache). This is where most serving time goes.
  • TTFT (time to first token) — latency from request arrival to the first output token. Dominated by prefill.
  • ITL / TPOT (inter-token latency / time per output token) — time between successive output tokens. Dominated by decode.
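To get a feel for why the KV cache dominates memory, it helps to work out its size per token. The sketch below uses the standard accounting (one K and one V vector per layer per KV head); the model dimensions are illustrative assumptions, not taken from this course.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers."""
    # The factor 2 covers one K vector and one V vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Illustrative dimensions (assumed): 32 layers, 8 KV heads,
# head_dim 128, 16-bit values (2 bytes each).
per_token = kv_bytes_per_token(32, 8, 128, 2)
per_request = per_token * 4096  # one request holding 4096 tokens

print(per_token // 1024, "KiB per token")
print(per_request // 1024**2, "MiB for a 4096-token request")
```

Under these assumptions a single long request holds hundreds of MiB of KV, and every decode step has to read all of it, which is why decode is limited by memory bandwidth rather than compute.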

PagedAttention & memory (Phase 2)

  • PagedAttention — vLLM's core idea: store the KV cache in fixed-size blocks (like OS memory pages) instead of one big contiguous buffer per request. Eliminates fragmentation and enables sharing.
  • Block (KV block) — a fixed-size slot holding the KV of block_size tokens (commonly 16). The unit of KV allocation. Code: KVCacheBlock in kv_cache_utils.py.
  • Block size — number of tokens whose KV fits in one block (e.g. 16).
  • Block table — per-request mapping from logical block index → physical block ID. Lets a request's KV be scattered across non-contiguous physical blocks.
  • Block pool — the global pool of all physical blocks, with a free list and a prefix-cache index. Code: BlockPool in block_pool.py.
  • Fragmentation — wasted memory from reserving contiguous space you don't fully use. PagedAttention's reason for existing.
  • Prefix caching — if two requests share a prefix (same leading tokens), they can share the same physical KV blocks. Found by hashing block contents. Phase 3.
  • Copy-on-write (CoW) — when a shared block must diverge (one request writes new tokens), copy it so the other request's view is unaffected.
  • Reference count (ref_cnt) — how many requests currently use a block. A block is free only when ref_cnt == 0.
  • Eviction — reclaiming a cached (but currently unused) block for a new allocation. vLLM uses an LRU-ish free queue (FreeKVCacheBlockQueue).
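The pieces above (blocks, block tables, reference counts, prefix caching, an eviction queue) fit together roughly as in the toy allocator below. It is a sketch under simplifying assumptions and does not mirror vLLM's `BlockPool`: the class `ToyBlockPool`, its methods, and the use of Python's built-in `hash` are all hypothetical, and copy-on-write is left out.

```python
from collections import OrderedDict

BLOCK_SIZE = 4  # tokens per block; kept small for readability


class ToyBlockPool:
    """Hypothetical paged KV allocator with prefix sharing."""

    def __init__(self, num_blocks):
        self.ref_cnt = [0] * num_blocks
        # Free queue in eviction order: the oldest freed block is reused first.
        self.free_queue = OrderedDict((b, None) for b in range(num_blocks))
        self.hash_to_block = {}  # prefix-cache index
        self.block_to_hash = {}

    def _take(self, block):
        self.free_queue.pop(block, None)  # a cached block may still sit here
        self.ref_cnt[block] += 1

    def allocate(self, token_ids):
        """Return a block table: logical block index -> physical block ID."""
        table, prev_hash = [], None
        for start in range(0, len(token_ids), BLOCK_SIZE):
            chunk = tuple(token_ids[start:start + BLOCK_SIZE])
            full = len(chunk) == BLOCK_SIZE
            # Chain hashes so a block's identity covers the whole prefix.
            h = hash((prev_hash, chunk)) if full else None
            if full and h in self.hash_to_block:
                block = self.hash_to_block[h]  # prefix-cache hit: share it
            else:
                if not self.free_queue:
                    raise MemoryError("no free KV blocks")
                block, _ = self.free_queue.popitem(last=False)  # evict oldest
                stale = self.block_to_hash.pop(block, None)
                if stale is not None:
                    del self.hash_to_block[stale]
                if full:  # only full blocks are cacheable
                    self.hash_to_block[h] = block
                    self.block_to_hash[block] = h
            self._take(block)
            table.append(block)
            prev_hash = h
        return table

    def release(self, table):
        """Drop one reference per block; blocks at ref_cnt 0 become evictable."""
        for block in reversed(table):
            self.ref_cnt[block] -= 1
            if self.ref_cnt[block] == 0:
                self.free_queue[block] = None


pool = ToyBlockPool(num_blocks=4)
a = pool.allocate([1, 2, 3, 4, 5, 6, 7, 8, 9])
b = pool.allocate([1, 2, 3, 4, 5, 6, 7, 8, 10, 11])
print(a, b)  # the first two physical blocks are shared
```

The two requests share their first eight tokens, so their block tables point at the same two physical blocks and only the partial tail block differs. Released blocks keep their hash entry until they are actually reused, which is what lets a later request with the same prefix hit the cache.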

Scheduling & batching (Phase 3)

  • Batching — running many requests through the model together to use the GPU efficiently.
  • Static batching — fix a batch, run it to completion. Wasteful: fast requests wait for slow ones.
  • Continuous batching — re-decide the batch every iteration (every single token step). Finished requests leave, new ones join immediately. vLLM's default.
  • Scheduler — the component that, each step, picks which requests run and how many tokens each gets. Code: Scheduler in v1/core/sched/scheduler.py.
  • Chunked prefill — split a long prompt's prefill across several steps so it doesn't starve ongoing decodes. Controlled by a token budget.
  • Token budget — max_num_batched_tokens: the cap on total tokens scheduled per step.
  • Preemption — when memory runs out, evict a running request's KV and put it back in the queue (to be recomputed later). The safety valve.
  • Running / Waiting queues — requests currently decoding vs. requests waiting to start.
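Continuous batching, chunked prefill, and the token budget can be seen working together in a small simulation. This is a toy sketch, not vLLM's `Scheduler`: the `Req` class and the functions below are hypothetical, and memory limits and preemption are ignored entirely.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    name: str
    prompt_len: int
    max_new_tokens: int
    computed: int = 0   # prompt tokens already prefilled
    generated: int = 0  # output tokens produced so far

    @property
    def finished(self):
        return self.generated >= self.max_new_tokens


def schedule_step(running, waiting, token_budget):
    """Decide how many tokens each request gets this step."""
    plan, budget = {}, token_budget
    for req in running:  # requests already running are served first
        if budget == 0:
            break
        remaining_prompt = req.prompt_len - req.computed
        want = remaining_prompt if remaining_prompt > 0 else 1  # decode = 1 token
        plan[req.name] = min(want, budget)
        budget -= plan[req.name]
    while waiting and budget > 0:  # admit new requests with leftover budget
        req = waiting.popleft()
        running.append(req)
        plan[req.name] = min(req.prompt_len, budget)  # possibly a partial chunk
        budget -= plan[req.name]
    return plan


def apply_step(running, plan):
    """Pretend to run the model and update each request's progress."""
    for req in list(running):
        n = plan.get(req.name, 0)
        if n == 0:
            continue
        if req.computed < req.prompt_len:
            req.computed += n
            if req.computed == req.prompt_len:
                req.generated += 1  # the last prefill chunk yields the first token
        else:
            req.generated += 1
        if req.finished:
            running.remove(req)  # leaves the batch immediately


def simulate(requests, token_budget):
    waiting, running, plans = deque(requests), [], []
    while waiting or running:
        plan = schedule_step(running, waiting, token_budget)
        apply_step(running, plan)
        plans.append(plan)
    return plans


for step, plan in enumerate(simulate([Req("A", 10, 2), Req("B", 3, 3)], 8)):
    print(step, plan)
```

With a budget of 8, request A's 10-token prompt is split over two steps, B joins as soon as there is spare budget, and A leaves the batch the moment it finishes while B keeps decoding.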

Kernels & execution (Phases 4–7)

  • Kernel — a function that runs on the GPU. "Attention kernel", "GEMM kernel", etc.
  • GEMM — General Matrix-Matrix Multiply. The workhorse op (every linear layer). Libraries: cuBLAS, CUTLASS.
  • FlashAttention — a fused, memory-efficient attention kernel that never materializes the full attention matrix.
  • FlashInfer / FlashMLA / TRTLLM-GEN / Triton — other attention/GEMM kernel providers vLLM can dispatch to.
  • Attention backend — vLLM's pluggable wrapper choosing which attention kernel to run.
  • CUDA graph — a recorded sequence of GPU operations replayed with one launch, removing per-op CPU launch overhead. Piecewise = capture parts; full = capture the whole model forward.
  • torch.compile — PyTorch's compiler; vLLM uses it to fuse ops and generate kernels, with custom graph passes.
  • MoE (Mixture of Experts) — a layer with many "expert" sub-networks; each token is routed to a few experts. Big models, low active compute. (Mixtral, DeepSeek-V3.)
  • Quantization — storing weights/activations in fewer bits (FP8, INT4, …) to save memory and bandwidth. Formats: FP8, MXFP4, NVFP4, INT8/INT4, GPTQ, AWQ, GGUF, compressed-tensors.
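As a concrete picture of what "fewer bits" means, here is a minimal sketch of symmetric per-tensor INT8 quantization using an absmax scale. It is one of the simplest schemes and is shown only for intuition; the formats listed above (FP8, GPTQ, AWQ, and so on) each differ in their details, and the function names here are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print(restored)
```

Each weight now needs one byte instead of two or four, at the cost of a rounding error of at most half a scale step per value.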

Decoding strategies (Phases 8–9)

  • Greedy decoding — always pick the highest-probability token.
  • Temperature / top-k / top-p / min-p — knobs that shape the sampling distribution.
  • Parallel sampling (n) — produce n independent completions for one prompt (sharing the prompt's KV via prefix caching).
  • Beam search — keep the top-N partial sequences by cumulative probability.
  • Logits processor — a hook that edits the logits before sampling (penalties, bans, grammar masks).
  • Speculative decoding — a cheap draft model/heuristic proposes several tokens; the big model verifies them in one pass, accepting a prefix. Speeds up decode.
  • EAGLE / Medusa / n-gram / suffix / DFlash — specific speculative-decoding methods.
  • Acceptance rate — fraction of drafted tokens the target model accepts. The metric that decides whether spec decode is a win.
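The "verify in one pass, accept a prefix" step of speculative decoding can be sketched for the greedy case as follows. This is a simplified illustration: `verify_greedy` is a hypothetical helper, and it assumes the target model's greedy choice at every drafted position (plus one extra position) is already available from a single forward pass. Sampling-based variants use a probabilistic accept/reject rule instead of exact matching.

```python
def verify_greedy(draft_tokens, target_argmax):
    """Return the tokens emitted by one speculative step.

    draft_tokens:  k tokens proposed by the draft model.
    target_argmax: k + 1 tokens, the target model's greedy choice at each
                   drafted position plus one position beyond the draft.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_argmax[i]:
            break  # first disagreement ends the accepted prefix
        emitted.append(tok)
    # The target's own token at the stopping position is always valid,
    # so every step emits at least one token.
    emitted.append(target_argmax[len(emitted)])
    return emitted


out = verify_greedy([5, 7, 9], [5, 7, 2, 4])
accepted = len(out) - 1  # drafted tokens that survived verification
print(out, "acceptance rate:", accepted / 3)
```

Even when nothing is accepted the step still yields one token, so the output matches plain greedy decoding. The speedup depends on how often the draft is right, which is exactly what the acceptance rate measures.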

Distributed & serving (Phases 10, 15–16)

  • Tensor parallelism (TP) — split each layer's weights across GPUs; every GPU does part of every layer; results all-reduced.
  • Pipeline parallelism (PP) — split the layers across GPUs; activations pass GPU→GPU.
  • Data parallelism (DP) — replicate the model; split requests across replicas.
  • Expert parallelism (EP) — split MoE experts across GPUs.
  • Context parallelism (CP) — split a single sequence's context across GPUs.
  • Collective op — multi-GPU communication primitive (all-reduce, all-gather, …) via NCCL.
  • Disaggregated serving — run prefill and decode on different machines, shipping the KV cache between them, so each can be scaled and tuned independently.
  • KV connector — the component that transfers KV blocks between engines (for P/D disagg or offloading). Code under vllm/distributed/kv_transfer/.
  • OpenAI-compatible server — vLLM's HTTP server speaking the OpenAI API (plus Anthropic Messages API and gRPC).
  • Tool calling / reasoning parser — components that extract structured tool calls or chain-of-thought from model output.
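The tensor-parallel idea (each GPU holds a slice of a layer's weights, and partial results are all-reduced) can be imitated on one machine with plain lists. The sketch below shards the input dimension of a single matrix-vector product across simulated ranks; the function names are hypothetical and the "all-reduce" is just an elementwise sum, standing in for what NCCL would do across real GPUs.

```python
def matvec(W, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def row_parallel_matvec(W, x, world_size):
    """Compute W @ x with the input dimension sharded across ranks."""
    shard = len(x) // world_size  # assumes len(x) is divisible by world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        W_shard = [row[lo:hi] for row in W]  # this rank's slice of the weights
        partials.append(matvec(W_shard, x[lo:hi]))  # full-size partial output
    # Simulated all-reduce: sum the partial outputs elementwise.
    return [sum(vals) for vals in zip(*partials)]


W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1, 1, 1]
print(matvec(W, x), row_parallel_matvec(W, x, world_size=2))
```

Each rank stores only half of `W` and does half of the multiply-adds, but the result is only correct after the sum, which is why communication cost matters so much for this form of parallelism.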

Adaptation & structure (Phases 11–12)

  • LoRA (Low-Rank Adaptation) — small trainable matrices added to a frozen base model to specialize it. vLLM serves many LoRAs in one batch.
  • Punica / SGMV — batched kernels that apply different LoRAs to different requests in one GPU call.
  • Structured output / guided decoding — forcing the model's output to match a grammar, regex, or JSON schema by masking invalid tokens each step. Engines: xgrammar, guidance.
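The "small trainable matrices added to a frozen base model" in LoRA amount to a low-rank correction on top of each adapted linear layer. A minimal sketch, assuming the common formulation y = Wx + (alpha / r) · B(Ax) with A of shape r × d and B of shape d × r; the function names and the tiny matrices are illustrative only.

```python
def matvec(W, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def lora_forward(W, A, B, x, alpha):
    """Frozen base layer plus a scaled low-rank update."""
    r = len(A)                       # LoRA rank = number of rows of A
    base = matvec(W, x)              # frozen base weights
    delta = matvec(B, matvec(A, x))  # project down to r dims, then back up
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]       # 3 x 3 base weights (identity, for readability)
A = [[1, 0, 1]]       # r x d with r = 1
B = [[1], [0], [2]]   # d x r
x = [1, 2, 3]

print(lora_forward(W, A, B, x, alpha=2))
```

Because the base term `matvec(W, x)` is the same for every adapter, a server can share it across a batch and apply a different `A`, `B` pair per request, which is the job of the batched LoRA kernels mentioned above.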

vLLM internals & process model

  • V1 engine — vLLM's current core architecture (the vllm/v1/ tree). V0 is legacy. This course teaches V1.
  • LLM — the offline (batch) Python entry point: LLM(model=...).generate(prompts).
  • AsyncLLM — the async engine powering the API server.
  • EngineCore — the inner loop: add_request, then step() (schedule → execute → output).
  • Worker / Executor — the executor owns workers; each worker drives one GPU's model.
  • Model runner — turns a SchedulerOutput into actual tensor inputs and runs the model.
  • SamplingParams — per-request decoding config (temperature, max_tokens, n, …).
  • RequestOutput — what the engine returns: generated text/tokens for a request.