Glossary
Every term used in this course, defined once, in plain language. When a phase uses a term, it links here. If you ever feel lost, this is the place to land.
Ordering is roughly conceptual, grouped by theme, not alphabetical — read top to bottom the first time, then use Ctrl-F.
Contents
- The model and the forward pass
- Attention and the KV cache
- PagedAttention & memory (Phase 2)
- Scheduling & batching (Phase 3)
- Kernels & execution (Phases 4–7)
- Decoding strategies (Phases 8–9)
- Distributed & serving (Phases 10, 15–16)
- Adaptation & structure (Phases 11–12)
- vLLM internals & process model
The model and the forward pass
- Token — a chunk of text (often ~¾ of a word) that the model reads and writes. Text is turned into a list of integer token IDs by a tokenizer.
- Embedding — the vector the model uses to represent a token internally.
- Forward pass — running the model once over some tokens to produce, for the last position, raw scores over the next token (the logits). A softmax turns those scores into a probability distribution.
- Logits — the raw, pre-softmax scores over the whole vocabulary for "what comes next".
- Autoregressive generation — generate one token, append it to the input, run the forward pass again, repeat. LLMs generate text one token at a time this way.
- Decoder-only model — the architecture of GPT/Llama/Qwen: a stack of transformer blocks that only attend to earlier tokens (causal). Most LLMs.
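The terms above fit together as one loop. Here is a minimal sketch, assuming a made-up five-token vocabulary and a stand-in `toy_forward` function in place of a real model (both are hypothetical and exist only for illustration):

```python
import math

VOCAB_SIZE = 5  # hypothetical tiny vocabulary: token IDs 0..4


def toy_forward(token_ids):
    """Stand-in for a forward pass: return logits for the next token.

    A real model runs transformer blocks here. This toy rule simply
    favours the token ID that follows the last one.
    """
    favoured = (token_ids[-1] + 1) % VOCAB_SIZE
    return [2.0 if t == favoured else 0.0 for t in range(VOCAB_SIZE)]


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_new_tokens):
    """Autoregressive generation: predict one token, append it, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_forward(ids)
        probs = softmax(logits)
        next_id = max(range(VOCAB_SIZE), key=lambda t: probs[t])  # greedy pick
        ids.append(next_id)
    return ids


generated = generate([0, 1], max_new_tokens=4)
print(generated)
```

Each loop iteration is one forward pass, which is why generation cost grows with the number of output tokens.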
Attention and the KV cache
- Attention — the operation where each token "looks at" previous tokens and mixes in their information. For each token the model computes a Query (Q), compares it against the Key (K) of that token and every earlier token to get weights, and uses those weights to average the corresponding Values (V).
- KV cache — because earlier tokens don't change, their K and V vectors can be computed once and cached. The KV cache is the stored K and V for every token processed so far, prompt and generated. After the model weights, it is usually the biggest consumer of GPU memory during serving, and unlike the weights it grows with every token and every concurrent request. This course is largely the story of managing it well.
- Prefill — the first forward pass over the whole prompt at once. Compute-bound (lots of tokens, one pass). Fills the KV cache for the prompt.
- Decode — each subsequent single-token forward pass. Memory-bandwidth-bound (one token, must read all weights + the whole KV cache). This is where most serving time goes.
- TTFT (time to first token) — latency from request arrival to the first output token. Dominated by prefill.
- ITL / TPOT (inter-token latency / time per output token) — time between successive output tokens. Dominated by decode.
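To see why the KV cache dominates memory planning, it helps to do the arithmetic once. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values); the numbers are illustrative and not taken from any particular model.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache needed for one token.

    The factor 2 is one K vector plus one V vector, stored for every
    layer and every KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Hypothetical model shape, chosen only to make the arithmetic concrete.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)
per_request = per_token * 4096  # one request holding 4096 tokens

print(per_token // 1024, "KiB per token")
print(per_request // 2**20, "MiB for a 4096-token request")
```

Under these assumptions a single long request holds hundreds of MiB of cache, and decode has to read all of it on every step.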
PagedAttention & memory (Phase 2)
- PagedAttention — vLLM's core idea: store the KV cache in fixed-size blocks (like OS memory pages) instead of one big contiguous buffer per request. Nearly eliminates fragmentation (only a request's last, partially filled block wastes space) and enables sharing.
- Block (KV block) — a fixed-size slot holding the KV of block_size tokens (commonly 16). The unit of KV allocation. Code: KVCacheBlock in kv_cache_utils.py.
- Block size — number of tokens whose KV fits in one block (e.g. 16).
- Block table — per-request mapping from logical block index → physical block ID. Lets a request's KV be scattered across non-contiguous physical blocks.
- Block pool — the global pool of all physical blocks, with a free list and a prefix-cache index. Code: BlockPool in block_pool.py.
- Fragmentation — wasted memory from reserving contiguous space you don't fully use. PagedAttention's reason for existing.
- Prefix caching — if two requests share a prefix (same leading tokens), they can share the same physical KV blocks. Found by hashing block contents. Phase 3.
- Copy-on-write (CoW) — when a shared block must diverge (one request writes new tokens), copy it so the other request's view is unaffected.
- Reference count (ref_cnt) — how many requests currently use a block. A block is free only when ref_cnt == 0.
- Eviction — reclaiming a cached (but currently unused) block for a new allocation. vLLM uses an LRU-ish free queue (FreeKVCacheBlockQueue).
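The block table, block pool, and reference count can be sketched in a few lines. This is a toy model under stated assumptions: `ToyBlockPool`, `blocks_needed`, and `locate` are hypothetical names invented here, not vLLM's classes, and the sketch leaves out prefix hashing and eviction order.

```python
from collections import deque

BLOCK_SIZE = 16  # tokens per block (the common value mentioned above)


class ToyBlockPool:
    """Toy pool of physical blocks with a free list and reference counts."""

    def __init__(self, num_blocks):
        self.free = deque(range(num_blocks))
        self.ref_cnt = [0] * num_blocks

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block_id = self.free.popleft()
        self.ref_cnt[block_id] = 1
        return block_id

    def share(self, block_id):
        """A second request reuses an existing block (e.g. a shared prefix)."""
        self.ref_cnt[block_id] += 1
        return block_id

    def release(self, block_id):
        self.ref_cnt[block_id] -= 1
        if self.ref_cnt[block_id] == 0:  # free only when nobody uses it
            self.free.append(block_id)


def blocks_needed(num_tokens):
    """Ceiling division: a partially filled last block still costs a block."""
    return -(-num_tokens // BLOCK_SIZE)


def locate(block_table, token_index):
    """Map a token position to (physical block ID, offset inside the block)."""
    return block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE


pool = ToyBlockPool(num_blocks=8)
table_a = [pool.allocate() for _ in range(blocks_needed(40))]  # 40 tokens
table_b = [pool.share(table_a[0]), pool.allocate()]            # shares block 0
print(table_a, table_b, locate(table_a, 37))
```

Request B's block table points at the same physical block as request A for the shared prefix, which is what the reference count keeps track of.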
Scheduling & batching (Phase 3)
- Batching — running many requests through the model together to use the GPU efficiently.
- Static batching — fix a batch, run it to completion. Wasteful: fast requests wait for slow ones.
- Continuous batching — re-decide the batch every iteration (every single token step). Finished requests leave, new ones join immediately. vLLM's default.
- Scheduler — the component that, each step, picks which requests run and how many tokens each gets. Code: Scheduler in v1/core/sched/scheduler.py.
- Chunked prefill — split a long prompt's prefill across several steps so it doesn't starve ongoing decodes. Controlled by a token budget.
- Token budget — max_num_batched_tokens: the cap on total tokens scheduled per step.
- Preemption — when memory runs out, evict a running request's KV and put it back in the queue (to be recomputed later). The safety valve.
- Running / Waiting queues — requests currently decoding vs. requests waiting to start.
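The interaction between the token budget and chunked prefill is easier to see with numbers. The following is a simplified sketch, not vLLM's scheduling policy: `schedule_step` and the request dictionaries are hypothetical, decodes are always served first, and memory limits and preemption are ignored.

```python
def schedule_step(requests, token_budget):
    """Decide how many tokens each request gets in one step.

    requests: list of dicts with "id" and "remaining_prefill" (0 = decoding).
    Returns {request_id: num_tokens}. Mutates "remaining_prefill".
    """
    scheduled = {}
    budget = token_budget

    # Decoding requests need exactly one token each; serve them first
    # so a long prompt cannot starve them.
    for req in requests:
        if req["remaining_prefill"] == 0 and budget > 0:
            scheduled[req["id"]] = 1
            budget -= 1

    # Spend what is left of the budget on prefill, chunking if needed.
    for req in requests:
        if req["remaining_prefill"] > 0 and budget > 0:
            chunk = min(req["remaining_prefill"], budget)
            scheduled[req["id"]] = chunk
            req["remaining_prefill"] -= chunk
            budget -= chunk

    return scheduled


requests = [
    {"id": "A", "remaining_prefill": 0},    # already decoding
    {"id": "B", "remaining_prefill": 0},    # already decoding
    {"id": "C", "remaining_prefill": 100},  # new request, 100-token prompt
]
steps = [schedule_step(requests, token_budget=64) for _ in range(3)]
for step in steps:
    print(step)
```

With a budget of 64, request C's 100-token prompt is split across two steps while A and B keep producing one token per step, and from the third step on C decodes alongside them.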
Kernels & execution (Phases 4–7)
- Kernel — a function that runs on the GPU. "Attention kernel", "GEMM kernel", etc.
- GEMM — General Matrix-Matrix Multiply. The workhorse op (every linear layer). Libraries: cuBLAS, CUTLASS.
- FlashAttention — a fused, memory-efficient attention kernel that never materializes the full attention matrix.
- FlashInfer / FlashMLA / TRTLLM-GEN / Triton — other attention/GEMM kernel providers vLLM can dispatch to.
- Attention backend — vLLM's pluggable wrapper choosing which attention kernel to run.
- CUDA graph — a recorded sequence of GPU operations replayed with one launch, removing per-op CPU launch overhead. Piecewise = capture parts; full = capture the whole model forward.
- torch.compile — PyTorch's compiler; vLLM uses it to fuse ops and generate kernels, with custom graph passes.
- MoE (Mixture of Experts) — a layer with many "expert" sub-networks; each token is routed to a few experts. Big models, low active compute. (Mixtral, DeepSeek-V3.)
- Quantization — storing weights/activations in fewer bits (FP8, INT4, …) to save memory and bandwidth. Formats: FP8, MXFP4, NVFP4, INT8/INT4, GPTQ, AWQ, GGUF, compressed-tensors.
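Quantization is the entry above that benefits most from a worked example. The sketch below shows one simple scheme, symmetric per-tensor INT8 with a single scale, chosen here for illustration; the listed formats differ in how they pick scales and group values. It assumes at least one weight is non-zero.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: one scale maps the largest |w| to 127."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights from the integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(max_error <= scale / 2 + 1e-12)  # rounding error is at most half a step
```

Each weight now takes one byte instead of two or four, at the cost of a rounding error bounded by half the scale.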
Decoding strategies (Phases 8–9)
- Greedy decoding — always pick the highest-probability token.
- Temperature / top-k / top-p / min-p — knobs that shape the sampling distribution.
- Parallel sampling (n) — produce n independent completions for one prompt (sharing the prompt's KV via prefix caching).
- Beam search — keep the top-N partial sequences by cumulative probability.
- Logits processor — a hook that edits the logits before sampling (penalties, bans, grammar masks).
- Speculative decoding — a cheap draft model/heuristic proposes several tokens; the big model verifies them in one pass, accepting a prefix. Speeds up decode.
- EAGLE / Medusa / n-gram / suffix / DFlash — specific speculative-decoding methods.
- Acceptance rate — fraction of drafted tokens the target model accepts. The metric that decides whether spec decode is a win.
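The sampling knobs listed above all act on the same distribution, so they can be shown in one function. This is a minimal sketch with a hypothetical helper, `shape_distribution`, that applies temperature, then top-k, then top-p; real samplers may order or combine the filters differently, and min-p is left out.

```python
import math


def shape_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p.

    temperature must be > 0; greedy decoding is the limit as it goes to 0.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate tokens, most likely first.
    order = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k most likely tokens

    # top-p: keep the smallest leading set whose mass reaches top_p.
    kept, cumulative = [], 0.0
    for t in order:
        kept.append(t)
        cumulative += probs[t]
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[t] for t in kept)
    return {t: probs[t] / kept_mass for t in kept}  # renormalize


logits = [2.0, 1.0, 0.0, -1.0]
print(shape_distribution(logits, top_k=2))
print(shape_distribution(logits, top_p=0.5))
print(shape_distribution(logits, temperature=10.0))
```

Sampling then means drawing one token from the returned dictionary, for example with `random.choices`. A higher temperature flattens the distribution; top-k and top-p cut off its tail.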
Distributed & serving (Phases 10, 15–16)
- Tensor parallelism (TP) — split each layer's weights across GPUs; every GPU does part of every layer; results all-reduced.
- Pipeline parallelism (PP) — split the layers across GPUs; activations pass GPU→GPU.
- Data parallelism (DP) — replicate the model; split requests across replicas.
- Expert parallelism (EP) — split MoE experts across GPUs.
- Context parallelism (CP) — split a single sequence's context across GPUs.
- Collective op — multi-GPU communication primitive (all-reduce, all-gather, …) via NCCL.
- Disaggregated serving — run prefill and decode on different machines, shipping the KV cache between them, so each can be scaled and tuned independently.
- KV connector — the component that transfers KV blocks between engines (for P/D disagg or offloading). Code under vllm/distributed/kv_transfer/.
- OpenAI-compatible server — vLLM's HTTP server speaking the OpenAI API (plus Anthropic Messages API and gRPC).
- Tool calling / reasoning parser — components that extract structured tool calls or chain-of-thought from model output.
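Tensor parallelism and the all-reduce that follows it can be checked on a tiny matrix. The sketch below simulates two "GPUs" as plain Python lists and splits the weight matrix along its input dimension; this is one of several possible splits, and the sum over partial results stands in for the all-reduce that NCCL would perform.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]   # one linear layer: 4 inputs, 2 outputs
x = [1, 0, -1, 2]

# Each simulated GPU holds half of the input columns and the matching
# half of the activation vector.
shards = [
    ([row[:2] for row in W], x[:2]),  # "GPU 0"
    ([row[2:] for row in W], x[2:]),  # "GPU 1"
]

# Every GPU computes a partial result for every output element.
partials = [matvec(W_shard, x_shard) for W_shard, x_shard in shards]

# All-reduce (sum): afterwards every GPU would hold the full result.
all_reduced = [sum(values) for values in zip(*partials)]

print(partials)
print(all_reduced == matvec(W, x))
```

The partial results are wrong on their own; only after the sum do they equal the single-GPU answer, which is why this split pays a communication step per layer.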
Adaptation & structure (Phases 11–12)
- LoRA (Low-Rank Adaptation) — small trainable matrices added to a frozen base model to specialize it. vLLM serves many LoRAs in one batch.
- Punica / SGMV — batched kernels that apply different LoRAs to different requests in one GPU call.
- Structured output / guided decoding — forcing the model's output to match a grammar, regex, or JSON schema by masking invalid tokens each step. Engines: xgrammar, guidance.
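LoRA adds a low-rank correction to a frozen layer: the output becomes W·x + (alpha / r)·B·(A·x), where r is the rank. The sketch below applies a different adapter to each request in a batch with a plain Python loop. It is the naive version of what batched kernels such as Punica/SGMV do in one GPU call, and the adapter name and matrices are made up for illustration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def lora_forward(W, x, adapter=None):
    """Base layer output, plus the low-rank update if an adapter is given.

    adapter: (A, B, scale) with A of shape r x in, B of shape out x r,
    and scale = alpha / r.
    """
    y = matvec(W, x)
    if adapter is None:
        return y
    A, B, scale = adapter
    delta = matvec(B, matvec(A, x))  # two thin matmuls instead of one big one
    return [yi + scale * di for yi, di in zip(y, delta)]


W = [[1, 0],
     [0, 1]]  # frozen base weights (identity, to keep the numbers readable)

# Hypothetical rank-1 adapter: alpha = 1, r = 1, so scale = 1.0.
adapters = {"style-a": ([[1, 1]], [[1], [2]], 1.0)}

batch = [
    ("req-1", [2, 4], "style-a"),  # uses the adapter
    ("req-2", [2, 4], None),       # base model only
]
outputs = {
    req_id: lora_forward(W, x, adapters.get(name))
    for req_id, x, name in batch
}
print(outputs)
```

Both requests share the same base weights, so only the small A and B matrices differ per request.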
vLLM internals & process model
- V1 engine — vLLM's current core architecture (the vllm/v1/ tree). V0 is legacy. This course teaches V1.
- LLM — the offline (batch) Python entry point: LLM(model=...).generate(prompts).
- AsyncLLM — the async engine powering the API server.
- EngineCore — the inner loop: add_request → step() (schedule → execute → output).
- Worker / Executor — the executor owns workers; each worker drives one GPU's model.
- Model runner — turns a SchedulerOutput into actual tensor inputs and runs the model.
- SamplingParams — per-request decoding config (temperature, max_tokens, n, …).
- RequestOutput — what the engine returns: generated text/tokens for a request.
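The add_request → step() loop described above can be imitated in a few lines. This is a toy sketch only: `ToyEngineCore` is a hypothetical class written for this glossary, its "model" just emits the previous token ID plus one, and none of it reflects vLLM's actual implementation.

```python
from collections import deque


class ToyEngineCore:
    """Toy inner loop: each step() does schedule -> execute -> output."""

    def __init__(self, max_batch_size=2):
        self.waiting = deque()
        self.running = []
        self.max_batch_size = max_batch_size

    def add_request(self, request_id, prompt_ids, max_tokens):
        self.waiting.append({
            "id": request_id,
            "tokens": list(prompt_ids),
            "num_prompt": len(prompt_ids),
            "max_tokens": max_tokens,
        })

    def has_unfinished(self):
        return bool(self.waiting or self.running)

    def step(self):
        # Schedule: admit waiting requests while there is room in the batch.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

        # Execute: the stand-in model emits last token + 1 for each request.
        for req in self.running:
            req["tokens"].append(req["tokens"][-1] + 1)

        # Output: return finished requests and keep the rest running.
        finished, still_running = [], []
        for req in self.running:
            generated = req["tokens"][req["num_prompt"]:]
            if len(generated) >= req["max_tokens"]:
                finished.append((req["id"], generated))
            else:
                still_running.append(req)
        self.running = still_running
        return finished


engine = ToyEngineCore(max_batch_size=2)
engine.add_request("a", [1, 2], max_tokens=2)
engine.add_request("b", [10], max_tokens=1)
engine.add_request("c", [5], max_tokens=1)

results = []
while engine.has_unfinished():
    results.append(engine.step())
print(results)
```

Request "c" joins the batch in the second step, as soon as "b" has finished, which is the continuous-batching behaviour in miniature.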