Glossary

Every term used in this course, defined once, in plain language. When a phase uses a term, it links here. If you ever feel lost, this is the place to land.

Ordering is roughly conceptual, grouped by theme, not alphabetical — read top to bottom the first time, then use Ctrl-F.

The model and the forward pass
Attention and the KV cache
PagedAttention & memory (Phase 2)
Scheduling & batching (Phase 3)
Kernels & execution (Phases 4–7)
Decoding strategies (Phases 8–9)
Distributed & serving (Phases 10, 15–16)
Adaptation & structure (Phases 11–12)
vLLM internals & process model

The model and the forward pass

Token — a chunk of text (often ~¾ of a word) that the model reads and writes. Text is turned into a list of integer token IDs by a tokenizer.
Embedding — the vector the model uses to represent a token internally.
Forward pass — running the model once over some tokens to produce, for the last position, the logits: raw scores that softmax turns into a probability distribution over the next token.
Logits — the raw, pre-softmax scores over the whole vocabulary for "what comes next".
Autoregressive generation — generate one token, append it to the input, run the forward pass again, repeat. LLMs generate text one token at a time this way.
Decoder-only model — the architecture of GPT/Llama/Qwen: a stack of transformer blocks in which each token attends only to itself and earlier tokens (causal attention). Most current LLMs use it.
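
To make the forward pass, logits, and autoregressive loop concrete, here is a minimal sketch in plain Python. The `toy_forward` function is a hypothetical stand-in for a real model (it simply favors "last token + 1"), and greedy selection is assumed for simplicity; only the shape of the loop is the point.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_forward(token_ids, vocab_size=5):
    """Hypothetical model: returns logits for the position after the last token."""
    logits = [0.0] * vocab_size
    logits[(token_ids[-1] + 1) % vocab_size] = 3.0  # favor "last token + 1"
    return logits

def generate_greedy(prompt_ids, max_new_tokens, eos_id=None):
    """Autoregressive loop: forward pass, pick a token, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_forward(ids))
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

generated = generate_greedy([0, 1], max_new_tokens=4)
```

A real engine replaces `toy_forward` with the transformer and the greedy pick with a configurable sampler, but the outer loop keeps this structure.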

Attention and the KV cache

Attention — the operation where each token "looks at" previous tokens and mixes in their information. Each token computes a Query (Q), compares it against the Key (K) of itself and every earlier token to get attention weights, and uses those weights to take a weighted sum of the corresponding Values (V).
KV cache — because earlier tokens don't change, their K and V vectors can be computed once and cached. The KV cache is the stored K and V for every token processed so far, prompt and generated alike. Apart from the model weights, it is typically the biggest consumer of GPU memory during serving, and unlike the weights it grows with every token. This course is largely the story of managing it well.
Prefill — the first forward pass over the whole prompt at once. Compute-bound (lots of tokens, one pass). Fills the KV cache for the prompt.
Decode — each subsequent single-token forward pass. Memory-bandwidth-bound (one token, must read all weights + the whole KV cache). This is where most serving time goes.
TTFT (time to first token) — latency from request arrival to the first output token. Dominated by prefill.
ITL / TPOT (inter-token latency / time per output token) — time between successive output tokens. Dominated by decode.
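
The relationship between attention, prefill, decode, and the KV cache can be sketched with a single attention head in plain Python. This is an illustration under simplifying assumptions: `ToyKVCache` is a hypothetical name, and the Q/K/V vectors are hand-picked rather than produced by learned projections as they would be in a real model.

```python
import math

def attend(q, keys, values):
    """Single-head attention: score q against every key, then mix the values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]  # softmax over positions
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

class ToyKVCache:
    """Hypothetical cache: K and V for every token processed so far."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

cache = ToyKVCache()

# Prefill: K and V for all prompt tokens are computed in one pass and stored.
for k, v in [([1.0, 0.0], [10.0, 0.0]), ([0.0, 1.0], [0.0, 10.0])]:
    cache.append(k, v)

# Decode: only the new token's q, k, v are computed; older K/V come from the cache.
q, k, v = [1.0, 0.0], [1.0, 0.0], [10.0, 0.0]
cache.append(k, v)
out = attend(q, cache.keys, cache.values)
```

Each decode step does a small amount of new computation but has to read the whole cache, which is why decode tends to be limited by memory bandwidth rather than compute.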

PagedAttention & memory (Phase 2)

PagedAttention — vLLM's core idea: store the KV cache in fixed-size blocks (like OS memory pages) instead of one big contiguous buffer per request. Eliminates fragmentation and enables sharing.
Block (KV block) — a fixed-size slot holding the KV of block_size tokens (commonly 16). The unit of KV allocation. Code: KVCacheBlock in kv_cache_utils.py.
Block size — number of tokens whose KV fits in one block (e.g. 16).
Block table — per-request mapping from logical block index → physical block ID. Lets a request's KV be scattered across non-contiguous physical blocks.
Block pool — the global pool of all physical blocks, with a free list and a prefix-cache index. Code: BlockPool in block_pool.py.
Fragmentation — wasted memory from reserving contiguous space you don't fully use. PagedAttention's reason for existing.
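Prefix caching — if two requests share a prefix (same leading tokens), they can share the same physical KV blocks. Matches are found by hashing each full block's token IDs together with the hash of the block before it, so a hit means the entire prefix matches. Phase 3.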
Copy-on-write (CoW) — when a shared block must diverge (one request writes new tokens), copy it so the other request's view is unaffected.
Reference count (ref_cnt) — how many requests currently use a block. A block is free only when ref_cnt == 0.
Eviction — reclaiming a cached (but currently unused) block for a new allocation. vLLM uses an LRU-ish free queue (FreeKVCacheBlockQueue).
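
The block table, block pool, reference counts, prefix sharing, and eviction described above fit together roughly as in the toy allocator below. This is a sketch, not vLLM's implementation: `ToyBlockPool`, its methods, and the block size of 4 are all hypothetical, a plain list stands in for the free queue, and copy-on-write is left out.

```python
BLOCK_SIZE = 4  # tokens per block (toy value; the glossary's example is 16)

class ToyBlockPool:
    """Hypothetical pool: free list, ref counts, and a prefix-cache index."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block IDs, oldest first
        self.ref_cnt = [0] * num_blocks
        self.hash_to_block = {}              # prefix key -> physical block ID
        self.block_to_hash = {}              # reverse map, needed for eviction

    def _take_free_block(self):
        # Raises IndexError when no block is free; a real engine would preempt.
        block = self.free.pop(0)
        stale = self.block_to_hash.pop(block, None)
        if stale is not None:
            del self.hash_to_block[stale]    # eviction: old cached content is lost
        return block

    def allocate(self, token_ids):
        """Return a block table: logical block index -> physical block ID."""
        table, parent = [], None
        for start in range(0, len(token_ids), BLOCK_SIZE):
            chunk = tuple(token_ids[start:start + BLOCK_SIZE])
            # Only full blocks are shareable; the key chains in the parent key.
            key = (parent, chunk) if len(chunk) == BLOCK_SIZE else None
            block = self.hash_to_block.get(key) if key is not None else None
            if block is None:
                block = self._take_free_block()
                if key is not None:
                    self.hash_to_block[key] = block
                    self.block_to_hash[block] = key
            elif self.ref_cnt[block] == 0:
                self.free.remove(block)      # cached but idle block is revived
            self.ref_cnt[block] += 1
            table.append(block)
            parent = key
        return table

    def release(self, table):
        for block in table:
            self.ref_cnt[block] -= 1
            if self.ref_cnt[block] == 0:
                self.free.append(block)      # stays cached until evicted

pool = ToyBlockPool(num_blocks=8)
table_a = pool.allocate([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
table_b = pool.allocate([1, 2, 3, 4, 5, 6, 7, 8, 99])
```

The two requests share their first two physical blocks because their first eight tokens match; only the partially filled last block differs.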

Scheduling & batching (Phase 3)

Batching — running many requests through the model together to use the GPU efficiently.
Static batching — fix a batch, run it to completion. Wasteful: fast requests wait for slow ones.
Continuous batching — re-decide the batch every iteration (every single token step). Finished requests leave, new ones join immediately. vLLM's default.
Scheduler — the component that, each step, picks which requests run and how many tokens each gets. Code: Scheduler in v1/core/sched/scheduler.py.
Chunked prefill — split a long prompt's prefill across several steps so it doesn't starve ongoing decodes. Controlled by a token budget.
Token budget — max_num_batched_tokens: the cap on total tokens scheduled per step.
Preemption — when memory runs out, evict a running request's KV and put it back in the queue (to be recomputed later). The safety valve.
Running / Waiting queues — requests currently decoding vs. requests waiting to start.
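
Continuous batching, the token budget, and chunked prefill interact as in the following toy scheduler. It is a sketch under stated assumptions: `ToyRequest`, `schedule_step`, and `apply_step` are hypothetical names, running requests are served before waiting ones by choice, and KV memory limits and preemption are ignored.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class ToyRequest:
    name: str
    prompt_len: int
    max_new_tokens: int
    computed: int = 0    # prompt tokens whose KV has been computed
    generated: int = 0   # output tokens produced so far

    @property
    def finished(self):
        return self.generated >= self.max_new_tokens

def schedule_step(running, waiting, token_budget):
    """Decide how many tokens each request gets this step, within the budget."""
    plan = {}
    for req in running:
        if token_budget == 0:
            break
        remaining_prompt = req.prompt_len - req.computed
        want = remaining_prompt if remaining_prompt > 0 else 1  # decode = 1 token
        n = min(want, token_budget)       # a long prefill gets chunked here
        plan[req.name] = n
        token_budget -= n
    while waiting and token_budget > 0:   # new requests join immediately
        req = waiting.popleft()
        n = min(req.prompt_len, token_budget)
        plan[req.name] = n
        token_budget -= n
        running.append(req)
    return plan

def apply_step(running, plan):
    """Pretend to run the model, then drop finished requests from the batch."""
    for req in list(running):
        n = plan.get(req.name, 0)
        if n == 0:
            continue
        if req.computed < req.prompt_len:
            req.computed += n
            if req.computed == req.prompt_len:
                req.generated += 1        # last prefill chunk yields first token
        else:
            req.generated += 1
        if req.finished:
            running.remove(req)

waiting = deque([ToyRequest("A", 6, 2), ToyRequest("B", 10, 2)])
running, history = [], []
while waiting or running:
    plan = schedule_step(running, waiting, token_budget=8)
    apply_step(running, plan)
    history.append(plan)
```

With a budget of 8, request B's 10-token prompt is prefilled over three steps while A decodes alongside it, and A leaves the batch as soon as it finishes.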

Kernels & execution (Phases 4–7)

Kernel — a function that runs on the GPU. "Attention kernel", "GEMM kernel", etc.
GEMM — General Matrix-Matrix Multiply. The workhorse op (every linear layer). Libraries: cuBLAS, CUTLASS.
FlashAttention — a fused, memory-efficient attention kernel that never materializes the full attention matrix.
FlashInfer / FlashMLA / TRTLLM-GEN / Triton — other attention/GEMM kernel providers vLLM can dispatch to.
Attention backend — vLLM's pluggable wrapper choosing which attention kernel to run.
CUDA graph — a recorded sequence of GPU operations replayed with one launch, removing per-op CPU launch overhead. Piecewise = capture parts; full = capture the whole model forward.
torch.compile — PyTorch's compiler; vLLM uses it to fuse ops and generate kernels, with custom graph passes.
MoE (Mixture of Experts) — a layer with many "expert" sub-networks; each token is routed to a few experts. Big models, low active compute. (Mixtral, DeepSeek-V3.)
Quantization — storing weights/activations in fewer bits (FP8, INT4, …) to save memory and bandwidth. Formats: FP8, MXFP4, NVFP4, INT8/INT4, GPTQ, AWQ, GGUF, compressed-tensors.
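
As an illustration of what "fewer bits" means, the sketch below applies symmetric per-tensor INT8 quantization to a small list of weights. This is the simplest textbook scheme, chosen for clarity; it is not a description of any particular format listed above, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used here.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point weights from the integers."""
    return [qi * scale for qi in q]

weights = [0.50, -1.27, 0.03, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte plus a shared scale, and the rounding error per weight is at most half a quantization step (`scale / 2`).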

Decoding strategies (Phases 8–9)

Greedy decoding — always pick the highest-probability token.
Temperature / top-k / top-p / min-p — knobs that shape the sampling distribution.
Parallel sampling (n) — produce n independent completions for one prompt (sharing the prompt's KV via prefix caching).
Beam search — keep the top-N partial sequences by cumulative probability.
Logits processor — a hook that edits the logits before sampling (penalties, bans, grammar masks).
Speculative decoding — a cheap draft model/heuristic proposes several tokens; the big model verifies them in one pass, accepting a prefix. Speeds up decode.
EAGLE / Medusa / n-gram / suffix / DFlash — specific speculative-decoding methods.
Acceptance rate — fraction of drafted tokens the target model accepts. The metric that decides whether spec decode is a win.
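
The "verify in one pass, accept a prefix" step of speculative decoding can be sketched for the greedy case as follows. The function name and inputs are hypothetical. It assumes the target model has already been run once over the drafted tokens, giving its own greedy choice at each drafted position plus one extra position. Verification under sampling uses a probabilistic accept/reject rule that is not shown here.

```python
def verify_greedy(draft_tokens, target_tokens):
    """
    draft_tokens:  k tokens proposed by the draft model or heuristic.
    target_tokens: k + 1 greedy choices of the target model, where entry i is
                   its pick given the context plus draft_tokens[:i].
    Returns (tokens emitted this step, number of draft tokens accepted).
    """
    accepted = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted += 1
    # Entries after the first mismatch were conditioned on a wrong prefix and
    # are discarded; the target's own token at the mismatch is always kept.
    emitted = draft_tokens[:accepted] + [target_tokens[accepted]]
    return emitted, accepted

draft = [5, 9, 2, 7]
target = [5, 9, 4, 1, 8]
emitted, accepted = verify_greedy(draft, target)
acceptance_rate = accepted / len(draft)
```

Every step emits at least one token, so output matches what the target model would have produced alone; the speed-up comes from the steps where several drafted tokens are accepted at once.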

Distributed & serving (Phases 10, 15–16)

Tensor parallelism (TP) — split each layer's weights across GPUs; every GPU does part of every layer; results all-reduced.
Pipeline parallelism (PP) — split the layers across GPUs; activations pass GPU→GPU.
Data parallelism (DP) — replicate the model; split requests across replicas.
Expert parallelism (EP) — split MoE experts across GPUs.
Context parallelism (CP) — split a single sequence's context across GPUs.
Collective op — multi-GPU communication primitive (all-reduce, all-gather, …) via NCCL.
Disaggregated serving — run prefill and decode on different machines, shipping the KV cache between them, so each can be scaled and tuned independently.
KV connector — the component that transfers KV blocks between engines (for P/D disagg or offloading). Code under vllm/distributed/kv_transfer/.
OpenAI-compatible server — vLLM's HTTP server speaking the OpenAI API (plus Anthropic Messages API and gRPC).
Tool calling / reasoning parser — components that extract structured tool calls or chain-of-thought from model output.
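
Tensor parallelism and the all-reduce that follows it can be illustrated without any GPUs. In this sketch two "ranks" each hold half of a weight matrix's input columns, compute a partial result, and a hypothetical `all_reduce_sum` stands in for the NCCL collective. Real implementations shard in more than one way; this shows only the split along the input dimension.

```python
def matvec(W, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def all_reduce_sum(partials):
    """Stand-in for an all-reduce: element-wise sum across ranks."""
    return [sum(vals) for vals in zip(*partials)]

W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]  # 2 outputs x 4 inputs
x = [1, 0, 2, 1]

# Each rank owns half of the input columns and the matching slice of x.
shards = [
    ([row[:2] for row in W], x[:2]),  # rank 0
    ([row[2:] for row in W], x[2:]),  # rank 1
]
partials = [matvec(W_shard, x_shard) for W_shard, x_shard in shards]

y_tp = all_reduce_sum(partials)  # combined result, as every rank would see it
y_ref = matvec(W, x)             # single-device reference
```

Each rank stores and reads only half of the weights, at the cost of one communication step per sharded layer.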

Adaptation & structure (Phases 11–12)

LoRA (Low-Rank Adaptation) — small trainable matrices added to a frozen base model to specialize it. vLLM serves many LoRAs in one batch.
Punica / SGMV — batched kernels that apply different LoRAs to different requests in one GPU call.
Structured output / guided decoding — forcing the model's output to match a grammar, regex, or JSON schema by masking invalid tokens each step. Engines: xgrammar, guidance.
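
The "small trainable matrices" in LoRA are usually written as a low-rank update: the layer output becomes W x + (alpha / r) * B (A x), where W is frozen and only A and B are trained. The sketch below checks that identity on tiny hand-picked matrices; all values are illustrative and the helper names are hypothetical.

```python
def matvec(M, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def matmul(A, B):
    """Dense matrix-matrix product on nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W = [[1.0, 0.0],
     [0.0, 1.0]]   # frozen base weight (out=2, in=2)
A = [[1.0, 2.0]]   # LoRA "down" matrix (r=1, in=2)
B = [[0.5],
     [-1.0]]       # LoRA "up" matrix (out=2, r=1)
alpha, r = 2.0, 1
scale = alpha / r

def lora_forward(x):
    """Base output plus the scaled low-rank correction, kept unmerged."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

# Merging the adapter into the weights gives the same result in one matmul.
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

x = [1.0, 1.0]
y_unmerged = lora_forward(x)
y_merged = matvec(W_merged, x)
```

Keeping the adapter unmerged is what allows different requests in one batch to use different A and B matrices over the same base weights.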

vLLM internals & process model

V1 engine — vLLM's current core architecture (the vllm/v1/ tree). V0 is legacy. This course teaches V1.
LLM — the offline (batch) Python entry point: LLM(model=...).generate(prompts).
AsyncLLM — the async engine powering the API server.
EngineCore — the inner loop: add_request → step() (schedule → execute → output).
Worker / Executor — the executor owns workers; each worker drives one GPU's model.
Model runner — turns a SchedulerOutput into actual tensor inputs and runs the model.
SamplingParams — per-request decoding config (temperature, max_tokens, n, …).
RequestOutput — what the engine returns: generated text/tokens for a request.
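
The offline entry point, SamplingParams, and RequestOutput combine roughly as in the sketch below. It assumes vLLM is installed and a supported accelerator is available. The function name `run_offline_demo`, the prompt, and the small model name used as the default are illustrative choices, not something this course prescribes.

```python
def run_offline_demo(model_name="facebook/opt-125m"):
    """Generate one completion offline and return (prompt, text) pairs."""
    # Imported lazily so that defining this function does not require vLLM.
    from vllm import LLM, SamplingParams

    # Per-request decoding config.
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

    # Loads the model and builds the engine (this is the slow part).
    llm = LLM(model=model_name)

    # One RequestOutput comes back per prompt; each holds n completions.
    request_outputs = llm.generate(["The KV cache is"], params)
    return [(out.prompt, out.outputs[0].text) for out in request_outputs]
```

Calling `run_offline_demo()` returns a list with one `(prompt, generated_text)` pair; the text varies from run to run because sampling is enabled.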

vLLM Mastery — From Zero to Maintainer
