Phase 00 — Deep Dive: a real forward pass and the request counters

Paths relative to upstream/ at v0.22.1 @ 0decac0. You don't need to understand every line of a model — you need to recognize the shapes from the guide (Q/K/V, the KV cache, the prefill/decode counters) in real code. That recognition is what lets you navigate any model file later (Phase 14).

1. A real decoder-only model: Llama

Open vllm/model_executor/models/llama.py. The structure is a Russian doll:

  • LlamaModel (:350) — holds the embedding + a stack of LlamaDecoderLayers + a final norm.
  • LlamaDecoderLayer (:253) — one transformer block: self_attn then mlp, each with a residual add and an RMSNorm.
  • LlamaAttention (:124) — the attention block.
  • LlamaMLP (the small class with forward(self, x) at :117) — gate/up/down projections.

The decoder layer forward (LlamaDecoderLayer.forward, :316)

Skim it and find this shape (paraphrased):

# residual stream in -> norm -> attention -> add -> norm -> mlp -> add -> out
residual = hidden_states                         # keep the pre-norm stream for the add
hidden = self.input_layernorm(hidden_states)
hidden = self.self_attn(positions, hidden)       # attention mixes across tokens
hidden = residual + hidden
residual = hidden                                # new residual for the MLP half
hidden = self.post_attention_layernorm(hidden)
hidden = self.mlp(hidden)                        # per-token transform
hidden = residual + hidden

That's the whole transformer block. 32 of these stacked = Llama-3-8B. Notice attention is the only place tokens interact; the MLP treats each token independently. That's why attention is where the KV cache (cross-token memory) lives, and the MLP is just big GEMMs (Phase 7).
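To make the "only attention mixes tokens" claim concrete, here is a minimal pure-Python sketch. It is not vLLM code: mlp and attention are hypothetical toy functions (a tanh per token, and causal attention with Q = K = V set to the token vectors themselves). It changes token 0 and checks whether token 2's output moves.

```python
import math


def mlp(token):
    # Toy per-token transform: the output depends only on this token's vector.
    return [math.tanh(2.0 * x) for x in token]


def attention(tokens):
    # Toy causal self-attention with Q = K = V = the token vectors.
    out = []
    for i, q in enumerate(tokens):
        visible = tokens[: i + 1]  # causal mask: token i sees tokens 0..i
        scores = [sum(a * b for a, b in zip(q, k)) for k in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # stable softmax
        total = sum(weights)
        out.append([
            sum(w / total * v[d] for w, v in zip(weights, visible))
            for d in range(len(q))
        ])
    return out


seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = [[5.0, -3.0], [0.0, 1.0], [1.0, 1.0]]  # only token 0 differs

# Does token 2's output stay the same when token 0 changes?
mlp_same = [mlp(t) for t in seq][2] == [mlp(t) for t in changed][2]
attn_same = attention(seq)[2] == attention(changed)[2]
```

mlp_same comes out True and attn_same comes out False: the MLP output for token 2 ignores token 0, while the attention output for token 2 depends on it. That dependence on earlier tokens is exactly what a KV cache has to remember.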

Where K and V are produced and cached (LlamaAttention.forward, :223)

This is the payoff. Find (paraphrased):

qkv, _ = self.qkv_proj(hidden_states)            # one matmul produces Q, K, V (fused)
q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
q, k = self.rotary_emb(positions, q, k)          # positional info (RoPE)
attn_output = self.attn(q, k, v)                 # <- the Attention layer (Phase 4)
output, _ = self.o_proj(attn_output)

The self.attn call is a vllm.attention.layer.Attention module — and that is what writes the new k, v into the paged KV cache (Phase 2) and reads back the cached K/V to compute attention (Phase 4). So the journey is: model produces Q/K/V → the Attention layer caches K/V in blocks and runs the attention kernel. Everything you'll learn in Phases 2 and 4 plugs in right here, at this one self.attn(q, k, v) call. Hold that thread.
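As a rough illustration of that seam, the sketch below uses a hypothetical ToyAttention class that appends each new K and V to a cache and then attends over everything cached so far. It assumes one token per call, a single head, and a flat Python list where vLLM uses paged blocks; none of these names come from vLLM.

```python
import math


class ToyAttention:
    """Hypothetical stand-in for the Attention layer: cache K/V, then attend."""

    def __init__(self):
        self.k_cache = []  # one K vector per token seen so far
        self.v_cache = []  # one V vector per token seen so far

    def __call__(self, q, k, v):
        # Write the new token's K and V into the cache...
        self.k_cache.append(k)
        self.v_cache.append(v)
        # ...then read back all cached K/V to compute attention for q.
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, ck)) for ck in self.k_cache]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # stable softmax
        total = sum(weights)
        return [
            sum(w / total * cv[d] for w, cv in zip(weights, self.v_cache))
            for d in range(len(v))
        ]


attn = ToyAttention()
out1 = attn([1.0, 0.0], [1.0, 0.0], [2.0, 0.0])  # first token: attends to itself only
out2 = attn([0.0, 1.0], [0.0, 1.0], [0.0, 4.0])  # second token: attends to both
```

The caller only ever passes the current token's q, k, v. The history lives inside the layer, which is why the model file can stay unaware of how the cache is laid out.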

Don't get lost. You will not understand all of llama.py today, and you don't need to. The point is to locate Q/K/V production and the self.attn call. That's the seam where the engine's memory and kernels meet the model.


2. The two counters that run the whole engine

Open vllm/v1/request.py. The Request class (:59) carries the prompt, the generated tokens, and the sampling params. The two attributes that matter most:

@property
def num_tokens(self) -> int:           # :239
    # total tokens that exist: prompt + generated so far
    ...

and the field set in __init__ (:145): self.num_computed_tokens = 0 — how many of those tokens have had their KV computed and cached.

The whole engine is the race between these two numbers (guide §"mental model"):

  • New request: num_computed_tokens = 0, num_tokens = len(prompt). The gap is the whole prompt → prefill.
  • After prefill and after every decode step: num_computed_tokens is one behind num_tokens. Sampling a token bumps num_tokens, and the next step computes KV for that one token → decode.

num_tokens_with_spec (:243) adds speculative draft tokens to the gap — which is how spec decode (Phase 8) rides the same machinery with no special case. RequestStatus (:315) is the lifecycle enum (WAITING/RUNNING/PREEMPTED/FINISHED_*) you met in Phase 3.
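The race between the two counters can be traced with a small sketch. ToyRequest and run_step are hypothetical names, and the step assumes the whole gap is scheduled at once (no chunking, no speculative tokens), which is simpler than what the real scheduler does.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    """Hypothetical miniature of Request: just the two counters."""
    prompt_token_ids: list
    output_token_ids: list = field(default_factory=list)
    num_computed_tokens: int = 0  # tokens whose KV is already cached

    @property
    def num_tokens(self):
        # Total tokens that exist: prompt + generated so far.
        return len(self.prompt_token_ids) + len(self.output_token_ids)


def run_step(req, next_token):
    """Compute KV for every uncomputed token, then append one sampled token."""
    scheduled = req.num_tokens - req.num_computed_tokens  # the gap
    req.num_computed_tokens += scheduled
    req.output_token_ids.append(next_token)
    return scheduled


req = ToyRequest(prompt_token_ids=[11, 12, 13, 14, 15])
scheduled_per_step = [run_step(req, tok) for tok in (101, 102, 103)]
```

scheduled_per_step ends up as [5, 1, 1]: one prefill step covering the five prompt tokens, then two decode steps of one token each. After every step the request is left with a gap of exactly one token, the one that was just sampled.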

mini_vllm/request.py is a faithful miniature: same num_computed_tokens vs num_tokens, same status enum, same is_finished = status >= FINISHED ordering trick.
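The ordering trick can be sketched with an IntEnum. The member names below are illustrative assumptions rather than a copy of either file; the point is only that all finished states are declared after the live ones, so a single comparison answers "is this request done?".

```python
import enum


class ToyStatus(enum.IntEnum):
    # Live states first...
    WAITING = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # ...and every state from here on counts as finished.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()


def is_finished(status):
    # Works only because finished members are declared last.
    return status >= ToyStatus.FINISHED_STOPPED


finished_flags = {s.name: is_finished(s) for s in ToyStatus}
```

The cost of this design is that declaration order becomes load-bearing: adding a new live state below the first FINISHED_* member would silently mark it as finished.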


3. The loop that drives it all: EngineCore.step

Open vllm/v1/engine/core.py:428. This is the heartbeat of vLLM:

def step(self) -> tuple[dict[int, EngineCoreOutputs], bool]:
    if not self.scheduler.has_requests():
        return {}, False
    scheduler_output = self.scheduler.schedule()                       # Phase 3: who runs
    future = self.model_executor.execute_model(scheduler_output, ...)  # the forward pass
    grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)  # Phase 12
    model_output = future.result()
    if model_output is None:
        model_output = self.model_executor.sample_tokens(grammar_output)   # Phase 9
    engine_core_outputs = self.scheduler.update_from_output(           # advance counters
        scheduler_output, model_output)
    return engine_core_outputs, scheduler_output.total_num_scheduled_tokens > 0

schedule → execute → sample → update. That's it. That's the engine. Every phase in this course is a deep dive into one box of this four-stage loop:

  • schedule() → Phases 2, 3 (memory + batching)
  • execute_model() → Phases 4–7, 10, 13, 14 (kernels, quant, parallelism, the model itself)
  • sample_tokens() → Phases 8, 9, 12 (spec decode, sampling, structured output)
  • update_from_output() → Phase 3 (advance num_computed_tokens, reap finished)

mini_vllm/engine.py's step() is the same loop with the GPU filed off — read them side by side and the correspondence is exact.
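For a feel of the four stages without opening either file, here is a self-contained toy loop. ToyEngine is a hypothetical class, not the contents of mini_vllm/engine.py: the "model" just emits made-up token ids, every request is scheduled in full each step, and a request finishes after max_new_tokens samples.

```python
class ToyEngine:
    """Hypothetical engine: each request is three counters, no GPU involved."""

    def __init__(self, prompt_lens, max_new_tokens):
        self.requests = {
            rid: {"prompt": n, "generated": 0, "computed": 0}
            for rid, n in prompt_lens.items()
        }
        self.max_new_tokens = max_new_tokens

    def step(self):
        if not self.requests:
            return {}, 0
        # schedule: each request runs its uncomputed tokens (the gap).
        scheduled = {
            rid: r["prompt"] + r["generated"] - r["computed"]
            for rid, r in self.requests.items()
        }
        # execute + sample: a fake model that emits one made-up token id each.
        sampled = {rid: 1000 + self.requests[rid]["generated"] for rid in scheduled}
        # update: advance the counters, then reap finished requests.
        for rid, n in scheduled.items():
            self.requests[rid]["computed"] += n
            self.requests[rid]["generated"] += 1
        done = [rid for rid, r in self.requests.items()
                if r["generated"] >= self.max_new_tokens]
        for rid in done:
            del self.requests[rid]
        return sampled, sum(scheduled.values())


engine = ToyEngine({"a": 4, "b": 2}, max_new_tokens=2)
scheduled_totals = []
while engine.requests:
    _, total = engine.step()
    scheduled_totals.append(total)
```

scheduled_totals comes out as [6, 2]: the first step prefills both prompts (4 + 2 tokens) in one batch, and the second step decodes one token per request before both are reaped.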


Reading checklist

One sentence each:

  • In LlamaAttention.forward, which line produces Q/K/V and which line caches/uses K/V?
  • Why does the MLP not need a KV cache but attention does?
  • On Request, what's the difference between num_tokens and num_computed_tokens?
  • In EngineCore.step, name the four stages and which course phase owns each.

Now build it: 02-mini-build.md, then the labs.