Phase 00 — Deep Dive: a real forward pass and the request counters
Paths relative to upstream/ at v0.22.1 @ 0decac0. You don't need to understand every line of a model; you need to recognize the shapes from the guide (Q/K/V, the KV cache, the prefill/decode counters) in real code. That recognition is what lets you navigate any model file later (Phase 14).
Contents
- 1. A real decoder-only model: Llama
- 2. The two counters that run the whole engine
- 3. The loop that drives it all: EngineCore.step
- Reading checklist
1. A real decoder-only model: Llama
Open vllm/model_executor/models/llama.py. The structure is a Russian doll:
- LlamaModel (:350): holds the embedding + a stack of LlamaDecoderLayers + a final norm.
- LlamaDecoderLayer (:253): one transformer block, self_attn then mlp, each with a residual add and an RMSNorm.
- LlamaAttention (:124): the attention block.
- LlamaMLP (the small class with forward(self, x) at :117): gate/up/down projections.
The decoder layer forward (LlamaDecoderLayer.forward, :316)
Skim it and find this shape (paraphrased):
# residual stream in -> norm -> attention -> add -> norm -> mlp -> add -> out
residual = hidden_states # keep the un-normed input for the skip connection
hidden = self.input_layernorm(hidden_states)
hidden = self.self_attn(positions, hidden) # attention mixes across tokens
hidden = residual + hidden
residual = hidden # the updated stream is the skip input for the MLP half
hidden = self.post_attention_layernorm(hidden)
hidden = self.mlp(hidden) # per-token transform
hidden = residual + hidden
That's the whole transformer block. 32 of these stacked = Llama-3-8B. Notice attention is the only place tokens interact; the MLP treats each token independently. That's why attention is where the KV cache (cross-token memory) lives, and the MLP is just big GEMMs (Phase 7).
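The claim that attention is the only place tokens interact can be checked with a toy experiment. The sketch below is plain Python, not vLLM code: `mlp` and `causal_attention` are hypothetical stand-ins with made-up weights, and the attention uses Q = K = V = the input purely to keep it short. It changes only token 0 and checks which later tokens' outputs move.

```python
import math

def mlp(x):
    # Per-token transform: each row is processed without looking at any other row.
    return [[max(0.0, 2.0 * v - 1.0) for v in row] for row in x]

def causal_attention(x):
    # Toy single-head attention with Q = K = V = x (an assumption for brevity).
    # Token i attends to tokens 0..i, so earlier tokens influence later ones.
    out = []
    for i, q in enumerate(x):
        scores = [sum(a * b for a, b in zip(q, x[j])) for j in range(i + 1)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(weights[j] / total * x[j][d] for j in range(i + 1))
            for d in range(len(q))
        ])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = [[5.0, -2.0], [0.0, 1.0], [1.0, 1.0]]  # only token 0 differs

# Compare the outputs for tokens 1 and 2, which were not edited.
mlp_same = mlp(tokens)[1:] == mlp(changed)[1:]
attn_same = causal_attention(tokens)[1:] == causal_attention(changed)[1:]
print("MLP outputs for later tokens unchanged:", mlp_same)
print("attention outputs for later tokens unchanged:", attn_same)
```

The MLP outputs for the untouched tokens stay identical, while the attention outputs shift because every later token reads token 0. That dependence on earlier tokens is what a KV cache has to remember.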
Where K and V are produced and cached (LlamaAttention.forward, :223)
This is the payoff. Find (paraphrased):
qkv, _ = self.qkv_proj(hidden_states) # one matmul produces Q, K, V (fused)
q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
q, k = self.rotary_emb(positions, q, k) # positional info (RoPE)
attn_output = self.attn(q, k, v) # <- the Attention layer (Phase 4)
output, _ = self.o_proj(attn_output)
The self.attn call is a vllm.attention.layer.Attention module — and that is what writes the
new k, v into the paged KV cache (Phase 2) and reads back the cached K/V to compute
attention (Phase 4). So the journey is: model produces Q/K/V → the Attention layer caches K/V in
blocks and runs the attention kernel. Everything you'll learn in Phases 2 and 4 plugs in right
here, at this one self.attn(q, k, v) call. Hold that thread.
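As a rough picture of what happens at that seam, here is a minimal sketch in plain Python. It is not the vLLM Attention layer: `ToyKVCache` is a hypothetical flat list rather than a paged cache, there is a single head, and `q_size` equals `kv_size` so that Q and K can be dotted directly. Each fused row is split into Q, K and V, the new K and V are written to the cache, and attention reads back everything cached so far.

```python
import math

def split_qkv(qkv, q_size, kv_size):
    """Slice one fused projection output into its Q, K and V parts."""
    q = qkv[:q_size]
    k = qkv[q_size:q_size + kv_size]
    v = qkv[q_size + kv_size:q_size + 2 * kv_size]
    return q, k, v

class ToyKVCache:
    """Hypothetical stand-in for the KV cache: one flat list per tensor."""
    def __init__(self):
        self.keys = []
        self.values = []

    def write(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def attend(q, cache):
    """Softmax-weighted sum of every cached V, scored against every cached K."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in cache.keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(cache.values[0])
    return [
        sum(w / total * v[d] for w, v in zip(weights, cache.values))
        for d in range(dim)
    ]

cache = ToyKVCache()
# Two tokens, each already projected to a fused [q0, q1, k0, k1, v0, v1] row.
fused_rows = [
    [1.0, 0.0, 1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 4.0],
]
outputs = []
for qkv in fused_rows:
    q, k, v = split_qkv(qkv, q_size=2, kv_size=2)
    cache.write(k, v)                 # new K/V go into the cache
    outputs.append(attend(q, cache))  # attention reads all cached K/V

print(len(cache.keys), outputs)
```

The second token never recomputes the first token's K and V; it only reads them back. That reuse is the reason the cache exists.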
Don't get lost. You will not understand all of llama.py today, and you don't need to. The point is to locate Q/K/V production and the self.attn call. That's the seam where the engine's memory and kernels meet the model.
2. The two counters that run the whole engine
Open vllm/v1/request.py. The Request class (:59) carries the prompt, the generated tokens,
and the sampling params. The two properties that matter most:
@property
def num_tokens(self) -> int:  # :239
    # total tokens that exist: prompt + generated so far
    ...
and the field set in __init__ (:145): self.num_computed_tokens = 0 — how many of those
tokens have had their KV computed and cached.
The whole engine is the race between these two numbers (guide §"mental model"):
- New request: num_computed_tokens = 0, num_tokens = len(prompt). The gap is the whole prompt → prefill.
- After prefill + each decode: num_computed_tokens is one behind num_tokens; generating a token bumps num_tokens, then the next step computes one more → decode.
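The race between the two counters can be traced by hand. The sketch below is a hypothetical `ToyRequest`, not the vLLM `Request` class and not mini_vllm/request.py; it assumes every step computes the whole gap and then samples exactly one token (a placeholder `0`).

```python
from dataclasses import dataclass, field

@dataclass
class ToyRequest:
    prompt_token_ids: list
    output_token_ids: list = field(default_factory=list)
    num_computed_tokens: int = 0  # tokens whose KV is already cached

    @property
    def num_tokens(self) -> int:
        # Total tokens that exist: prompt + generated so far.
        return len(self.prompt_token_ids) + len(self.output_token_ids)

def run(req, max_new_tokens):
    """Return how many tokens each step had to compute."""
    scheduled = []
    for _ in range(max_new_tokens):
        gap = req.num_tokens - req.num_computed_tokens
        scheduled.append(gap)
        req.num_computed_tokens += gap   # the forward pass closes the gap
        req.output_token_ids.append(0)   # sampling reopens it by one token
    return scheduled

req = ToyRequest(prompt_token_ids=[11, 12, 13, 14, 15])
per_step = run(req, max_new_tokens=4)
print(per_step)  # [5, 1, 1, 1]
```

The first step computes all five prompt tokens (prefill); every later step computes one (decode). Nothing in the loop branches on "prefill" or "decode": the size of the gap is the only thing that differs.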
num_tokens_with_spec (:243) adds speculative draft tokens to the gap — which is how spec
decode (Phase 8) rides the same machinery with no special case. RequestStatus (:315) is the
lifecycle enum (WAITING/RUNNING/PREEMPTED/FINISHED_*) you met in Phase 3.
mini_vllm/request.py is a faithful miniature: same num_computed_tokens vs num_tokens, same
status enum, same is_finished = status >= FINISHED ordering trick.
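The ordering trick relies on the enum being integer-valued, with every finished state declared after every live one. The sketch below shows the idea under that assumption; `ToyStatus` and its specific `FINISHED_*` member names are illustrative, not copied from either codebase.

```python
import enum

class ToyStatus(enum.IntEnum):
    WAITING = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # Every member declared from here on counts as finished.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()

def is_finished(status: ToyStatus) -> bool:
    # One integer comparison replaces a membership test over all FINISHED_* values.
    return status >= ToyStatus.FINISHED_STOPPED

finished = [s.name for s in ToyStatus if is_finished(s)]
print(finished)
```

The cost of the trick is that declaration order becomes part of the contract: a new live state has to be inserted before the first `FINISHED_*` member, never appended at the end.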
3. The loop that drives it all: EngineCore.step
Open vllm/v1/engine/core.py:428. This is the heartbeat of vLLM:
def step(self) -> tuple[dict[int, EngineCoreOutputs], bool]:
    if not self.scheduler.has_requests():
        return {}, False
    scheduler_output = self.scheduler.schedule()  # Phase 3: who runs
    future = self.model_executor.execute_model(scheduler_output, ...)  # the forward pass
    grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)  # Phase 12
    model_output = future.result()
    if model_output is None:
        model_output = self.model_executor.sample_tokens(grammar_output)  # Phase 9
    engine_core_outputs = self.scheduler.update_from_output(  # advance counters
        scheduler_output, model_output)
    return engine_core_outputs, scheduler_output.total_num_scheduled_tokens > 0
schedule → execute → sample → update. That's it. That's the engine. Every phase in this course is a deep dive into one box of this four-stage loop:
- schedule() → Phases 2, 3 (memory + batching)
- execute_model() → Phases 4–7, 10, 13, 14 (kernels, quant, parallelism, the model itself)
- sample_tokens() → Phases 8, 9, 12 (decoding, spec, structured output)
- update_from_output() → Phase 3 (advance num_computed_tokens, reap finished)
mini_vllm/engine.py's step() is the same loop with the GPU filed off — read them side by
side and the correspondence is exact.
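To see the four stages cooperate without opening either codebase, here is a self-contained toy. It is neither `EngineCore` nor mini_vllm/engine.py: `ToyScheduler`, its dict-based request records, and the fake "forward pass" that always samples token `0` are all assumptions made for illustration.

```python
class ToyScheduler:
    def __init__(self):
        self.requests = {}  # req_id -> counters for that request

    def add(self, req_id, prompt_len, max_new_tokens):
        self.requests[req_id] = {
            "num_tokens": prompt_len,
            "num_computed": 0,
            "remaining": max_new_tokens,
        }

    def has_requests(self):
        return bool(self.requests)

    def schedule(self):
        # Who runs: every request gets its whole gap scheduled.
        return {
            req_id: r["num_tokens"] - r["num_computed"]
            for req_id, r in self.requests.items()
        }

    def update_from_output(self, plan, sampled):
        # Advance counters, append the sampled token, reap finished requests.
        for req_id, n in plan.items():
            r = self.requests[req_id]
            r["num_computed"] += n
            if req_id in sampled:
                r["num_tokens"] += 1
                r["remaining"] -= 1
        done = [rid for rid, r in self.requests.items() if r["remaining"] == 0]
        for rid in done:
            del self.requests[rid]
        return done

def step(sched):
    if not sched.has_requests():
        return {}, []
    plan = sched.schedule()                      # schedule
    sampled = {req_id: 0 for req_id in plan}     # execute + sample (faked)
    done = sched.update_from_output(plan, sampled)  # update
    return plan, done

sched = ToyScheduler()
sched.add("A", prompt_len=4, max_new_tokens=2)
trace = [step(sched)]
sched.add("B", prompt_len=3, max_new_tokens=1)   # arrives while A is decoding
trace.append(step(sched))
trace.append(step(sched))
print(trace)
```

In the second step the plan is `{"A": 1, "B": 3}`: one decode token for A and a three-token prefill for B share the same batch. The third step finds no requests left and returns an empty plan.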
Reading checklist
One sentence each:
- In LlamaAttention.forward, which line produces Q/K/V and which line caches/uses K/V?
- Why does the MLP not need a KV cache but attention does?
- On Request, what's the difference between num_tokens and num_computed_tokens?
- In EngineCore.step, name the four stages and which course phase owns each.
Now build it: 02-mini-build.md, then the labs.