Phase 01 — The Hitchhiker's Guide to vLLM's Architecture & Request Lifecycle
This is Chapter 1. Phase 0 taught you what one forward pass is and why it's slow. This chapter zooms out to the whole machine: how vLLM turns that single forward pass into a service that streams tokens to thousands of users at once. We build the architecture the way you'd design it yourself — by starting with the obvious naive server, watching it fail, and fixing each failure. By the end you'll be able to trace any request from an HTTP call to a streamed token and name every component it touches. That mental map is what lets you navigate 500,000 lines of code without drowning.
How to read this chapter. Everyday explanation throughout; paragraphs marked 🔬 Going deeper add the systems rigor for the expert track.
Contents
- 1.1 Don't Panic — the architecture in one breath
- 1.2 Let's design it ourselves: why the naive server fails
- 1.3 Two front doors, one engine
- 1.4 The objects a request becomes (and why each exists)
- 1.5 The heartbeat dissected: EngineCore.step()
- 1.6 The process architecture: why the engine lives alone
- 1.7 Who actually touches the GPU: Executor → Worker → ModelRunner
- 1.8 The request lifecycle: a state machine
- 1.9 How thousands of requests share the loop (the payoff)
- 1.10 Tracing one request, end to end
- 1.11 The mental model to carry forward
- 1.12 What you'll do in this phase
1.1 Don't Panic — the architecture in one breath
A request enters as a string and leaves as tokens. In between it passes through a handful of well-named components, and a tiny loop runs the model over and over until the request is done.
vLLM looks enormous, but the path a request takes is short:
"Tell me a joke"
  │  tokenize, wrap as a request
  ▼
LLM / AsyncLLM   ← the front door (offline batch / online server)
  │  add_request
  ▼
EngineCore.step() ──────── the heartbeat (runs every ~10–50 ms) ──────────────┐
  │  1. schedule()           who runs this step, and how many tokens  (Ph 3)  │
  │  2. execute_model()      run the model on the assembled batch     (Ph 4+) │ loop
  │  3. sample_tokens()      pick the next token for each sequence    (Ph 9)  │ until
  │  4. update_from_output() advance counters, retire finished reqs   (Ph 3)  │ done
  ▼                                                                           │
Detokenizer / OutputProcessor  ← token IDs → text, streamed back ─────────────┘
  │
  ▼
" Why did the function..."   (streamed token by token)
That four-stage step() loop is vLLM. Every later phase is a deep dive into one box of it. The
rest of this chapter explains why it's shaped this way and what each piece does.
1.2 Let's design it ourselves: why the naive server fails
The fastest way to understand vLLM's architecture is to build the obvious version in your head and watch it break. Each break motivates a real component.
Attempt 1 — a function call. "Just call model.generate(prompt) per request." This works for
one user. But it serves requests one at a time: while user A's 500-token answer generates, users
B–Z wait. And from Phase 0 §0.10, a single decode stream uses ~1% of the GPU (it's memory-bound at
batch 1). You're paying for a Ferrari and driving it in a parking lot. → We must run many requests
together (batching).
Attempt 2 — static batching. "Collect N requests, run them as a batch until all finish." Better GPU use, but two new problems:
- Requests have different lengths. A batch finishes at the speed of its slowest member; short requests sit idle in the batch, wasting their slot. (We'll fix this with continuous batching — re-decide the batch every step — in Phase 3.)
- New requests that arrive mid-batch must wait for the whole batch to finish before they can start. Terrible tail latency. → We need a component that re-plans the batch every single step: the scheduler.
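To put rough numbers on the two static-batching problems, here is a minimal sketch. The function name `static_batch_cost` and the four request lengths are made up for illustration; a "slot-step" is one request slot occupied for one decode step.

```python
def static_batch_cost(output_lengths):
    """Return (steps, busy_slot_steps, idle_slot_steps) for one static batch.

    Every request keeps its slot until the longest request in the batch finishes.
    """
    steps = max(output_lengths)
    busy = sum(output_lengths)
    idle = steps * len(output_lengths) - busy
    return steps, busy, idle


# Hypothetical batch: four requests needing 10, 20, 50 and 500 decode steps.
lengths = [10, 20, 50, 500]
steps, busy, idle = static_batch_cost(lengths)
print(f"batch runs {steps} steps; {idle} of {steps * len(lengths)} slot-steps are idle")

# A request that arrives at step 1 still waits for the whole batch to drain.
arrival_step = 1
print(f"late arrival waits {steps - arrival_step} steps before it can start")
```

With these made-up lengths, most of the batch's capacity is spent holding slots for requests that already finished, which is the waste a per-step scheduler removes.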
Attempt 3 — scheduler + model, in one Python process, inside the web server. Now the GPU loop and the HTTP server share a process. Problems:
- The tight, latency-critical GPU loop competes with HTTP parsing, JSON serialization, and detokenization for the GIL, which lets only one thread run Python bytecode at a time. A burst of requests stalls the GPU.
- Multi-GPU (Phase 10) needs multiple processes anyway. → Isolate the engine in its own process, talk to it over a queue. That's vLLM's V1 design.
So the architecture isn't arbitrary — each component is the answer to a specific failure of the naive version. Now let's name the real pieces.
🆕 New words: batching (run many requests together), static vs continuous batching, scheduler (re-plans the batch each step), the GIL (Python's global interpreter lock, which lets only one thread run Python code at a time; it is why the engine gets its own process).
1.3 Two front doors, one engine
vLLM has two entry points, and the crucial insight is that both are thin shells over the same engine core:
- Offline / batch: LLM(model=...).generate(prompts), in vllm/entrypoints/llm.py. You hand it a list of prompts; it returns a list of results when all are done. Synchronous. This is what mini_vllm's LLMEngine.generate mirrors, and what you use in scripts and evals.
- Online / serving: an HTTP server (OpenAI-compatible, Phase 16) → AsyncLLM (vllm/v1/engine/async_llm.py) → the same core, but async and streaming: it yields each token as it's produced so the user sees text appear live.
Both funnel into EngineCore (vllm/v1/engine/core.py). Internalize this: batch and server
are skins; the engine is one. When you fix something in the core, you fix it for both.
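The "two skins, one engine" idea can be sketched in a few lines. Everything here (`ToyCore`, `ToyLLM`, `ToyAsyncLLM`, the fake token strings) is a hypothetical stand-in, not vLLM's API; it only shows a blocking door and a streaming door sharing one core.

```python
import asyncio


class ToyCore:
    """Stand-in for the engine core: yields one fake token per step."""

    def run(self, prompt, max_tokens):
        for i in range(max_tokens):
            yield f"{prompt}-tok{i}"


class ToyLLM:
    """Offline door: blocks until every prompt is finished, returns lists."""

    def __init__(self, core):
        self.core = core

    def generate(self, prompts, max_tokens=3):
        return [list(self.core.run(p, max_tokens)) for p in prompts]


class ToyAsyncLLM:
    """Online door: yields each token as soon as the core produces it."""

    def __init__(self, core):
        self.core = core

    async def generate(self, prompt, max_tokens=3):
        for token in self.core.run(prompt, max_tokens):
            yield token
            await asyncio.sleep(0)  # hand control back to the event loop


async def collect_stream(door, prompt):
    return [tok async for tok in door.generate(prompt)]


core = ToyCore()  # one engine, shared by both doors
batch_result = ToyLLM(core).generate(["joke"])
stream_result = asyncio.run(collect_stream(ToyAsyncLLM(core), "joke"))
print(batch_result[0] == stream_result)
```

Because both doors delegate to the same `ToyCore.run`, a change to the core shows up in both outputs, which is the property the section asks you to internalize.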
1.4 The objects a request becomes (and why each exists)
A request changes form as it travels — and each form is a deliberate data type. Knowing them means that when you read a stack trace, you instantly know which stage you're in by the type in hand.
| Object | Lives between | Carries | Why it exists |
|---|---|---|---|
| prompt + SamplingParams | user → server | the text + decoding knobs (temperature, max_tokens, n, stop) | the user's intent |
| EngineCoreRequest | input proc → core | tokenized prompt + params + a request id | a serializable unit to cross the process boundary |
| Request | inside the scheduler | the live request: token ids, num_computed_tokens / num_tokens, status, block table | the engine's working state (Phase 0's two counters!) |
| SchedulerOutput | scheduler → executor | who runs, how many tokens each, block tables, etc. | the per-step plan |
| ModelRunnerOutput | executor → core | sampled token ids, logprobs | the model's result |
| RequestOutput | core → user | generated text/tokens (a delta, when streaming) | what the caller receives |
🔬 Going deeper. The split between EngineCoreRequest (crosses the process boundary, so it's a plain serializable struct) and Request (rich, mutable, lives only inside the engine process) is not incidental: it's the seam where the IPC boundary sits (§1.6). And RequestOutput being a delta in streaming mode (only the new tokens since last time) is what makes server-sent-events streaming cheap. Naming is half of understanding a system; learn these six.
🆕 New words: SamplingParams, EngineCoreRequest, Request, SchedulerOutput, ModelRunnerOutput, RequestOutput.
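A small sketch of the plain-versus-rich split may help. The `Toy*` classes and their fields are illustrative assumptions loosely based on the table above, and JSON is only a stand-in for whatever wire format the real IPC layer uses.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


@dataclass(frozen=True)
class ToySamplingParams:
    temperature: float = 1.0
    max_tokens: int = 16


@dataclass(frozen=True)
class ToyEngineCoreRequest:
    """Plain, immutable data: safe to serialize and send to another process."""
    request_id: str
    prompt_token_ids: tuple
    params: ToySamplingParams


class ToyStatus(Enum):
    WAITING = "waiting"
    RUNNING = "running"


@dataclass
class ToyRequest:
    """Rich, mutable working state that never leaves the engine process."""
    request_id: str
    token_ids: list
    params: ToySamplingParams
    num_computed_tokens: int = 0
    status: ToyStatus = ToyStatus.WAITING
    block_table: list = field(default_factory=list)

    @property
    def num_tokens(self):
        return len(self.token_ids)


def to_wire(msg):
    return json.dumps(asdict(msg)).encode()


def from_wire(data):
    d = json.loads(data)
    return ToyEngineCoreRequest(d["request_id"], tuple(d["prompt_token_ids"]),
                                ToySamplingParams(**d["params"]))


wire_msg = ToyEngineCoreRequest("req-0", (101, 7, 42, 9, 3), ToySamplingParams())
received = from_wire(to_wire(wire_msg))           # crosses the "process boundary"
live = ToyRequest(received.request_id, list(received.prompt_token_ids), received.params)
print(received == wire_msg, live.num_tokens, live.num_computed_tokens, live.status.name)
```

The wire type carries only what must cross the boundary; the counters, status, and block table appear only once the engine builds its own working object.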
1.5 The heartbeat dissected: EngineCore.step()
The engine is a loop. Each tick (step()) advances every in-flight request by some tokens. Here
is the loop with each stage explained — this is the spine of the whole system:
    def step(self):
        scheduler_output = self.scheduler.schedule()            # 1. PLAN
        model_output = self.model_executor.execute_model(...)   # 2. RUN
        # (sampling happens inside/after execute; shown separate for clarity)
        sampled = self.model_executor.sample_tokens(...)        # 3. PICK
        outputs = self.scheduler.update_from_output(...)        # 4. BOOKKEEP
        return outputs
- Schedule (the plan): the scheduler looks at every waiting and running request and the free KV memory, then decides who runs this step and how many tokens each gets. This is where continuous batching, chunked prefill, prefix caching, and preemption happen (Phases 2–3). Output: a SchedulerOutput.
- Execute (run the model): the executor turns that plan into actual tensors (gather the scheduled tokens, build the attention metadata: block tables and sequence lengths from Phases 2–4) and runs the forward pass on the GPU (possibly as a CUDA graph, Phase 5). This is where kernels, quantization, parallelism, and the model itself live (Phases 4–7, 10, 13, 14).
- Sample (pick tokens): turn the model's logits into one new token per sequence, applying each request's own sampling params, grammar masks, etc. (Phases 8, 9, 12).
- Bookkeep (update): append the sampled tokens, advance each request's num_computed_tokens, detect which requests just finished (hit EOS or max length), free their KV blocks, and emit outputs (Phase 3).
Then it loops. A request might be touched by a few hundred ticks over its lifetime (one per output token, after prefill). Every box of this loop maps to a phase of the course — keep this diagram open as your table of contents.
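To make the four beats concrete, here is a toy engine you can run on a laptop. It is a sketch under heavy assumptions: `ToyEngineCore`, `ToyRequest`, the fake "model" (which just returns the last token id), and the count-down "sampler" are all invented for illustration, and the scheduler admits everything immediately with no memory limit.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    request_id: str
    token_ids: list
    max_tokens: int
    num_prompt_tokens: int = 0
    num_computed_tokens: int = 0
    finished: bool = False

    def __post_init__(self):
        self.num_prompt_tokens = len(self.token_ids)


class ToyEngineCore:
    EOS = 0  # hypothetical end-of-sequence token id

    def __init__(self):
        self.running = []

    def add_request(self, request):
        self.running.append(request)

    def schedule(self):
        # PLAN: every live request gets all of its not-yet-computed tokens.
        return {r.request_id: len(r.token_ids) - r.num_computed_tokens
                for r in self.running}

    def execute_model(self, plan):
        # RUN: a fake forward pass whose "logit" is just the last token id.
        return {r.request_id: r.token_ids[-1] for r in self.running}

    def sample_tokens(self, logits):
        # PICK: a deterministic stand-in for sampling (count down toward EOS).
        return {rid: max(last - 1, self.EOS) for rid, last in logits.items()}

    def update_from_output(self, plan, sampled):
        # BOOKKEEP: advance counters, append tokens, retire finished requests.
        outputs = []
        for r in self.running:
            r.num_computed_tokens += plan[r.request_id]
            token = sampled[r.request_id]
            r.token_ids.append(token)
            generated = len(r.token_ids) - r.num_prompt_tokens
            r.finished = token == self.EOS or generated >= r.max_tokens
            outputs.append((r.request_id, token, r.finished))
        self.running = [r for r in self.running if not r.finished]
        return outputs

    def step(self):
        plan = self.schedule()
        logits = self.execute_model(plan)
        sampled = self.sample_tokens(logits)
        return self.update_from_output(plan, sampled)


engine = ToyEngineCore()
engine.add_request(ToyRequest("a", [5, 3], max_tokens=8))
engine.add_request(ToyRequest("b", [9, 9, 9], max_tokens=2))
while engine.running:
    print(engine.step())
```

Request "b" stops on its length cap and "a" stops on the toy EOS, so the two finish on different ticks while sharing the same loop.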
🔬 Going deeper: the real step is even leaner. In core.py the four stages are visible almost verbatim (you'll read them in the deep-dive). Two production wrinkles: (a) execute_model can run asynchronously (return a future) so the scheduler can plan the next step while the GPU works on this one, overlapping CPU and GPU; (b) a grammar bitmask for structured output (Phase 12) is computed between schedule and sample. Don't let those obscure the four-beat rhythm: plan → run → pick → bookkeep.
1.6 The process architecture: why the engine lives alone
From §1.2, the engine must not share a Python thread with the web server. So V1 runs EngineCore
in its own process (EngineCoreProc). The picture:
┌─────────────── API server process ───────────────┐        ┌──── EngineCore process ────┐
│ HTTP / OpenAI endpoints (Phase 16)               │        │ scheduler                  │
│ tokenization, request validation                 │  IPC   │ the model + KV cache       │
│ AsyncLLM ── EngineCoreRequest ───────────────────┼───────▶│ step() loop                │
│ detokenization, streaming ◀── EngineCoreOutputs ─┼────────┤                            │
└──────────────────────────────────────────────────┘        └────────────────────────────┘
Why this split is worth a whole process boundary:
- The scheduling loop stays tight — no HTTP work, JSON, or detokenization steals its thread or contends for the GIL. The GPU is never starved by web-server bookkeeping.
- Detokenization and streaming run on the server side, off the engine's hot path: turning token IDs back into text and formatting SSE chunks happens in parallel with the next step().
- It generalizes to multi-GPU: the core process becomes the coordinator of worker processes (next section).
The cost is that requests and outputs must be serialized across the boundary (that's why
EngineCoreRequest/EngineCoreOutputs are plain structs, §1.4). It's a price worth paying for an
uninterrupted GPU loop.
🆕 New words: IPC (inter-process communication), EngineCoreProc (the engine's own process), SSE (server-sent events, the streaming protocol).
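The message-passing shape of this split can be sketched with two queues. Note the loud assumption: a thread and `queue.Queue` stand in for the separate process and its IPC channel so the example runs anywhere. A thread still shares the GIL, which is exactly why the real design uses a process. The function `engine_core_loop` and the message tuples are hypothetical.

```python
import queue
import threading


def engine_core_loop(input_queue, output_queue):
    """Busy loop: pull new requests, then advance every live request one token."""
    live = {}  # request_id -> tokens still to generate
    while True:
        # Block only when idle; otherwise poll for arrivals without waiting.
        try:
            msg = input_queue.get(block=not live)
        except queue.Empty:
            msg = None
        if msg == "shutdown":
            return
        if msg is not None:
            request_id, num_tokens = msg
            live[request_id] = num_tokens
        for request_id in list(live):  # one toy "step()"
            live[request_id] -= 1
            finished = live[request_id] == 0
            output_queue.put((request_id, finished))
            if finished:
                del live[request_id]


input_queue, output_queue = queue.Queue(), queue.Queue()
engine = threading.Thread(target=engine_core_loop, args=(input_queue, output_queue))
engine.start()

# The front end only enqueues requests and dequeues outputs.
input_queue.put(("req-0", 3))
events = [output_queue.get(timeout=5) for _ in range(3)]
input_queue.put("shutdown")
engine.join()
print(events)
```

The front end never calls into the loop directly; everything it knows about the engine arrives as messages, which is what lets the two sides live in different processes.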
1.7 Who actually touches the GPU: Executor → Worker → ModelRunner
EngineCore decides what to run; it does not run the model itself. That's delegated down a chain
whose whole purpose is to make the same engine run on 1 GPU or 64:
EngineCore
└─ Executor (vllm/v1/executor/)               owns the worker(s); the engine's handle to compute
   └─ Worker (vllm/v1/worker/gpu_worker.py)   one per GPU: holds a model shard + its KV cache
      └─ ModelRunner (gpu_model_runner.py)    SchedulerOutput → input tensors → forward → sampler
- Executor: for a single GPU it's a UniProcExecutor (just calls the one worker). For tensor/pipeline parallelism (Phase 10) it's a MultiprocExecutor that owns N worker processes and broadcasts each step's plan to all of them.
- Worker: owns one GPU (its device, its slice of the model's weights, and its slice of the KV cache). Runs in lockstep with its peers.
- ModelRunner: the busiest object in the engine. It takes the SchedulerOutput, prepares the input tensors (gathers the scheduled tokens, builds the attention metadata: block tables + sequence lengths + slot mapping, Phases 2/4), runs the (possibly CUDA-graphed) forward pass, and runs the sampler. You'll return to gpu_model_runner.py in Phases 4, 5, 9, 13.
The elegance: the model code is identical whether you run on 1 GPU or 64 — it just uses parallel layers, and the Executor fans the work out. Scaling out changes the Executor, nothing above it.
🔬 Going deeper. This is also where the prepare-inputs cost lives: assembling ragged, variable-length batches into padded tensors and metadata every step is real CPU work, and at small batch it can rival the GPU time. That's a major reason CUDA graphs (Phase 5) and careful tensor reuse matter, and why gpu_model_runner.py is so heavily optimized. When you profile a slow deployment (Phase 18), this file is a frequent suspect.
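Here is a minimal sketch of the delegation chain, showing why the engine's call site does not change with the GPU count. All `Toy*` classes are hypothetical; the workers run in-process rather than as real processes, and returning rank 0's result is an assumption made to keep the toy small.

```python
class ToyModelRunner:
    def __init__(self, rank):
        self.rank = rank

    def execute_model(self, scheduler_output):
        # Stand-in for "prepare inputs -> forward -> sample" on one shard.
        return {"rank": self.rank, "num_tokens": sum(scheduler_output.values())}


class ToyWorker:
    """Owns one device and the model runner bound to it."""

    def __init__(self, rank):
        self.model_runner = ToyModelRunner(rank)

    def execute_model(self, scheduler_output):
        return self.model_runner.execute_model(scheduler_output)


class ToyUniProcExecutor:
    def __init__(self):
        self.worker = ToyWorker(rank=0)

    def execute_model(self, scheduler_output):
        return self.worker.execute_model(scheduler_output)


class ToyMultiWorkerExecutor:
    def __init__(self, world_size):
        self.workers = [ToyWorker(rank) for rank in range(world_size)]

    def execute_model(self, scheduler_output):
        # Broadcast the same plan to every worker; report rank 0's result.
        results = [w.execute_model(scheduler_output) for w in self.workers]
        return results[0]


def engine_step(executor, scheduler_output):
    """The engine only sees executor.execute_model(), never the workers."""
    return executor.execute_model(scheduler_output)


plan = {"req-0": 5, "req-1": 1}  # request id -> tokens scheduled this step
print(engine_step(ToyUniProcExecutor(), plan))
print(engine_step(ToyMultiWorkerExecutor(world_size=4), plan))
```

Swapping the executor is the only change between the two calls, mirroring the claim that scaling out touches the Executor and nothing above it.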
1.8 The request lifecycle: a state machine
Inside the engine, each request moves through a small set of states (RequestStatus in
vllm/v1/request.py). Understanding the states — and especially the transitions — is how you
reason about latency, fairness, and failures.
 (arrives)
     │
     ▼
┌─────────┐   admitted by     ┌─────────┐   generates a token   ┌──────────────────┐
│ WAITING │ ───scheduler────▶ │ RUNNING │ ───each step────────▶ │ FINISHED_*       │
└─────────┘  (KV allocated)   └─────────┘   until stop/maxlen   │ (STOPPED/LENGTH/ │
     ▲                             │                            │  ABORTED/ERROR)  │
     │    preempted: out of KV     │                            └──────────────────┘
     └────────── PREEMPTED ◀───────┘   (KV freed; re-admitted later, recomputed; see Phase 3)
- WAITING: accepted by the engine and queued, but not yet scheduled (no KV allocated yet).
- RUNNING — actively generating; has KV blocks; touched every step.
- PREEMPTED — was running, but the engine ran out of KV memory and evicted it to make progress on others; it goes back to WAITING and is recomputed when memory frees (Phase 3's safety valve).
- FINISHED_*: terminal. The request hit a stop token (STOPPED), hit max length (LENGTH_CAPPED), was cancelled (ABORTED), or errored. Its KV is freed and the final output returned.
🔬 Going deeper. Real vLLM has extra "waiting" sub-states for requests blocked on something other than the queue: waiting for a structured-output grammar to compile (Phase 12), waiting for KV to arrive over the network in disaggregated serving (Phase 15), etc. They're still "not ready to run," just for richer reasons. Also note the enum ordering trick: is_finished is simply status > PREEMPTED, so the terminal states are defined by position in the enum. It's a tiny detail that reduces the hot-path check to a single integer comparison instead of a membership test. You'll trace this exact state machine in lab-01.
🆕 New words: RequestStatus, preemption (evict a running request under memory pressure), terminal/finished states.
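The state machine and the ordering trick fit in a short sketch. `ToyRequestStatus` and the `ALLOWED` table are illustrative: the member list is trimmed, and allowing ABORTED from any live state is an assumption, not something this chapter specifies.

```python
import enum


class ToyRequestStatus(enum.IntEnum):
    WAITING = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # Everything declared after PREEMPTED is terminal.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()

    @staticmethod
    def is_finished(status):
        return status > ToyRequestStatus.PREEMPTED


S = ToyRequestStatus
ALLOWED = {
    S.WAITING: {S.RUNNING, S.FINISHED_ABORTED},
    S.RUNNING: {S.PREEMPTED, S.FINISHED_STOPPED,
                S.FINISHED_LENGTH_CAPPED, S.FINISHED_ABORTED},
    S.PREEMPTED: {S.WAITING, S.FINISHED_ABORTED},
}


def transition(current, new):
    """Return the new status, or raise if the diagram has no such arrow."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


status = S.WAITING
for nxt in (S.RUNNING, S.PREEMPTED, S.WAITING, S.RUNNING, S.FINISHED_STOPPED):
    status = transition(status, nxt)
print(status.name, S.is_finished(status))
```

Terminal states have no entry in `ALLOWED`, so any attempt to move a finished request raises, and adding a new terminal state only requires declaring it after `PREEMPTED`.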
1.9 How thousands of requests share the loop (the payoff)
Now connect the architecture back to Phase 0's physics. Why all this machinery? Because of §0.10:
one decode stream wastes ~99% of the GPU. The architecture exists to keep many requests in
flight so each step() decodes a big batch — amortizing the weight read and pushing arithmetic
intensity toward the roofline ridge.
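A back-of-envelope sketch of that amortization follows. It assumes roughly 2 FLOPs per parameter per token, 2-byte weights read once per step, and no KV-cache or activation traffic; the accelerator numbers are hypothetical, chosen only so the batch-1 case lands near the ~1% figure quoted above.

```python
def decode_arithmetic_intensity(batch_size, bytes_per_param=2):
    """FLOPs per byte of weights read in one decode step.

    Assumes ~2 FLOPs per parameter per token and one full weight read per
    step, ignoring KV-cache and activation traffic.
    """
    flops_per_param = 2 * batch_size
    return flops_per_param / bytes_per_param


# Hypothetical accelerator: 200 TFLOP/s peak compute, 2 TB/s memory bandwidth.
ridge = 200e12 / 2e12  # FLOPs per byte needed to become compute-bound

for batch_size in (1, 8, 64, 512):
    intensity = decode_arithmetic_intensity(batch_size)
    utilization = min(1.0, intensity / ridge)
    print(f"batch {batch_size:>3}: {intensity:6.1f} FLOP/byte, "
          f"~{utilization:.1%} of peak compute")
```

Under these assumptions intensity grows linearly with batch size because the weight read is shared by every sequence in the step, which is the whole economic argument for keeping many requests in flight.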
Crucially, because the scheduler re-plans every step (continuous batching, Phase 3), requests
don't move in lockstep: the moment one finishes, its slot is freed and a WAITING request joins
mid-flight, on the very next tick. So at any instant the running batch is a churning mix of
requests at different stages — some doing their first (prefill) step, most adding one decode token.
The loop in §1.5 absorbs all of that uniformly because, to it, every request is just "advance
num_computed_tokens toward num_tokens" (Phase 0 §0.13). That uniformity is why one simple loop
can serve a chaotic, ever-changing crowd.
time ─►
req A  [prefill][dec][dec][done]
req B  [prefill][dec][dec][dec][done]
req C                     [prefill][dec][dec]...   ← C joined the instant A's slot freed
every column is one step() = one batched forward over whoever's running right now
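The churn in that picture can be reproduced with a tiny simulation. This is a sketch, not vLLM's scheduler: `simulate_continuous_batching` is a made-up function, the only resource limit is a slot count (`max_running`), and every request is assumed to be queued from the start.

```python
from collections import deque


def simulate_continuous_batching(jobs, max_running):
    """jobs: list of (name, steps_needed) in arrival order, all queued at t=0.

    Returns one list per step() naming the requests that ran in that step.
    """
    waiting = deque(jobs)
    running = {}  # name -> steps remaining
    timeline = []
    while waiting or running:
        # Re-plan every step: fill any free slot from the waiting queue.
        while waiting and len(running) < max_running:
            name, steps_needed = waiting.popleft()
            running[name] = steps_needed
        timeline.append(sorted(running))
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]  # slot frees; refilled on the next tick
    return timeline


timeline = simulate_continuous_batching([("A", 3), ("B", 4), ("C", 3)], max_running=2)
for step, batch in enumerate(timeline, start=1):
    print(f"step {step}: {batch}")
```

With two slots, C enters on the step right after A's last one while B is still mid-flight, so no slot sits empty waiting for the batch to drain.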
1.10 Tracing one request, end to end
Let's follow "Tell me a joke" through the offline path, naming each stop (you'll do this live in
lab-01, and read the real code in the deep-dive):
- LLM.generate(["Tell me a joke"]) tokenizes the prompt and builds an EngineCoreRequest.
- add_request wraps it as a Request (num_tokens=5, num_computed_tokens=0, status WAITING) and enqueues it in the scheduler.
- Tick 1 (prefill): schedule() admits it → RUNNING, allocates KV blocks for 5 tokens; execute_model runs the forward over all 5 prompt tokens; sample_tokens produces " Why"; update_from_output sets num_computed_tokens=5, appends " Why" (num_tokens=6).
- Ticks 2..N (decode): each tick schedules 1 new token for this request, runs the model, samples the next token, appends it. num_computed_tokens chases num_tokens, one step at a time.
- Finish: when the model emits the EOS token (or hits max_tokens), update_from_output marks it FINISHED_*, frees its KV blocks, and the detokenizer turns the token IDs into the final string (streamed token-by-token on the server path).
That's the whole life of a request. Notice tick 1 processes many tokens (prefill, compute-bound) and every later tick processes one (decode, memory-bound) — Phase 0's two phases, now visible in the loop.
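The walk-through's counter arithmetic can be checked with a few lines. This sketch assumes the whole prompt is prefilled in one tick (no chunked prefill) and that the request stops on `max_tokens` rather than EOS; `trace_counters` is a hypothetical helper, not part of vLLM or `mini_vllm`.

```python
def trace_counters(num_prompt_tokens, max_tokens):
    """Yield (tick, scheduled, num_computed_tokens, num_tokens) after each tick."""
    num_tokens = num_prompt_tokens  # tokens the request currently holds
    num_computed_tokens = 0         # tokens already run through the model
    generated = 0
    tick = 0
    while generated < max_tokens:
        tick += 1
        scheduled = num_tokens - num_computed_tokens  # the gap to close
        num_computed_tokens += scheduled              # forward pass covers them
        num_tokens += 1                               # sampled token appended
        generated += 1
        yield tick, scheduled, num_computed_tokens, num_tokens


for row in trace_counters(num_prompt_tokens=5, max_tokens=4):
    print("tick {}: scheduled={} num_computed_tokens={} num_tokens={}".format(*row))
```

Tick 1 schedules all 5 prompt tokens and every later tick schedules exactly 1, so `num_computed_tokens` always ends a tick one behind `num_tokens`.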
1.11 The mental model to carry forward
front door (LLM / AsyncLLM)
→ EngineCore.step loop: schedule → execute → sample → update
├─ schedule/update ........ Phases 2, 3 (memory & batching)
├─ execute_model ........ Phases 4–7, 10, 13, 14 (kernels, quant, parallelism, models)
└─ sample_tokens ........ Phases 8, 9, 12 (decoding, spec, structured)
→ detokenize / stream ........ Phase 16 (the serving API)
Every later phase is a zoom into one box of EngineCore.step. You now have the table of contents
for the entire book. When a later chapter says "this happens during execute_model" or "the
scheduler decides X," you'll know exactly where in this picture you are.
1.12 What you'll do in this phase
- Read: 01-deep-dive.md covers LLM.generate, EngineCore.step, LLMEngine, AsyncLLM, and the Executor→Worker→ModelRunner chain, with verified line anchors.
- Build: 02-mini-build.md adds lifecycle tracing to mini_vllm.
- Labs (see labs/README.md for the full guide to each):
  - lab-01-trace-a-request [CPU-OK]: instrument mini_vllm to record a request's full lifecycle (states + the two counters, per step) and assert it matches the WAITING→RUNNING→FINISHED path.
  - lab-02-read-the-real-loop [GPU-OPT]: run real vLLM with debug logging and correlate the output to core.py:step() (captured output included).
  - lab-03-engine-step-by-hand [CPU-OK]: rebuild LLMEngine.step from the scheduler/model/sampler and prove it token-for-token identical to the real loop (incl. the needs_sample guard).
  - lab-04-watch-the-batch [CPU-OK]: probe the scheduler and record per-step batch composition: chunking, deferred admission, and mixed prefill+decode steps, measured.
  - lab-05-stop-conditions [CPU-OK]: EOS vs max_tokens vs ignore_eos, the boundary tie, and the status→finish_reason mapping every API consumer depends on.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
You're ready to move on when you can draw the request's journey from generate() to a streamed
token, name every object and component it becomes/touches, recite the four stages of step() and
which phase owns each, and explain why the engine runs in its own process and why continuous
batching is what makes the whole thing economical.