Phase 01 — The Hitchhiker's Guide to vLLM's Architecture & Request Lifecycle

This is Chapter 1. Phase 0 taught you what one forward pass is and why it's slow. This chapter zooms out to the whole machine: how vLLM turns that single forward pass into a service that streams tokens to thousands of users at once. We build the architecture the way you'd design it yourself — by starting with the obvious naive server, watching it fail, and fixing each failure. By the end you'll be able to trace any request from an HTTP call to a streamed token and name every component it touches. That mental map is what lets you navigate 500,000 lines of code without drowning.

How to read this chapter. Everyday explanation throughout; paragraphs marked 🔬 Going deeper add the systems rigor for the expert track.


1.1 Don't Panic — the architecture in one breath

A request enters as a string and leaves as tokens. In between it passes through a handful of well-named components, and a tiny loop runs the model over and over until the request is done.

vLLM looks enormous, but the path a request takes is short:

  "Tell me a joke"
        │  tokenize, wrap as a request
        ▼
   LLM  /  AsyncLLM            ← the front door (offline batch  /  online server)
        │  add_request
        ▼
   EngineCore.step()  ──────────────── the heartbeat (runs every ~10–50 ms) ───────────┐
        │   1. schedule()              who runs this step, and how many tokens   (Ph 3) │
        │   2. execute_model()         run the model on the assembled batch     (Ph 4+) │  loop
        │   3. sample_tokens()         pick the next token for each sequence      (Ph 9)│  until
        │   4. update_from_output()    advance counters, retire finished reqs     (Ph 3)│  done
        ▼                                                                               │
   Detokenizer / OutputProcessor  ← token IDs → text, streamed back ───────────────────┘
        │
        ▼
   " Why did the function..."   (streamed token by token)

That four-stage step() loop is vLLM. Every later phase is a deep dive into one box of it. The rest of this chapter explains why it's shaped this way and what each piece does.


1.2 Let's design it ourselves: why the naive server fails

The fastest way to understand vLLM's architecture is to build the obvious version in your head and watch it break. Each break motivates a real component.

Attempt 1 — a function call. "Just call model.generate(prompt) per request." This works for one user. But it serves requests one at a time: while user A's 500-token answer generates, users B–Z wait. And from Phase 0 §0.10, a single decode stream uses ~1% of the GPU (it's memory-bound at batch 1). You're paying for a Ferrari and driving it in a parking lot. → We must run many requests together (batching).

Attempt 2 — static batching. "Collect N requests, run them as a batch until all finish." Better GPU use, but two new problems:

  • Requests have different lengths. A batch finishes at the speed of its slowest member; short requests sit idle in the batch, wasting their slot. (We'll fix this with continuous batching — re-decide the batch every step — in Phase 3.)
  • New requests that arrive mid-batch must wait for the whole batch to finish before they can start. Terrible tail latency. → We need a component that re-plans the batch every single step: the scheduler.
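
To put a number on the waste in Attempt 2, here is a minimal sketch that counts decode steps under static batching and under a policy that refills a freed slot on the next step. Both function names are hypothetical, prefill is ignored, and every request is assumed to need at least one output token; it is a toy model of the idea, not how vLLM measures anything.

```python
def static_batch_steps(lengths, batch_size):
    """Decode steps and idle slot-steps under static batching.

    `lengths` are output lengths in tokens, served in arrival order in
    fixed groups of `batch_size`. Each group runs until its longest
    member finishes, so a shorter member idles for the difference.
    """
    total_steps = 0
    idle_slot_steps = 0
    for start in range(0, len(lengths), batch_size):
        group = lengths[start:start + batch_size]
        longest = max(group)
        total_steps += longest
        idle_slot_steps += sum(longest - n for n in group)
    return total_steps, idle_slot_steps


def continuous_batch_steps(lengths, batch_size):
    """Decode steps when the batch is re-planned before every step."""
    waiting = list(lengths)
    running = []  # tokens still to generate, one entry per running request
    steps = 0
    while waiting or running:
        # Re-plan: fill any free slot from the waiting queue.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        # One step yields one token per running request; drop finished ones.
        running = [n - 1 for n in running if n - 1 > 0]
        steps += 1
    return steps


lengths = [2, 10, 3, 10]
print(static_batch_steps(lengths, batch_size=2))      # (20, 15)
print(continuous_batch_steps(lengths, batch_size=2))  # 15
```

With two slots, the static version spends 15 slot-steps idle and needs 20 steps; re-planning every step finishes the same work in 15.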

Attempt 3 — scheduler + model, in one Python process, inside the web server. Now the GPU loop and the HTTP server share a process. Problems:

  • The tight, latency-critical GPU loop competes with HTTP parsing, JSON serialization, and detokenization for Python's global interpreter lock (the GIL), which lets only one thread run Python code at a time. A burst of requests stalls the GPU.
  • Multi-GPU (Phase 10) needs multiple processes anyway. → Isolate the engine in its own process, talk to it over a queue. That's vLLM's V1 design.

So the architecture isn't arbitrary — each component is the answer to a specific failure of the naive version. Now let's name the real pieces.

🆕 New words: batching (run many requests together), static vs continuous batching, scheduler (re-plans the batch each step), the GIL (Python's single-thread lock — why the engine gets its own process).


1.3 Two front doors, one engine

vLLM has two entry points, and the crucial insight is that both are thin shells over the same engine core:

  • Offline / batch: LLM(model=...).generate(prompts) (vllm/entrypoints/llm.py). You hand it a list of prompts; it returns a list of results when all are done. Synchronous. This is what mini_vllm's LLMEngine.generate mirrors, and what you use in scripts and evals.
  • Online / serving: an HTTP server (OpenAI-compatible, Phase 16) → AsyncLLM (vllm/v1/engine/async_llm.py) → the same core, but async and streaming — it yields each token as it's produced so the user sees text appear live.

Both funnel into EngineCore (vllm/v1/engine/core.py). Internalize this: batch and server are skins; the engine is one. When you fix something in the core, you fix it for both.
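
The "two skins, one engine" idea can be sketched with toy classes. Everything below (ToyCore, ToyLLM, ToyAsyncLLM, collect) is hypothetical and only mimics the shape described above: a blocking door that returns lists, and an async door that yields tokens as they appear, both driving the same core object.

```python
import asyncio


class ToyCore:
    """Stand-in for the shared engine: produces one 'token' per step."""

    def run(self, prompt, max_tokens):
        for i in range(max_tokens):
            yield f"{prompt}-tok{i}"


class ToyLLM:
    """Offline door: blocks until every prompt is done, returns a list."""

    def __init__(self, core):
        self.core = core

    def generate(self, prompts, max_tokens=3):
        return [list(self.core.run(p, max_tokens)) for p in prompts]


class ToyAsyncLLM:
    """Online door: same core, but yields each token as it is produced."""

    def __init__(self, core):
        self.core = core

    async def generate(self, prompt, max_tokens=3):
        for token in self.core.run(prompt, max_tokens):
            yield token
            await asyncio.sleep(0)  # hand control back to the event loop


async def collect(stream):
    """Drain an async token stream into a list."""
    return [tok async for tok in stream]


core = ToyCore()
batch_result = ToyLLM(core).generate(["a", "b"])
stream_result = asyncio.run(collect(ToyAsyncLLM(core).generate("a")))
print(batch_result[0] == stream_result)  # True
```

Because both doors delegate to the same core, a change to ToyCore.run shows up in both outputs, which is the property the real split is meant to give you.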


1.4 The objects a request becomes (and why each exists)

A request changes form as it travels — and each form is a deliberate data type. Knowing them means that when you read a stack trace, you instantly know which stage you're in by the type in hand.

| Object | Lives between | Carries | Why it exists |
| --- | --- | --- | --- |
| prompt + SamplingParams | user → server | the text + decoding knobs (temperature, max_tokens, n, stop) | the user's intent |
| EngineCoreRequest | input proc → core | tokenized prompt + params + a request id | a serializable unit to cross the process boundary |
| Request | inside the scheduler | the live request: token ids, num_computed_tokens / num_tokens, status, block table | the engine's working state (Phase 0's two counters!) |
| SchedulerOutput | scheduler → executor | who runs, how many tokens each, block tables, etc. | the per-step plan |
| ModelRunnerOutput | executor → core | sampled token ids, logprobs | the model's result |
| RequestOutput | core → user | generated text/tokens (a delta, when streaming) | what the caller receives |

🔬 Going deeper. The split between EngineCoreRequest (crosses the process boundary, so it's a plain serializable struct) and Request (rich, mutable, lives only inside the engine process) is not incidental — it's the seam where the IPC boundary sits (§1.6). And RequestOutput being a delta in streaming mode (only the new tokens since last time) is what makes server-sent-events streaming cheap. Naming is half of understanding a system; learn these six.
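
To see why a delta keeps streaming cheap, here is a small sketch. ToyRequestOutput and DeltaTracker are hypothetical stand-ins with only the fields needed for the illustration; the real RequestOutput carries more than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyRequestOutput:
    """Hypothetical stand-in for RequestOutput in streaming mode."""
    request_id: str
    new_token_ids: List[int]  # only the tokens produced since the last output
    finished: bool


class DeltaTracker:
    """Remembers how many tokens were already sent for one request."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.num_sent = 0

    def make_delta(self, all_output_token_ids, finished):
        delta = list(all_output_token_ids[self.num_sent:])
        self.num_sent = len(all_output_token_ids)
        return ToyRequestOutput(self.request_id, delta, finished)


def reassemble(outputs):
    """Client side: concatenating the deltas recovers the full sequence."""
    tokens = []
    for out in outputs:
        tokens.extend(out.new_token_ids)
    return tokens


tracker = DeltaTracker("req-0")
generated, outputs = [], []
for token_id in [11, 22, 33]:
    generated.append(token_id)
    outputs.append(tracker.make_delta(generated, finished=(token_id == 33)))

print(reassemble(outputs))                        # [11, 22, 33]
print(sum(len(o.new_token_ids) for o in outputs)) # 3
```

Three tokens cross the wire in total. Sending the full sequence each time would have sent 1 + 2 + 3 = 6, and the gap grows quadratically with output length.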

🆕 New words: SamplingParams, EngineCoreRequest, Request, SchedulerOutput, ModelRunnerOutput, RequestOutput.


1.5 The heartbeat dissected: EngineCore.step()

The engine is a loop. Each tick (step()) advances every in-flight request by some tokens. Here is the loop with each stage explained — this is the spine of the whole system:

def step(self):
    scheduler_output = self.scheduler.schedule()                    # 1. PLAN
    model_output     = self.model_executor.execute_model(...)       # 2. RUN
    # (sampling happens inside/after execute; shown separate for clarity)
    sampled          = self.model_executor.sample_tokens(...)       # 3. PICK
    outputs          = self.scheduler.update_from_output(...)       # 4. BOOKKEEP
    return outputs

  1. Schedule (the plan) — the scheduler looks at every waiting and running request and the free KV memory, and decides: who runs this step, and how many tokens does each get? This is where continuous batching, chunked prefill, prefix caching, and preemption happen (Phases 2–3). Output: a SchedulerOutput.
  2. Execute (run the model) — the executor turns that plan into actual tensors (gather the scheduled tokens, build the attention metadata — block tables and sequence lengths from Phases 2–4) and runs the forward pass on the GPU (possibly as a CUDA graph, Phase 5). This is where kernels, quantization, parallelism, and the model itself live (Phases 4–7, 10, 13, 14).
  3. Sample (pick tokens) — turn the model's logits into one new token per sequence, applying each request's own sampling params, grammar masks, etc. (Phases 8, 9, 12).
  4. Bookkeep (update) — append the sampled tokens, advance each request's num_computed_tokens, detect which requests just finished (hit EOS or max length), free their KV blocks, and emit outputs (Phase 3).

Then it loops. A request might be touched by a few hundred ticks over its lifetime (one per output token, after prefill). Every box of this loop maps to a phase of the course — keep this diagram open as your table of contents.
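
The four beats can be run end to end in a toy engine. This is a sketch under loud assumptions: ToyRequest and ToyEngine are hypothetical, the "model" just sums token ids, the "sampler" is a modulo, there is no KV memory limit, and requests stop only on a token budget. Only the plan → run → pick → bookkeep shape is taken from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyRequest:
    request_id: str
    token_ids: List[int]          # prompt tokens, then generated tokens
    max_new_tokens: int
    num_computed_tokens: int = 0
    num_prompt_tokens: int = 0    # filled in by __post_init__

    def __post_init__(self):
        self.num_prompt_tokens = len(self.token_ids)

    @property
    def num_tokens(self):
        return len(self.token_ids)

    @property
    def finished(self):
        return self.num_tokens - self.num_prompt_tokens >= self.max_new_tokens


class ToyEngine:
    def __init__(self):
        self.running: Dict[str, ToyRequest] = {}

    def add_request(self, req):
        self.running[req.request_id] = req

    def schedule(self):
        # 1. PLAN: each request gets every token it has not computed yet.
        return {rid: r.num_tokens - r.num_computed_tokens
                for rid, r in self.running.items()}

    def execute_model(self, plan):
        # 2. RUN: fake "logits" are the sum of the token ids seen so far.
        return {rid: sum(self.running[rid].token_ids) for rid in plan}

    def sample_tokens(self, logits):
        # 3. PICK: a deterministic fake sampler.
        return {rid: value % 100 for rid, value in logits.items()}

    def update_from_output(self, plan, sampled):
        # 4. BOOKKEEP: advance counters, append tokens, retire finished.
        done = []
        for rid, num_scheduled in plan.items():
            req = self.running[rid]
            req.num_computed_tokens += num_scheduled
            req.token_ids.append(sampled[rid])
            if req.finished:
                done.append(self.running.pop(rid))
        return done

    def step(self):
        plan = self.schedule()
        logits = self.execute_model(plan)
        sampled = self.sample_tokens(logits)
        return self.update_from_output(plan, sampled)


engine = ToyEngine()
engine.add_request(ToyRequest("a", [1, 2, 3], max_new_tokens=2))
engine.add_request(ToyRequest("b", [4], max_new_tokens=3))
finished, steps = [], 0
while engine.running:
    finished += engine.step()
    steps += 1
print(steps, [r.request_id for r in finished])  # 3 ['a', 'b']
```

Note that the first step schedules 3 tokens for request "a" (its whole prompt) and 1 thereafter, without any special casing in step() itself.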

🔬 Going deeper — the real step is even leaner. In core.py the four stages are visible almost verbatim (you'll read them in the deep-dive). Two production wrinkles: (a) execute_model can run asynchronously (return a future) so the scheduler can plan the next step while the GPU works on this one — overlapping CPU and GPU; (b) a grammar bitmask for structured output (Phase 12) is computed between schedule and sample. Don't let those obscure the four-beat rhythm: plan → run → pick → bookkeep.


1.6 The process architecture: why the engine lives alone

From §1.2, the engine must not share a Python thread with the web server. So V1 runs EngineCore in its own process (EngineCoreProc). The picture:

   ┌─────────────── API server process ───────────────┐        ┌──── EngineCore process ────┐
   │  HTTP / OpenAI endpoints  (Phase 16)              │        │  scheduler                 │
   │  tokenization, request validation                 │  IPC   │  the model + KV cache      │
   │  AsyncLLM  ── EngineCoreRequest ──────────────────┼───────▶│  step() loop               │
   │  detokenization, streaming  ◀── EngineCoreOutputs─┼────────┤                            │
   └───────────────────────────────────────────────────┘        └────────────────────────────┘

Why this split is worth a whole process boundary:

  • The scheduling loop stays tight — no HTTP work, JSON, or detokenization steals its thread or contends for the GIL. The GPU is never starved by web-server bookkeeping.
  • Detokenization and streaming run on the server side, off the engine's hot path — turning token IDs back into text and formatting SSE chunks happens in parallel with the next step().
  • It generalizes to multi-GPU: the core process becomes the coordinator of worker processes (next section).

The cost is that requests and outputs must be serialized across the boundary (that's why EngineCoreRequest/EngineCoreOutputs are plain structs, §1.4). It's a price worth paying for an uninterrupted GPU loop.
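
A small sketch of what "plain struct" buys you at the boundary. The class names are hypothetical, JSON stands in for whatever wire format the real engine uses, and the lock is just a placeholder for any in-process state that cannot be turned into bytes.

```python
import json
import threading
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyEngineCoreRequest:
    """Plain data only: safe to encode and send to another process."""
    request_id: str
    prompt_token_ids: List[int]
    max_tokens: int


@dataclass
class ToyLiveRequest:
    """Rich in-process state; the lock stands in for anything non-portable."""
    request_id: str
    token_ids: List[int]
    num_computed_tokens: int = 0
    lock: threading.Lock = field(default_factory=threading.Lock)


def encode(obj) -> bytes:
    """Sender side of the simulated IPC boundary."""
    return json.dumps(vars(obj)).encode("utf-8")


def decode_request(data: bytes) -> ToyEngineCoreRequest:
    """Receiver side: rebuild the plain struct from bytes."""
    return ToyEngineCoreRequest(**json.loads(data.decode("utf-8")))


wire_request = ToyEngineCoreRequest("req-0", [101, 7, 42], max_tokens=16)
received = decode_request(encode(wire_request))
print(received == wire_request)  # True

try:
    encode(ToyLiveRequest("req-0", [101, 7, 42]))
    live_request_is_portable = True
except TypeError:
    # json cannot serialize the lock, so the rich object stays in-process.
    live_request_is_portable = False
print(live_request_is_portable)  # False
```

The plain request survives the round trip unchanged, while the rich one refuses to be encoded at all. That asymmetry is the reason for keeping two separate types on either side of the seam.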

🆕 New words: IPC (inter-process communication), EngineCoreProc (the engine's own process), SSE (server-sent events — the streaming protocol).


1.7 Who actually touches the GPU: Executor → Worker → ModelRunner

EngineCore decides what to run; it does not run the model itself. That's delegated down a chain whose whole purpose is to make the same engine run on 1 GPU or 64:

EngineCore
  └─ Executor          (vllm/v1/executor/)   owns the worker(s); the engine's handle to compute
       └─ Worker        (vllm/v1/worker/gpu_worker.py)   one per GPU: holds a model shard + its KV cache
            └─ ModelRunner  (gpu_model_runner.py)   SchedulerOutput → input tensors → forward → sampler
  • Executor — for a single GPU it's a UniProcExecutor (just calls the one worker). For tensor/pipeline parallelism (Phase 10) it's a MultiprocExecutor that owns N worker processes and broadcasts each step's plan to all of them.
  • Worker — owns one GPU: its device, its slice of the model's weights, and its slice of the KV cache. Runs in lockstep with its peers.
  • ModelRunner — the busiest object in the engine. It takes the SchedulerOutput, prepares the input tensors (gathers the scheduled tokens, builds the attention metadata: block tables + sequence lengths + slot mapping — Phases 2/4), runs the (possibly CUDA-graphed) forward pass, and runs the sampler. You'll return to gpu_model_runner.py in Phases 4, 5, 9, 13.

The elegance: the model code is identical whether you run on 1 GPU or 64 — it just uses parallel layers, and the Executor fans the work out. Scaling out changes the Executor, nothing above it.
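
Here is a toy version of that chain, to make the "only the Executor changes" claim concrete. All three classes are hypothetical, the workers run sequentially in one process rather than as separate processes, and the strided slice plus final sum is only a stand-in for sharded compute followed by an all-reduce.

```python
class ToyModelRunner:
    """Turns a plan into inputs and runs a fake forward pass on one shard."""

    def __init__(self, rank, world_size):
        self.rank = rank
        self.world_size = world_size

    def execute(self, plan):
        # Fake sharded work: each rank sums only its slice of the tokens.
        return {rid: sum(tokens[self.rank::self.world_size])
                for rid, tokens in plan.items()}


class ToyWorker:
    """One per device: owns a runner (in the real thing, weights and KV too)."""

    def __init__(self, rank, world_size):
        self.runner = ToyModelRunner(rank, world_size)

    def execute_model(self, plan):
        return self.runner.execute(plan)


class ToyExecutor:
    """The engine's single handle to compute, whatever the worker count."""

    def __init__(self, num_workers):
        self.workers = [ToyWorker(r, num_workers) for r in range(num_workers)]

    def execute_model(self, plan):
        # Broadcast the same plan to every worker, then combine partials.
        partials = [w.execute_model(plan) for w in self.workers]
        return {rid: sum(p[rid] for p in partials) for rid in plan}


plan = {"req-a": [1, 2, 3, 4], "req-b": [10, 20]}
single = ToyExecutor(num_workers=1).execute_model(plan)
sharded = ToyExecutor(num_workers=4).execute_model(plan)
print(single == sharded)  # True
```

The caller builds the same plan and calls the same execute_model in both cases; only the worker count passed to the executor differs.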

🔬 Going deeper. This is also where the prepare-inputs cost lives — assembling ragged, variable-length batches into padded tensors and metadata every step is real CPU work, and at small batch it can rival the GPU time. That's a major reason CUDA graphs (Phase 5) and careful tensor reuse matter, and why gpu_model_runner.py is so heavily optimized. When you profile a slow deployment (Phase 18), this file is a frequent suspect.


1.8 The request lifecycle: a state machine

Inside the engine, each request moves through a small set of states (RequestStatus in vllm/v1/request.py). Understanding the states — and especially the transitions — is how you reason about latency, fairness, and failures.

     (arrives)
        │
        ▼
   ┌─────────┐   admitted by      ┌─────────┐   generates a token   ┌──────────────────┐
   │ WAITING │ ───scheduler────▶ │ RUNNING │ ───each step────────▶ │ FINISHED_*       │
   └─────────┘   (KV allocated)  └─────────┘   until stop/maxlen    │ (STOPPED/LENGTH/ │
        ▲                            │                              │  ABORTED/ERROR)  │
        │      preempted: out of KV  │                              └──────────────────┘
        └────────── PREEMPTED ◀──────┘   (KV freed; re-admitted later, recomputed — Phase 3)
  • WAITING — admitted to the engine, queued, not yet running (no KV allocated yet).
  • RUNNING — actively generating; has KV blocks; touched every step.
  • PREEMPTED — was running, but the engine ran out of KV memory and evicted it to make progress on others; it goes back to WAITING and is recomputed when memory frees (Phase 3's safety valve).
  • FINISHED_* — terminal: hit a stop token (STOPPED), hit max length (LENGTH_CAPPED), was cancelled (ABORTED), or errored. Its KV is freed and the final output returned.

🔬 Going deeper. Real vLLM has extra "waiting" sub-states for requests blocked on something other than the queue: waiting for a structured-output grammar to compile (Phase 12), waiting for KV to arrive over the network in disaggregated serving (Phase 15), etc. They're still "not ready to run," just for richer reasons. Also note the enum ordering trick: is_finished is simply status > PREEMPTED, so the terminal states are defined by position in the enum — a tiny detail that makes the hot-path check branch-free. You'll trace this exact state machine in lab-01.
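
The ordering trick is easy to reproduce. The sketch below uses a hypothetical ToyRequestStatus with an abbreviated set of states (no waiting sub-states, no error state), and the transition table follows only the arrows drawn in the diagram above; it is not a copy of the real enum.

```python
import enum


class ToyRequestStatus(enum.IntEnum):
    WAITING = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # Everything declared after PREEMPTED is terminal.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()


def is_finished(status):
    """Terminal states are defined purely by their position in the enum."""
    return status > ToyRequestStatus.PREEMPTED


S = ToyRequestStatus
ALLOWED = {
    S.WAITING: {S.RUNNING},
    S.RUNNING: {S.PREEMPTED, S.FINISHED_STOPPED,
                S.FINISHED_LENGTH_CAPPED, S.FINISHED_ABORTED},
    S.PREEMPTED: {S.WAITING},
    # Terminal states have no entry: nothing may follow them.
}


def transition(current, new):
    """Return the new status, or raise if the diagram has no such arrow."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current.name} -> {new.name}")
    return new


status = S.WAITING
for nxt in (S.RUNNING, S.PREEMPTED, S.WAITING, S.RUNNING, S.FINISHED_STOPPED):
    status = transition(status, nxt)
print(status.name, is_finished(status))  # FINISHED_STOPPED True
```

The price of the trick is that the declaration order becomes load-bearing: inserting a new non-terminal state after PREEMPTED would silently make it count as finished.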

🆕 New words: RequestStatus, preemption (evict a running request under memory pressure), terminal/finished states.


1.9 How thousands of requests share the loop (the payoff)

Now connect the architecture back to Phase 0's physics. Why all this machinery? Because of §0.10: one decode stream wastes ~99% of the GPU. The architecture exists to keep many requests in flight so each step() decodes a big batch — amortizing the weight read and pushing arithmetic intensity toward the roofline ridge.

Crucially, because the scheduler re-plans every step (continuous batching, Phase 3), requests don't move in lockstep: the moment one finishes, its slot is freed and a WAITING request joins mid-flight, on the very next tick. So at any instant the running batch is a churning mix of requests at different stages — some doing their first (prefill) step, most adding one decode token. The loop in §1.5 absorbs all of that uniformly because, to it, every request is just "advance num_computed_tokens toward num_tokens" (Phase 0 §0.13). That uniformity is why one simple loop can serve a chaotic, ever-changing crowd.

time ─►
req A  [prefill][dec][dec][done]
req B        [prefill][dec][dec][dec][done]
req C                  [prefill][dec][dec]...        ← C joined the instant A's slot freed
        every column is one step() = one batched forward over whoever's running right now
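
The timeline above can be reproduced with a few lines of simulation. This is a sketch with assumed inputs: the simulate function, the arrival steps, and the fixed number of steps each request needs are all hypothetical, and a slot limit stands in for KV memory.

```python
from collections import deque


def simulate(requests, max_running):
    """Return the sorted names of running requests for each step().

    `requests` is a list of (name, arrival_step, steps_needed) tuples,
    sorted by arrival_step.
    """
    pending = deque(requests)
    waiting = deque()
    running = {}   # name -> steps still needed
    timeline = []
    step = 0
    while pending or waiting or running:
        # New arrivals join the waiting queue.
        while pending and pending[0][1] <= step:
            name, _, needed = pending.popleft()
            waiting.append((name, needed))
        # Re-plan: fill every free slot from the waiting queue.
        while waiting and len(running) < max_running:
            name, needed = waiting.popleft()
            running[name] = needed
        timeline.append(sorted(running))
        # One batched forward advances everyone who is running.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]  # the slot is free for the next step
        step += 1
    return timeline


timeline = simulate([("A", 0, 3), ("B", 1, 4), ("C", 1, 3)], max_running=2)
for i, names in enumerate(timeline):
    print(i, names)
# 0 ['A']
# 1 ['A', 'B']
# 2 ['A', 'B']
# 3 ['B', 'C']
# 4 ['B', 'C']
# 5 ['C']
```

C arrives at step 1 but finds both slots taken; it joins at step 3, the first step after A's last one.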

1.10 Tracing one request, end to end

Let's follow "Tell me a joke" through the offline path, naming each stop (you'll do this live in lab-01, and read the real code in the deep-dive):

  1. LLM.generate(["Tell me a joke"]) tokenizes the prompt and builds an EngineCoreRequest.
  2. add_request wraps it as a Request (num_tokens=5, num_computed_tokens=0, status WAITING) and enqueues it in the scheduler.
  3. Tick 1 (prefill): schedule() admits it → RUNNING, allocates KV blocks for 5 tokens; execute_model runs the forward over all 5 prompt tokens; sample_tokens produces " Why"; update_from_output sets num_computed_tokens=5, appends " Why" (num_tokens=6).
  4. Ticks 2..N (decode): each tick schedules 1 new token for this request, runs the model, samples the next token, appends it. num_computed_tokens chases num_tokens, one step at a time.
  5. Finish: when the model emits the EOS token (or hits max_tokens), update_from_output marks it FINISHED_*, frees its KV blocks, and the detokenizer turns the token IDs into the final string (streamed token-by-token on the server path).

That's the whole life of a request. Notice tick 1 processes many tokens (prefill, compute-bound) and every later tick processes one (decode, memory-bound) — Phase 0's two phases, now visible in the loop.
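
The counter arithmetic in that walkthrough can be checked with a tiny trace. The trace_request function is hypothetical, and it assumes the whole prompt fits in the first tick (no chunked prefill) and that the request stops on a token budget rather than on EOS.

```python
def trace_request(num_prompt_tokens, max_new_tokens):
    """Return (tick, num_scheduled, num_computed_tokens, num_tokens) rows.

    Each row records the state after that tick's bookkeeping.
    """
    num_tokens = num_prompt_tokens
    num_computed_tokens = 0
    rows = []
    tick = 0
    while num_tokens - num_prompt_tokens < max_new_tokens:
        tick += 1
        # Schedule whatever has not been computed: the prompt, then 1 token.
        num_scheduled = num_tokens - num_computed_tokens
        num_computed_tokens += num_scheduled   # the forward pass catches up
        num_tokens += 1                        # the sampled token is appended
        rows.append((tick, num_scheduled, num_computed_tokens, num_tokens))
    return rows


for row in trace_request(num_prompt_tokens=5, max_new_tokens=3):
    print(row)
# (1, 5, 5, 6)
# (2, 1, 6, 7)
# (3, 1, 7, 8)
```

The first row matches tick 1 in the walkthrough (5 tokens scheduled, num_computed_tokens=5, num_tokens=6), and every later row schedules exactly one token, with num_computed_tokens always one behind num_tokens.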


1.11 The mental model to carry forward

   front door (LLM / AsyncLLM)
        → EngineCore.step loop:   schedule → execute → sample → update
              ├─ schedule/update  ........ Phases 2, 3   (memory & batching)
              ├─ execute_model    ........ Phases 4–7, 10, 13, 14  (kernels, quant, parallelism, models)
              └─ sample_tokens    ........ Phases 8, 9, 12  (decoding, spec, structured)
        → detokenize / stream     ........ Phase 16  (the serving API)

Every later phase is a zoom into one box of EngineCore.step. You now have the table of contents for the entire book. When a later chapter says "this happens during execute_model" or "the scheduler decides X," you'll know exactly where in this picture you are.


1.12 What you'll do in this phase

  • Read: 01-deep-dive.md, covering LLM.generate, EngineCore.step, LLMEngine, AsyncLLM, and the Executor→Worker→ModelRunner chain, with verified line anchors.
  • Build: 02-mini-build.md — add lifecycle tracing to mini_vllm.
  • Labs (see labs/README.md for the full guide to each):
    • lab-01-trace-a-request [CPU-OK] — instrument mini_vllm to record a request's full lifecycle (states + the two counters, per step) and assert it matches the WAITING→RUNNING→FINISHED path.
    • lab-02-read-the-real-loop [GPU-OPT] — run real vLLM with debug logging and correlate the output to core.py:step() (captured output included).
    • lab-03-engine-step-by-hand [CPU-OK] — rebuild LLMEngine.step from the scheduler/model/sampler and prove it token-for-token identical to the real loop (incl. the needs_sample guard).
    • lab-04-watch-the-batch [CPU-OK] — probe the scheduler and record per-step batch composition: chunking, deferred admission, and mixed prefill+decode steps, measured.
    • lab-05-stop-conditions [CPU-OK] — EOS vs max_tokens vs ignore_eos, the boundary tie, and the status→finish_reason mapping every API consumer depends on.
  • Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

You're ready to move on when you can draw the request's journey from generate() to a streamed token, name every object and component it becomes/touches, recite the four stages of step() and which phase owns each, and explain why the engine runs in its own process and why continuous batching is what makes the whole thing economical.
