Phase 01 — Cheatsheet: Architecture & Request Lifecycle

The journey
Entry points
The compute chain
Objects that flow
Lifecycle
Process model
Key upstream

The journey

LLM.generate / AsyncLLM  ->  EngineCore.step (loop)  ->  Detokenizer  ->  RequestOutput
   step = schedule (Ph3) -> execute_model (Ph4-14) -> sample (Ph9) -> update_from_output (Ph3)
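
Sketch: the step loop above as a toy. Everything here (ToyRequest, ToyEngineCore, run_to_completion, the "last token + 1" fake model) is a hypothetical stand-in, not vLLM code; only the four-stage order of step() is taken from the diagram.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    request_id: str
    prompt_token_ids: list  # must be non-empty in this toy
    max_tokens: int
    output_token_ids: list = field(default_factory=list)


class ToyEngineCore:
    """Hypothetical stand-in for the engine core; not the vLLM class."""

    def __init__(self, max_running=2):
        self.waiting = deque()
        self.running = []
        self.max_running = max_running

    def add_request(self, request):
        self.waiting.append(request)

    def has_unfinished(self):
        return bool(self.waiting or self.running)

    def schedule(self):
        # Admit waiting requests until the (toy) batch budget is full.
        while self.waiting and len(self.running) < self.max_running:
            self.running.append(self.waiting.popleft())
        return list(self.running)

    def execute_model(self, batch):
        # Fake forward pass: the "prediction" is the last seen token + 1.
        return {
            r.request_id: (r.output_token_ids or r.prompt_token_ids)[-1] + 1
            for r in batch
        }

    def sample(self, model_output):
        # Greedy in this toy: take the prediction as-is.
        return model_output

    def update_from_output(self, batch, sampled):
        # Append the new token, then retire requests that hit max_tokens.
        finished = []
        for r in batch:
            r.output_token_ids.append(sampled[r.request_id])
            if len(r.output_token_ids) >= r.max_tokens:
                finished.append(r)
        done_ids = {r.request_id for r in finished}
        self.running = [r for r in self.running if r.request_id not in done_ids]
        return finished

    def step(self):
        # schedule -> execute_model -> sample -> update_from_output
        batch = self.schedule()
        model_output = self.execute_model(batch)
        sampled = self.sample(model_output)
        return self.update_from_output(batch, sampled)


def run_to_completion(engine):
    # Drive step() in a loop until no request is waiting or running.
    results, steps = {}, 0
    while engine.has_unfinished():
        for r in engine.step():
            results[r.request_id] = r.output_token_ids
        steps += 1
    return results, steps
```

Note: a request that finishes frees its slot, so a waiting request joins the batch on the very next step; that is the point of scheduling per step rather than per batch.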

Entry points

Offline: LLM(model=...).generate(prompts) (entrypoints/llm.py:422) → LLMEngine.
Online: OpenAI server → AsyncLLM (v1/engine/async_llm.py). Same core, async + streaming.

The compute chain

EngineCore → Executor (1 or N workers) → Worker (owns one GPU) → ModelRunner (gpu_model_runner.py: SchedulerOutput → tensors → forward → sample).
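
Sketch: the ownership chain as plain delegation. All Toy* names, the dict-based scheduler output, and the sum/modulo "model" are hypothetical; how the real executor combines results from N workers is not covered here, so this toy simply returns the first worker's output.

```python
from dataclasses import dataclass


@dataclass
class ToySchedulerOutput:
    # request_id -> token ids to run this step (hypothetical shape)
    scheduled: dict


class ToyModelRunner:
    """SchedulerOutput -> inputs -> forward -> sample, all faked."""

    def __init__(self, device):
        self.device = device

    def execute_model(self, sched):
        sampled = {}
        for request_id, token_ids in sched.scheduled.items():
            hidden = sum(token_ids)            # fake forward pass
            sampled[request_id] = hidden % 100  # fake sampling
        return sampled


class ToyWorker:
    """Owns exactly one device and one runner."""

    def __init__(self, rank):
        self.rank = rank
        self.runner = ToyModelRunner(device=f"cuda:{rank}")

    def execute_model(self, sched):
        return self.runner.execute_model(sched)


class ToyExecutor:
    """Fans one step out to 1 or N workers."""

    def __init__(self, num_workers):
        self.workers = [ToyWorker(rank) for rank in range(num_workers)]

    def execute_model(self, sched):
        outputs = [w.execute_model(sched) for w in self.workers]
        # Assumption for this toy: every worker computes the same result,
        # so the first one is returned.
        return outputs[0]
```

Note: the engine core only ever talks to the executor; whether there is one worker or many is hidden behind that single execute_model call.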

Objects that flow

prompt+SamplingParams → EngineCoreRequest → Request (counters+status) → SchedulerOutput → ModelRunnerOutput → RequestOutput.

Lifecycle

WAITING → RUNNING → FINISHED_*. On preemption: RUNNING → PREEMPTED, then back to the waiting queue (Phase 3). Defined in RequestStatus (request.py:315).
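
Sketch: the lifecycle as a checked state machine. ToyRequestStatus is a hypothetical subset, and the ALLOWED table is an assumption read off the arrows in this cheatsheet (plus abort edges added for illustration), not copied from request.py.

```python
from enum import Enum, auto


class ToyRequestStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    PREEMPTED = auto()
    FINISHED_STOPPED = auto()
    FINISHED_ABORTED = auto()

    @property
    def is_finished(self):
        # Terminal states share the FINISHED_ prefix.
        return self.name.startswith("FINISHED_")


S = ToyRequestStatus

# Assumed transition table; finished states have no outgoing edges.
ALLOWED = {
    S.WAITING: {S.RUNNING, S.FINISHED_ABORTED},
    S.RUNNING: {S.PREEMPTED, S.FINISHED_STOPPED, S.FINISHED_ABORTED},
    S.PREEMPTED: {S.WAITING},  # re-queued, modeled here as an edge to WAITING
}


def transition(current, new):
    """Return the new status, or raise ValueError if the edge is not allowed."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current.name} -> {new.name}")
    return new
```

Usage: transition(S.RUNNING, S.PREEMPTED) returns S.PREEMPTED; any transition out of a FINISHED_* state raises.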

Process model

EngineCore runs in its own process (EngineCoreProc, core.py:835): a tight loop that does not contend for the front end's GIL. Detokenization runs in the front-end (server) process, off the engine's hot path.
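
Sketch: the process split in miniature. A child process plays the engine core and returns token IDs only; the parent plays the front end and turns IDs into text. The JSON-over-pipes transport, CORE_SRC, VOCAB, detokenize, and run_frontend are all hypothetical and chosen for portability; the real IPC mechanism is not shown here.

```python
import json
import subprocess
import sys

# Source for the toy "engine core" child. It reads one JSON request per line
# on stdin and writes one JSON reply per line on stdout (token IDs, no text).
CORE_SRC = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    # Fake generation: emit the prompt token ids in reverse order.
    reply = {"request_id": req["request_id"], "token_ids": req["token_ids"][::-1]}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""

# Toy vocabulary owned by the front end, not by the core.
VOCAB = {0: "hello", 1: "world", 2: "!"}


def detokenize(token_ids):
    # Runs in the parent (front-end) process, outside the core's loop.
    return " ".join(VOCAB[t] for t in token_ids)


def run_frontend(requests):
    """Send each request to the core process and detokenize its reply."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CORE_SRC],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    results = {}
    try:
        for request_id, token_ids in requests.items():
            msg = {"request_id": request_id, "token_ids": token_ids}
            proc.stdin.write(json.dumps(msg) + "\n")
            proc.stdin.flush()
            reply = json.loads(proc.stdout.readline())
            results[reply["request_id"]] = detokenize(reply["token_ids"])
    finally:
        # Closing stdin ends the child's read loop so it can exit cleanly.
        proc.stdin.close()
        proc.wait()
        proc.stdout.close()
    return results
```

Note: because the child is a separate interpreter, string work in the parent never holds the interpreter lock that the core's loop needs.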

Key upstream

entrypoints/llm.py:422 generate · v1/engine/llm_engine.py:209/287 add_request/step
v1/engine/core.py:428 step · :337 add_request · :835 EngineCoreProc
v1/engine/async_llm.py AsyncLLM · v1/worker/gpu_model_runner.py the runner

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

vLLM Mastery — From Zero to Maintainer