Phase 01 — Cheatsheet: Architecture & Request Lifecycle
The journey
LLM.generate / AsyncLLM -> EngineCore.step (loop) -> Detokenizer -> RequestOutput
step = schedule (Ph3) -> execute_model (Ph4-14) -> sample (Ph9) -> update_from_output (Ph3)
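The four stages of a step can be sketched as a toy loop. This is a minimal illustration, assuming a scheduler that simply caps the number of running requests. `ToyEngineCore`, `ToyRequest`, `max_running`, and the fake "model" are hypothetical names made up for this sketch, not vLLM classes; only the stage names come from the line above.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    request_id: str
    prompt_token_ids: list[int]
    max_tokens: int
    output_token_ids: list[int] = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.output_token_ids) >= self.max_tokens


class ToyEngineCore:
    """Hypothetical stand-in for the engine core; not vLLM code."""

    def __init__(self, max_running: int = 2) -> None:
        self.waiting: deque[ToyRequest] = deque()
        self.running: list[ToyRequest] = []
        self.max_running = max_running

    def add_request(self, request: ToyRequest) -> None:
        self.waiting.append(request)

    def schedule(self) -> list[ToyRequest]:
        # Admit waiting requests until the running batch is full.
        while self.waiting and len(self.running) < self.max_running:
            self.running.append(self.waiting.popleft())
        return list(self.running)

    def execute_model(self, batch: list[ToyRequest]) -> dict[str, int]:
        # Fake forward pass: the "logit" is the sequence length so far.
        return {
            r.request_id: len(r.prompt_token_ids) + len(r.output_token_ids)
            for r in batch
        }

    def sample(self, logits: dict[str, int]) -> dict[str, int]:
        # Fake sampler: deterministic, one new token per scheduled request.
        return {rid: value % 100 for rid, value in logits.items()}

    def update_from_output(self, sampled: dict[str, int]) -> list[ToyRequest]:
        # Append the new tokens and retire requests that reached max_tokens.
        finished = []
        for request in list(self.running):
            request.output_token_ids.append(sampled[request.request_id])
            if request.finished:
                self.running.remove(request)
                finished.append(request)
        return finished

    def step(self) -> list[ToyRequest]:
        batch = self.schedule()
        if not batch:
            return []
        return self.update_from_output(self.sample(self.execute_model(batch)))


def run_to_completion(core: ToyEngineCore) -> list[str]:
    """Drive step() until no work remains; return ids in completion order."""
    done: list[str] = []
    while core.waiting or core.running:
        done.extend(r.request_id for r in core.step())
    return done
```

Note how a short request can finish and free its slot while a longer one keeps running, which is the point of scheduling per step rather than per batch.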
Entry points
- Offline: LLM(model=...).generate(prompts) (entrypoints/llm.py:422) → LLMEngine.
- Online: OpenAI server → AsyncLLM (v1/engine/async_llm.py). Same core, async + streaming.
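The "same core, two front-ends" idea can be illustrated with a toy: one blocking wrapper that returns only when everything is finished, and one async generator that yields tokens as each step produces them. `ToyCore`, `generate_offline`, `generate_stream`, and `has_unfinished` are hypothetical names for this sketch and do not mirror vLLM's actual signatures; the "model" just echoes the prompt's words.

```python
import asyncio
from collections.abc import AsyncIterator


class ToyCore:
    """Hypothetical shared core: emits one token per request per step."""

    def __init__(self) -> None:
        self.pending: dict[str, list[str]] = {}

    def add_request(self, request_id: str, prompt: str) -> None:
        # "Generation" here echoes the prompt's words back one by one.
        words = prompt.split()
        if words:
            self.pending[request_id] = words

    def step(self) -> dict[str, str]:
        out = {}
        for request_id in list(self.pending):
            out[request_id] = self.pending[request_id].pop(0)
            if not self.pending[request_id]:
                del self.pending[request_id]
        return out

    def has_unfinished(self) -> bool:
        return bool(self.pending)


def generate_offline(prompts: list[str]) -> list[str]:
    """Blocking front-end: returns only once every prompt is finished."""
    core = ToyCore()
    for i, prompt in enumerate(prompts):
        core.add_request(str(i), prompt)
    texts: dict[str, list[str]] = {str(i): [] for i in range(len(prompts))}
    while core.has_unfinished():
        for request_id, token in core.step().items():
            texts[request_id].append(token)
    return [" ".join(texts[str(i)]) for i in range(len(prompts))]


async def generate_stream(prompt: str) -> AsyncIterator[str]:
    """Async front-end: yields each token as soon as a step produces it."""
    core = ToyCore()
    core.add_request("0", prompt)
    while core.has_unfinished():
        yield core.step()["0"]
        await asyncio.sleep(0)  # hand control back to the event loop


async def collect(prompt: str) -> list[str]:
    return [token async for token in generate_stream(prompt)]
```

The offline caller sees whole completions; the streaming caller sees partial output after every step, which is what a server needs for token-by-token responses.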
The compute chain
EngineCore → Executor (1 or N workers) → Worker (owns one GPU) → ModelRunner
(gpu_model_runner.py: SchedulerOutput → tensors → forward → sample).
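The delegation chain above can be sketched with three small classes. This is an illustration under stated assumptions: the `Toy*` classes, the shape of `ToySchedulerOutput`, the arithmetic "forward pass", and the choice to return rank 0's result are all invented for the sketch and are not taken from vLLM.

```python
from dataclasses import dataclass


@dataclass
class ToySchedulerOutput:
    # request id -> token ids scheduled for this step (hypothetical shape;
    # each request is assumed to have at least one token)
    scheduled_tokens: dict[str, list[int]]


class ToyModelRunner:
    def execute_model(self, sched: ToySchedulerOutput) -> dict[str, int]:
        # SchedulerOutput -> "tensors": one flat token list plus offsets.
        request_ids = list(sched.scheduled_tokens)
        flat: list[int] = []
        offsets = [0]
        for rid in request_ids:
            flat.extend(sched.scheduled_tokens[rid])
            offsets.append(len(flat))
        # "Forward": a per-token function stands in for the model.
        hidden = [2 * token + 1 for token in flat]
        # "Sample": read the last position of each request.
        return {
            rid: hidden[offsets[i + 1] - 1]
            for i, rid in enumerate(request_ids)
        }


class ToyWorker:
    def __init__(self, rank: int) -> None:
        self.rank = rank  # stands in for "owns one GPU"
        self.model_runner = ToyModelRunner()

    def execute_model(self, sched: ToySchedulerOutput) -> dict[str, int]:
        return self.model_runner.execute_model(sched)


class ToyExecutor:
    def __init__(self, num_workers: int = 1) -> None:
        self.workers = [ToyWorker(rank) for rank in range(num_workers)]

    def execute_model(self, sched: ToySchedulerOutput) -> dict[str, int]:
        # Every worker runs the same step; this sketch returns rank 0's result.
        outputs = [worker.execute_model(sched) for worker in self.workers]
        return outputs[0]
```

The engine core only ever talks to the executor, so swapping one worker for N does not change the caller.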
Objects that flow
prompt+SamplingParams → EngineCoreRequest → Request (counters+status) → SchedulerOutput
→ ModelRunnerOutput → RequestOutput.
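To make the hand-offs concrete, here is a sketch that pushes one prompt through each object type for a single step. All field names, the `Toy*` prefix, the word-length "tokenizer", and `one_step` are assumptions for illustration; the real classes carry many more fields.

```python
from dataclasses import dataclass, field


@dataclass
class ToySamplingParams:
    max_tokens: int = 4


@dataclass
class ToyEngineCoreRequest:  # what the front-end sends to the core
    request_id: str
    prompt_token_ids: list[int]
    sampling_params: ToySamplingParams


@dataclass
class ToyRequest:  # the core's mutable bookkeeping (counters + status)
    request_id: str
    prompt_token_ids: list[int]
    max_tokens: int
    num_computed_tokens: int = 0
    status: str = "WAITING"
    output_token_ids: list[int] = field(default_factory=list)


@dataclass
class ToySchedulerOutput:  # what the runner is asked to compute this step
    num_scheduled_tokens: dict[str, int]


@dataclass
class ToyModelRunnerOutput:  # what the runner hands back
    sampled_token_ids: dict[str, int]


@dataclass
class ToyRequestOutput:  # what the caller finally sees
    request_id: str
    text: str
    finished: bool


def one_step(prompt: str, params: ToySamplingParams) -> ToyRequestOutput:
    # prompt + SamplingParams -> EngineCoreRequest (fake tokenizer: word lengths)
    token_ids = [len(word) for word in prompt.split()]
    core_request = ToyEngineCoreRequest("req-0", token_ids, params)
    # EngineCoreRequest -> Request
    request = ToyRequest(
        core_request.request_id,
        core_request.prompt_token_ids,
        core_request.sampling_params.max_tokens,
    )
    request.status = "RUNNING"
    # Request -> SchedulerOutput: schedule every not-yet-computed prompt token
    remaining = len(request.prompt_token_ids) - request.num_computed_tokens
    sched = ToySchedulerOutput({request.request_id: remaining})
    # SchedulerOutput -> ModelRunnerOutput (fake model: echo the token count)
    runner_out = ToyModelRunnerOutput(dict(sched.num_scheduled_tokens))
    # ModelRunnerOutput -> updated Request -> RequestOutput
    request.num_computed_tokens += remaining
    request.output_token_ids.append(runner_out.sampled_token_ids[request.request_id])
    finished = len(request.output_token_ids) >= request.max_tokens
    text = " ".join(f"<{t}>" for t in request.output_token_ids)
    return ToyRequestOutput(request.request_id, text, finished)
```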
Lifecycle
WAITING → RUNNING → FINISHED_* ; on preemption, RUNNING → PREEMPTED → WAITING (Phase 3). Enum: RequestStatus (request.py:315).
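The arrows above amount to a small state machine. The sketch below encodes only those arrows; `ToyStatus`, `ALLOWED`, and `transition` are hypothetical helpers, and the two FINISHED_* members are an illustrative subset rather than a complete copy of the real enum.

```python
import enum


class ToyStatus(enum.Enum):
    WAITING = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    FINISHED_STOPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()

    @property
    def is_finished(self) -> bool:
        return self.name.startswith("FINISHED_")


# Legal moves, following the arrows in the cheatsheet line above.
ALLOWED = {
    ToyStatus.WAITING: {ToyStatus.RUNNING},
    ToyStatus.RUNNING: {
        ToyStatus.PREEMPTED,
        ToyStatus.FINISHED_STOPPED,
        ToyStatus.FINISHED_ABORTED,
    },
    ToyStatus.PREEMPTED: {ToyStatus.WAITING},
}


def transition(current: ToyStatus, new: ToyStatus) -> ToyStatus:
    """Return the new status, or raise if the move is not in ALLOWED."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```

Finished states have no outgoing edges, so any attempt to revive a finished request raises.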
Process model
EngineCore runs in its own process (EngineCoreProc, core.py:835), so its tight loop does not compete with the front-end for the GIL;
detokenization runs in the front-end (server) process, off the hot path.
Key upstream
- entrypoints/llm.py:422 generate
- v1/engine/llm_engine.py:209/287 add_request/step
- v1/engine/core.py:428 step · :337 add_request · :835 EngineCoreProc
- v1/engine/async_llm.py AsyncLLM
- v1/worker/gpu_model_runner.py the runner
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md