Phase 01 — Cheatsheet: Architecture & Request Lifecycle

The journey

LLM.generate / AsyncLLM  ->  EngineCore.step (loop)  ->  Detokenizer  ->  RequestOutput
   step = schedule (Ph3) -> execute_model (Ph4-14) -> sample (Ph9) -> update_from_output (Ph3)
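As a rough illustration of the loop above, here is a minimal, self-contained sketch of one `step` as schedule, execute, sample, update. Everything in it (`ToyEngineCore`, `SketchRequest`, the fake scoring, `run_to_completion`) is hypothetical and only mirrors the shape of the diagram; it is not vLLM's code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity so list.remove() targets this exact object
class SketchRequest:
    request_id: str
    prompt_token_ids: list[int]
    max_tokens: int
    output_token_ids: list[int] = field(default_factory=list)


class ToyEngineCore:
    """Hypothetical stand-in for the step loop; not vLLM's implementation."""

    def __init__(self, max_running: int = 2) -> None:
        self.waiting: deque[SketchRequest] = deque()
        self.running: list[SketchRequest] = []
        self.max_running = max_running

    def add_request(self, request: SketchRequest) -> None:
        self.waiting.append(request)

    def has_unfinished(self) -> bool:
        return bool(self.waiting or self.running)

    def schedule(self) -> list[SketchRequest]:
        # Admit waiting requests until the running batch is full.
        while self.waiting and len(self.running) < self.max_running:
            self.running.append(self.waiting.popleft())
        return list(self.running)

    def execute_model(self, batch: list[SketchRequest]) -> list[int]:
        # Fake "forward pass": one integer score per request.
        return [sum(r.prompt_token_ids) + len(r.output_token_ids) for r in batch]

    def sample(self, scores: list[int]) -> list[int]:
        # Fake sampling: map each score to a token id below 100.
        return [s % 100 for s in scores]

    def update_from_output(
        self, batch: list[SketchRequest], token_ids: list[int]
    ) -> list[SketchRequest]:
        # Append the new token to each request and retire finished ones.
        finished = []
        for request, token_id in zip(batch, token_ids):
            request.output_token_ids.append(token_id)
            if len(request.output_token_ids) >= request.max_tokens:
                self.running.remove(request)
                finished.append(request)
        return finished

    def step(self) -> list[SketchRequest]:
        batch = self.schedule()
        if not batch:
            return []
        scores = self.execute_model(batch)
        token_ids = self.sample(scores)
        return self.update_from_output(batch, token_ids)


def run_to_completion(engine: ToyEngineCore) -> list[SketchRequest]:
    # Drive the loop until nothing is waiting or running.
    done = []
    while engine.has_unfinished():
        done.extend(engine.step())
    return done
```

Each call to `step` advances every running request by one token, which is why a short request can finish and free its slot for a waiting one while longer requests keep going.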

Entry points

  • Offline: LLM(model=...).generate(prompts) (entrypoints/llm.py:422) → LLMEngine.
  • Online: OpenAI server → AsyncLLM (v1/engine/async_llm.py). Same core, async + streaming.

The compute chain

EngineCore → Executor (1 or N workers) → Worker (owns one GPU) → ModelRunner (gpu_model_runner.py: SchedulerOutput → tensors → forward → sample).

Objects that flow

prompt + SamplingParams → EngineCoreRequest → Request (counters+status) → SchedulerOutput → ModelRunnerOutput → RequestOutput.
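To make the hand-offs in this chain concrete, the sketch below passes one request through toy versions of each object. All class names carry a `Toy` prefix and every field, helper, and the one-character-per-token "tokenizer" is an assumption made for illustration; the real vLLM classes carry far more state.

```python
from dataclasses import dataclass, field


@dataclass
class ToySamplingParams:
    max_tokens: int = 3


@dataclass
class ToyEngineCoreRequest:  # what the front end hands to the core
    request_id: str
    prompt_token_ids: list[int]
    params: ToySamplingParams


@dataclass
class ToyRequest:  # core-side record: adds counters and status
    request_id: str
    prompt_token_ids: list[int]
    max_tokens: int
    num_computed_tokens: int = 0
    status: str = "WAITING"
    output_token_ids: list[int] = field(default_factory=list)


@dataclass
class ToySchedulerOutput:  # how many tokens to run per request this step
    num_scheduled_tokens: dict[str, int]


@dataclass
class ToyModelRunnerOutput:  # one sampled token id per request
    sampled_token_ids: dict[str, int]


@dataclass
class ToyRequestOutput:  # what the caller finally sees
    request_id: str
    text: str
    finished: bool


def make_core_request(request_id: str, prompt: str, params: ToySamplingParams):
    # Toy tokenizer (assumption): one token per character.
    return ToyEngineCoreRequest(request_id, [ord(ch) for ch in prompt], params)


def admit(core_req: ToyEngineCoreRequest) -> ToyRequest:
    return ToyRequest(
        core_req.request_id, core_req.prompt_token_ids, core_req.params.max_tokens
    )


def schedule(req: ToyRequest) -> ToySchedulerOutput:
    # First step covers the whole prompt; later steps cover one new token.
    total = len(req.prompt_token_ids) + len(req.output_token_ids)
    req.status = "RUNNING"
    return ToySchedulerOutput({req.request_id: total - req.num_computed_tokens})


def run_model(sched: ToySchedulerOutput, req: ToyRequest) -> ToyModelRunnerOutput:
    # Fake sampling: emit 'a', 'b', 'c', ... in order.
    return ToyModelRunnerOutput({req.request_id: ord("a") + len(req.output_token_ids)})


def update(req: ToyRequest, sched: ToySchedulerOutput, out: ToyModelRunnerOutput):
    # Advance counters, record the token, and build the caller-facing output.
    req.num_computed_tokens += sched.num_scheduled_tokens[req.request_id]
    req.output_token_ids.append(out.sampled_token_ids[req.request_id])
    if len(req.output_token_ids) >= req.max_tokens:
        req.status = "FINISHED"
    text = "".join(chr(t) for t in req.output_token_ids)
    return ToyRequestOutput(req.request_id, text, req.status == "FINISHED")
```

Calling `schedule`, `run_model`, and `update` repeatedly on the admitted request until `finished` is true walks the whole chain once per generated token.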

Lifecycle

WAITING → RUNNING → FINISHED_*; on preemption, RUNNING → PREEMPTED → WAITING (Phase 3). Statuses are defined in RequestStatus (request.py:315).
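The transitions above can be sketched as a small state machine. This is a toy under stated assumptions: `ToyRequestStatus`, `ALLOWED`, and `transition` are hypothetical names, the two `FINISHED_` members merely stand in for the `FINISHED_*` family, and the transition table encodes only the arrows listed here rather than everything the real `RequestStatus` permits.

```python
from enum import Enum, auto


class ToyRequestStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    PREEMPTED = auto()
    FINISHED_STOPPED = auto()
    FINISHED_LENGTH_CAPPED = auto()

    @property
    def is_finished(self) -> bool:
        # Any FINISHED_* member is terminal.
        return self.name.startswith("FINISHED_")


S = ToyRequestStatus

# Only the arrows from the cheatsheet; terminal states have no outgoing edges.
ALLOWED = {
    S.WAITING: {S.RUNNING},
    S.RUNNING: {S.PREEMPTED, S.FINISHED_STOPPED, S.FINISHED_LENGTH_CAPPED},
    S.PREEMPTED: {S.WAITING},
}


def transition(current: ToyRequestStatus, new: ToyRequestStatus) -> ToyRequestStatus:
    """Return the new status, or raise if the move is not in the table."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```

A preempted request goes back through WAITING and must be scheduled again before it can finish, so the table deliberately has no direct PREEMPTED → RUNNING edge.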

Process model

EngineCore runs in its own process (EngineCoreProc, core.py:835): a tight loop that does not contend for the front end's GIL. Detokenization runs on the API-server side, in the front-end process, off the hot path.

Key upstream

  • entrypoints/llm.py:422 generate · v1/engine/llm_engine.py:209/287 add_request/step
  • v1/engine/core.py:428 step · :337 add_request · :835 EngineCoreProc
  • v1/engine/async_llm.py AsyncLLM · v1/worker/gpu_model_runner.py the runner

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md