Phase 01 — Exercises: Architecture & Request Lifecycle
Warm-up (explain)
1. Name the four stages of `EngineCore.step` and the course phase that owns each.
2. What's the difference between `LLM` and `AsyncLLM`? What do they share?
3. List the objects a request becomes: prompt → ? → ? → ? → `RequestOutput`.
Core (trace the code)
4. In `EngineCore.step` (core.py:428), which stage can return `None`, and what is called then?
5. Who owns the GPU: Executor, Worker, or ModelRunner? What does each do?
6. Why does V1 run `EngineCore` in its own process? What crosses the boundary?
Build (your lab)
7. In lab-01, at which step does `num_computed_tokens` first equal the prompt length, and why?
8. Extend `trace_request` to trace two requests at once; observe how the scheduler interleaves them across steps (continuous batching, Phase 3).
9. Add a `WAITING` snapshot (before the first schedule) to your trace. Why is there usually only one WAITING tick for a lone request on an idle engine?
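
If you want a feel for what a two-request trace might look like before touching the lab code, here is a toy sketch in plain Python. It does not use vLLM or the lab's `trace_request`; `ToyRequest`, `toy_step`, `trace_toy`, and the `token_budget` parameter are hypothetical names invented for illustration, and the scheduling rule (first come, first served under a fixed per-step token budget) is an assumption, not the real scheduler's policy.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request; not the lab's or vLLM's class."""
    req_id: str
    prompt_len: int
    max_new_tokens: int
    num_computed_tokens: int = 0
    num_generated: int = 0
    status: str = "WAITING"


def toy_step(requests, token_budget):
    """One toy step: hand out up to token_budget tokens in list order."""
    scheduled = {}
    for req in requests:
        if req.status == "FINISHED" or token_budget == 0:
            continue
        remaining_prompt = req.prompt_len - req.num_computed_tokens
        # Prefill takes as much prompt as the budget allows; decode takes one token.
        n = min(remaining_prompt, token_budget) if remaining_prompt > 0 else 1
        token_budget -= n
        req.status = "RUNNING"
        req.num_computed_tokens += n
        scheduled[req.req_id] = n
        # A token is sampled only once the whole prompt has been computed.
        if req.num_computed_tokens >= req.prompt_len:
            req.num_generated += 1
            if req.num_generated >= req.max_new_tokens:
                req.status = "FINISHED"
    return scheduled


def trace_toy(requests, token_budget, max_steps=20):
    """Record (step, {req_id: (status, num_computed_tokens)}) per step."""
    def snapshot():
        return {r.req_id: (r.status, r.num_computed_tokens) for r in requests}

    rows = [(0, snapshot())]  # row 0 is the snapshot before the first schedule
    for step in range(1, max_steps + 1):
        if all(r.status == "FINISHED" for r in requests):
            break
        toy_step(requests, token_budget)
        rows.append((step, snapshot()))
    return rows


if __name__ == "__main__":
    reqs = [ToyRequest("A", prompt_len=5, max_new_tokens=2),
            ToyRequest("B", prompt_len=3, max_new_tokens=2)]
    for step, snap in trace_toy(reqs, token_budget=4):
        print(step, snap)
```

With a budget smaller than the first prompt, the second request stays in the toy `WAITING` state for an extra step; compare that with what your real trace shows and note where the actual scheduler behaves differently.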
Design (staff-level)
10. A user reports high TTFT (time to first token) but normal ITL (inter-token latency). Which stage(s) of `step` would you investigate, and which phase's knobs (2/3/5) would you reach for?
11. You're asked to add a new API surface (e.g. a gRPC endpoint). Which layer do you build it at, and what must it produce/consume to reuse the existing core unchanged?
12. Explain why detokenization runs off the core's hot path in the server. What would break if it ran inside `EngineCore.step`?
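
For the TTFT/ITL question above, it can help to pin down how the two numbers are computed from raw timestamps. The sketch below is a minimal, assumed definition (TTFT as first-token time minus arrival time, ITL as the mean gap between consecutive later tokens); `ttft_and_itl` is a hypothetical helper, not part of vLLM or the lab, and real benchmarks may define or aggregate these metrics differently.

```python
def ttft_and_itl(arrival_time, token_times):
    """Return (ttft, mean_itl) in the same time unit as the inputs.

    arrival_time: when the request reached the server.
    token_times:  ascending timestamps at which each output token was emitted.
    mean_itl is None when fewer than two tokens were produced.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    # Time to first token: covers queueing plus prefill for this request.
    ttft = token_times[0] - arrival_time
    # Inter-token latency: gaps between consecutive tokens after the first.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else None
    return ttft, mean_itl


if __name__ == "__main__":
    # A request that waits 2.0 s for its first token, then streams steadily.
    print(ttft_and_itl(0.0, [2.0, 2.5, 3.0, 3.5]))
```

Seen this way, TTFT is dominated by whatever happens before the first token is sampled, while ITL only reflects the steps after it, which is a useful frame for deciding where in `step` to look.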
Self-grading
Exercises 4–6 and 10–12 are interview-grade. Could you draw the full request journey and name every file? If not, re-read `01-deep-dive.md` §"The whole journey, named".