Phase 01 — Exercises: Architecture & Request Lifecycle

Contents

Warm-up (explain)
Core (trace the code)
Build (your lab)
Design (staff-level)
Self-grading

Warm-up (explain)

Name the four stages of EngineCore.step and the course phase that owns each.
What's the difference between LLM and AsyncLLM? What do they share?
List the objects a request becomes: prompt → ? → ? → ? → RequestOutput.

Core (trace the code)

In EngineCore.step (core.py:428), which stage can return None, and what is called then?
Who owns the GPU: Executor, Worker, or ModelRunner? What does each do?
Why does V1 run EngineCore in its own process? What crosses the boundary?

Build (your lab)

In lab-01, at which step does num_computed_tokens first equal the prompt length, and why?
Extend trace_request to trace two requests at once; observe how the scheduler interleaves them across steps (continuous batching, Phase 3).
Add a WAITING snapshot (before the first schedule) to your trace. Why is there usually only one WAITING tick for a lone request on an idle engine?
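
For the two-request exercise above, it can help to watch interleaving in a model small enough to read in one sitting. The sketch below is a toy and not vLLM's scheduler: ToyRequest, toy_step, trace_two, and the token_budget parameter are hypothetical names invented for illustration, and only num_computed_tokens echoes the counter named in the exercises. It assumes a fixed per-step token budget handed out in arrival order, and that a request samples one new token whenever its computed tokens catch up with its known tokens. Your lab-01 trace may differ depending on how the engine is configured.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request; not a vLLM class."""
    req_id: str
    prompt_len: int
    max_new_tokens: int
    num_computed_tokens: int = 0
    num_generated: int = 0

    @property
    def finished(self) -> bool:
        return self.num_generated >= self.max_new_tokens


def toy_step(requests, token_budget):
    """Run one toy step and return {req_id: tokens scheduled this step}."""
    scheduled = {}
    for req in requests:
        if req.finished or token_budget == 0:
            continue
        # Tokens known so far: the prompt plus everything generated.
        total = req.prompt_len + req.num_generated
        # Schedule as much of the backlog as the remaining budget allows.
        n = min(total - req.num_computed_tokens, token_budget)
        token_budget -= n
        req.num_computed_tokens += n
        scheduled[req.req_id] = n
        # Once computation has caught up, the step samples one new token.
        if req.num_computed_tokens == total:
            req.num_generated += 1
    return scheduled


def trace_two(requests, token_budget):
    """Step until every request finishes; return the per-step schedule."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    trace = []
    while not all(r.finished for r in requests):
        trace.append(toy_step(requests, token_budget))
    return trace


if __name__ == "__main__":
    reqs = [ToyRequest("A", prompt_len=6, max_new_tokens=2),
            ToyRequest("B", prompt_len=3, max_new_tokens=2)]
    for i, sched in enumerate(trace_two(reqs, token_budget=4), start=1):
        print(f"step {i}: {sched}")
```

In this toy, request B starts its prompt in the same step in which A finishes its own, and both then advance one token per step. That sharing of a single step between requests at different points in their lifetimes is the pattern to look for in your real trace.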

Design (staff-level)

A user reports high TTFT but normal ITL. Which stage(s) of step would you investigate, and which phase's knobs (2/3/5) would you reach for?
You're asked to add a new API surface (e.g. a gRPC endpoint). Which layer do you build it at, and what must it produce/consume to reuse the existing core unchanged?
Explain why detokenization runs off the core's hot path in the server. What would break if it ran inside EngineCore.step?
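
Before reasoning about the TTFT/ITL question above, it may help to pin down the two metrics from raw timestamps. The sketch below assumes you have a request's arrival time and the time each output token was emitted; ttft_and_itl and the sample numbers are hypothetical, made up for illustration, and are not measurements from any engine.

```python
from __future__ import annotations

from statistics import mean


def ttft_and_itl(arrival, token_times):
    """Return (TTFT, mean ITL) in the same unit as the inputs.

    TTFT is the delay from arrival to the first output token.
    ITL is the mean gap between consecutive output tokens, or None
    when fewer than two tokens were produced.
    """
    if not token_times:
        raise ValueError("need at least one output token timestamp")
    ttft = token_times[0] - arrival
    gaps = [later - earlier
            for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else None
    return ttft, itl


if __name__ == "__main__":
    # Hypothetical timestamps in milliseconds: a slow first token,
    # then a steady stream of later tokens.
    ttft, itl = ttft_and_itl(0, [800, 825, 850, 875])
    print(f"TTFT = {ttft} ms, mean ITL = {itl} ms")
```

Everything that happens before the first token lands in TTFT, while ITL only sees the gaps afterwards. A report of high TTFT with normal ITL therefore narrows the search to whatever a request goes through before its first token is emitted.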

Self-grading

Exercises 4–6 (Core) and 10–12 (Design) are interview-grade, counting the exercises consecutively from the first Warm-up question. Could you draw the full request journey and name every file? If not, re-read 01-deep-dive.md §"The whole journey, named".