Phase 01 — Exercises: Architecture & Request Lifecycle

Warm-up (explain)

  1. Name the four stages of EngineCore.step and the course phase that owns each.
  2. What's the difference between LLM and AsyncLLM? What do they share?
  3. List the objects a request becomes: prompt → ? → ? → ? → RequestOutput.

Core (trace the code)

  1. In EngineCore.step (core.py:428), which stage can return None, and what is called then?
  2. Who owns the GPU: Executor, Worker, or ModelRunner? What does each do?
  3. Why does V1 run EngineCore in its own process? What crosses the boundary?

Build (your lab)

  1. In lab-01, at which step does num_computed_tokens first equal the prompt length, and why?
  2. Extend trace_request to trace two requests at once; observe how the scheduler interleaves them across steps (continuous batching, Phase 3).
  3. Add a WAITING snapshot (before the first schedule) to your trace. Why is there usually only one WAITING tick for a lone request on an idle engine?
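Before attempting the two-request trace, it can help to see interleaving in a toy model. The sketch below is not vLLM code: `ToyRequest`, `toy_step`, `trace_two_toy_requests`, the status strings, and the shared per-step token budget are all hypothetical stand-ins, chosen only to show how two requests can share one step. Only the field name `num_computed_tokens` is borrowed from the lab.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    # Hypothetical stand-in for a real request object; not the vLLM class.
    req_id: str
    prompt_len: int
    max_new_tokens: int
    num_computed_tokens: int = 0
    num_generated: int = 0
    status: str = "WAITING"

    @property
    def finished(self) -> bool:
        return self.num_generated >= self.max_new_tokens


def toy_step(requests, token_budget):
    """One toy step: hand out tokens from a shared budget, in list order."""
    scheduled = {}
    for req in requests:
        if req.finished or token_budget == 0:
            continue
        remaining_prompt = req.prompt_len - req.num_computed_tokens
        # Prefill wants the rest of the prompt; decode wants exactly one token.
        want = remaining_prompt if remaining_prompt > 0 else 1
        n = min(want, token_budget)
        token_budget -= n
        req.status = "RUNNING"
        req.num_computed_tokens += n
        scheduled[req.req_id] = n
        # A new token is sampled only once the whole prompt has been computed.
        if req.num_computed_tokens >= req.prompt_len:
            req.num_generated += 1
            if req.finished:
                req.status = "FINISHED"
    return scheduled


def trace_two_toy_requests(token_budget=8):
    """Run both toy requests to completion and record tokens scheduled per step."""
    reqs = [
        ToyRequest("a", prompt_len=6, max_new_tokens=3),
        ToyRequest("b", prompt_len=10, max_new_tokens=2),
    ]
    trace = []
    while not all(r.finished for r in reqs):
        trace.append(toy_step(reqs, token_budget))
    return trace


if __name__ == "__main__":
    for i, scheduled in enumerate(trace_two_toy_requests()):
        print(f"step {i}: {scheduled}")
```

With a budget of 8, request "b" has its prompt split across three steps while "a" is already decoding, so a single step mixes prefill and decode work. Compare this shape with what your real trace shows; the actual scheduler's policy and budget are likely to differ.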

Design (staff-level)

  1. A user reports high TTFT but normal ITL. Which stage(s) of step would you investigate, and which phase's knobs (2/3/5) would you reach for?
  2. You're asked to add a new API surface (e.g. a gRPC endpoint). Which layer do you build it at, and what must it produce/consume to reuse the existing core unchanged?
  3. Explain why detokenization runs off the core's hot path in the server. What would break if it ran inside EngineCore.step?
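For the first design question, it helps to pin down the two metrics before deciding where to look. A minimal sketch, assuming you have recorded a request's arrival time and the wall-clock time at which each output token was emitted; the helper name `ttft_and_mean_itl` and its arguments are hypothetical, not part of any library.

```python
def ttft_and_mean_itl(arrival_time, token_times):
    """Return (TTFT, mean ITL) in the same time unit as the inputs.

    TTFT is the delay from arrival to the first output token.
    ITL is the gap between consecutive output tokens; the mean is None
    when fewer than two tokens were produced.
    """
    if not token_times:
        raise ValueError("need at least one output token timestamp")
    ttft = token_times[0] - arrival_time
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else None
    return ttft, mean_itl


if __name__ == "__main__":
    # Arrived at t=0.0 s, first token at 2.0 s, then one token every 0.5 s.
    print(ttft_and_mean_itl(0.0, [2.0, 2.5, 3.0, 3.5]))  # (2.0, 0.5)
```

In this example, everything that happens before the first token (queueing plus prompt processing) lands in TTFT, while the steady per-token cadence lands in ITL. That split is the starting point for deciding which stage to investigate.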

Self-grading

Counting the exercises continuously across sections (1–12), numbers 4–6 (Core) and 10–12 (Design) are interview-grade. Could you draw the full request journey and name every file? If not, re-read 01-deep-dive.md §"The whole journey, named".