Phase 01 — Interview Questions: Architecture & Request Lifecycle

Q1. Walk me through what happens between LLM.generate(prompt) and the first token.

Model answer

generate tokenizes the prompt and builds an EngineCoreRequest; add_request wraps it in a Request and enqueues it in the scheduler. Then the engine loops EngineCore.step: the scheduler picks the request and how many tokens to compute (the whole prompt, as prefill), the executor runs the model on the assembled batch via a worker/model-runner, the sampler produces the first token, and update_from_output advances num_computed_tokens and records the token. The output processor detokenizes and returns/streams it. (llm.py:422 → core.py:428.)
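
The path from prompt to first token can be sketched as a toy model. This is not vLLM's code: ToyRequest, ToyEngineCore, toy_tokenize, and toy_generate are hypothetical stand-ins, the "tokenizer" maps each word to its length, and the "model" simply sums the prompt token ids.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    """Hypothetical stand-in for the object the scheduler operates on."""
    request_id: str
    prompt_token_ids: list[int]
    output_token_ids: list[int] = field(default_factory=list)
    num_computed_tokens: int = 0


def toy_tokenize(prompt: str) -> list[int]:
    # Assumption: one token per whitespace-separated word, id = word length.
    return [len(word) for word in prompt.split()]


class ToyEngineCore:
    def __init__(self) -> None:
        self.waiting: deque[ToyRequest] = deque()

    def add_request(self, request_id: str, prompt_token_ids: list[int]) -> None:
        # Wrap the incoming data in a request object and enqueue it.
        self.waiting.append(ToyRequest(request_id, prompt_token_ids))

    def step(self) -> tuple[str, int]:
        # Schedule: pick the next request and compute its whole prompt (prefill).
        request = self.waiting.popleft()
        num_scheduled = len(request.prompt_token_ids) - request.num_computed_tokens
        # Execute + sample: a fake "model" whose next token is the sum of its inputs mod 100.
        next_token = sum(request.prompt_token_ids) % 100
        # Update: advance the counter and record the sampled token.
        request.num_computed_tokens += num_scheduled
        request.output_token_ids.append(next_token)
        return request.request_id, next_token


def toy_generate(prompt: str) -> int:
    """Tokenize, enqueue, and run one step to get the first token."""
    core = ToyEngineCore()
    core.add_request("req-0", toy_tokenize(prompt))
    _, first_token = core.step()
    return first_token
```

Calling toy_generate("hello big world") tokenizes to [5, 3, 5] and returns 13 after a single step, which mirrors the point that the first token comes out of the same step that prefills the prompt.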

Q2. What are the four stages of the engine step?

Model answer

schedule() (who runs, how many tokens — Phase 3), execute_model() (run the forward pass on a worker/model-runner — Phases 4–14), sample_tokens() (pick the next token — Phase 9), and update_from_output() (advance counters, reap finished requests — Phase 3). Everything in vLLM is a deep dive into one of these. (core.py:428.)
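
A minimal sketch of the four stages as plain functions, assuming a per-step token budget and a fake model that always samples token 7. The function names mirror the stages above, but the signatures, ToyRequest, and the budget logic are illustrative assumptions, not vLLM's actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    """Hypothetical request with the counters the stages need."""
    request_id: str
    prompt_len: int
    max_new_tokens: int
    num_computed_tokens: int = 0
    output_token_ids: list[int] = field(default_factory=list)


def schedule(running: list[ToyRequest], token_budget: int) -> dict[str, int]:
    """Stage 1: decide who runs and how many tokens each gets this step."""
    scheduled: dict[str, int] = {}
    for req in running:
        total = req.prompt_len + len(req.output_token_ids)
        n = min(total - req.num_computed_tokens, token_budget)
        if n > 0:
            scheduled[req.request_id] = n
            token_budget -= n
    return scheduled


def execute_model(scheduled: dict[str, int]) -> dict[str, int]:
    """Stage 2: stand-in for the forward pass; echoes tokens processed."""
    return dict(scheduled)


def sample_tokens(model_output: dict[str, int]) -> dict[str, int]:
    """Stage 3: pick the next token (assumption: always token id 7)."""
    return {request_id: 7 for request_id in model_output}


def update_from_output(running: list[ToyRequest], scheduled: dict[str, int],
                       sampled: dict[str, int]) -> list[str]:
    """Stage 4: advance counters, append tokens, reap finished requests."""
    finished: list[str] = []
    for req in list(running):
        n = scheduled.get(req.request_id, 0)
        if n == 0:
            continue
        total_before = req.prompt_len + len(req.output_token_ids)
        req.num_computed_tokens += n
        # Only keep the sampled token once every known token has been computed
        # (a partially prefilled prompt produces no usable token yet).
        if req.num_computed_tokens == total_before:
            req.output_token_ids.append(sampled[req.request_id])
        if len(req.output_token_ids) >= req.max_new_tokens:
            running.remove(req)
            finished.append(req.request_id)
    return finished


def step(running: list[ToyRequest], token_budget: int) -> list[str]:
    scheduled = schedule(running, token_budget)
    model_output = execute_model(scheduled)
    sampled = sample_tokens(model_output)
    return update_from_output(running, scheduled, sampled)
```

With a 4-token prompt, max_new_tokens=2, and a budget of 8, the first step prefills all 4 tokens and samples one token; the second step computes 1 token, samples again, and returns the request id as finished.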

Q3. Executor vs Worker vs ModelRunner — who does what?

Model answer

The Executor (v1/executor/) is the engine's handle to compute; it owns one Worker for single-GPU or N for tensor/pipeline parallel and fans execute_model out to them. A Worker (gpu_worker.py) owns one GPU's device, model shard, and KV cache. The ModelRunner (gpu_model_runner.py) turns a SchedulerOutput into input tensors + attention metadata, runs the (CUDA-graphed) forward pass, and runs the sampler. This indirection is why the same engine runs on 1 or 64 GPUs — only the Executor changes.
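
The layering can be illustrated with three toy classes. This is a sketch under loud assumptions: ToyExecutor, ToyWorker, and ToyModelRunner are hypothetical, the "forward pass" just sums token ids, and every worker is assumed to compute the same result so the executor returns the first one.

```python
class ToyModelRunner:
    """Builds inputs from scheduled work, runs a fake forward pass, samples."""

    def __init__(self, rank: int) -> None:
        self.rank = rank

    def execute_model(self, scheduler_output: dict[str, list[int]]) -> dict[str, int]:
        # Fake "model + sampler": the next token is the sum of the input ids.
        return {request_id: sum(token_ids)
                for request_id, token_ids in scheduler_output.items()}


class ToyWorker:
    """Owns one device's state (here: just a rank and a model runner)."""

    def __init__(self, rank: int) -> None:
        self.rank = rank
        self.model_runner = ToyModelRunner(rank)

    def execute_model(self, scheduler_output: dict[str, list[int]]) -> dict[str, int]:
        return self.model_runner.execute_model(scheduler_output)


class ToyExecutor:
    """The engine's single handle to compute; fans work out to N workers."""

    def __init__(self, num_workers: int) -> None:
        self.workers = [ToyWorker(rank) for rank in range(num_workers)]

    def execute_model(self, scheduler_output: dict[str, list[int]]) -> dict[str, int]:
        outputs = [worker.execute_model(scheduler_output) for worker in self.workers]
        # Assumption: all ranks agree, so the first worker's output is returned.
        return outputs[0]
```

The caller sees the same interface regardless of worker count: ToyExecutor(1).execute_model({"a": [1, 2, 3]}) and ToyExecutor(4).execute_model({"a": [1, 2, 3]}) both return {"a": 6}.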

Q4. Why does V1 isolate EngineCore in its own process?

Model answer

To keep the tight GPU scheduling loop off the API server's event loop and free of GIL contention with HTTP handling and detokenization, and to cleanly coordinate multi-GPU worker processes. Requests cross the boundary as serialized EngineCoreRequests and results as EngineCoreOutputs; output processing/detokenization runs server-side so it never stalls the core. (EngineCoreProc, core.py:835.)
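
The boundary itself can be sketched as two queues carrying serialized messages. Everything here is an illustrative assumption: a thread stands in for the separate core process so the example stays portable, pickle and queue.Queue stand in for whatever serialization and transport the real engine uses, and core_loop, run_once, and the None shutdown sentinel are hypothetical names.

```python
import pickle
import queue
import threading


def core_loop(input_queue: queue.Queue, output_queue: queue.Queue) -> None:
    """Toy core loop: receive serialized requests, emit serialized outputs."""
    while True:
        payload = input_queue.get()
        if payload is None:  # hypothetical shutdown sentinel
            break
        request = pickle.loads(payload)
        # Fake "model": the next token id is the prompt length.
        output = {
            "request_id": request["request_id"],
            "new_token_ids": [len(request["prompt_token_ids"])],
        }
        output_queue.put(pickle.dumps(output))


def run_once(prompt_token_ids: list[int]) -> str:
    """Client side: serialize a request, wait for the output, detokenize."""
    input_queue: queue.Queue = queue.Queue()
    output_queue: queue.Queue = queue.Queue()
    core = threading.Thread(target=core_loop, args=(input_queue, output_queue))
    core.start()

    request = {"request_id": "req-0", "prompt_token_ids": prompt_token_ids}
    input_queue.put(pickle.dumps(request))
    output = pickle.loads(output_queue.get(timeout=5))

    input_queue.put(None)
    core.join()
    # Output processing stays on the client side of the boundary.
    return f"<tok{output['new_token_ids'][0]}>"
```

Because only bytes cross the queues, the core loop never touches client-side objects, and the string formatting that plays the role of detokenization happens outside it. A thread does not remove GIL contention, so this shows only the message-passing shape, not the performance benefit.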

Q5. How do offline batch and online serving share code?

Model answer

Both are thin shells over the same EngineCore. LLM/LLMEngine is the synchronous batch shell (add_request + pump step), AsyncLLM is the async/streaming shell for the OpenAI server. The scheduling, execution, and sampling are identical; only the entry/exit (sync vs async, full result vs streamed deltas) differ.
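
A minimal sketch of two shells over one core, assuming a toy core that replays a preset, non-empty token list per request. ToyCore, ToySyncShell, and ToyAsyncShell are hypothetical names, and the async shell handles a single request at a time for brevity.

```python
import asyncio
from collections.abc import AsyncIterator


class ToyCore:
    """Shared core: emits one token per unfinished request on each step."""

    def __init__(self) -> None:
        self.pending: dict[str, list[str]] = {}

    def add_request(self, request_id: str, tokens: list[str]) -> None:
        # Assumption: tokens is non-empty; the "model" just replays it.
        self.pending[request_id] = list(tokens)

    def has_unfinished(self) -> bool:
        return bool(self.pending)

    def step(self) -> list[tuple[str, str, bool]]:
        outputs = []
        for request_id in list(self.pending):
            token = self.pending[request_id].pop(0)
            finished = not self.pending[request_id]
            if finished:
                del self.pending[request_id]
            outputs.append((request_id, token, finished))
        return outputs


class ToySyncShell:
    """Batch shell: add every request, pump step, return full results."""

    def __init__(self) -> None:
        self.core = ToyCore()

    def generate(self, prompts: dict[str, list[str]]) -> dict[str, list[str]]:
        results: dict[str, list[str]] = {rid: [] for rid in prompts}
        for request_id, tokens in prompts.items():
            self.core.add_request(request_id, tokens)
        while self.core.has_unfinished():
            for request_id, token, _ in self.core.step():
                results[request_id].append(token)
        return results


class ToyAsyncShell:
    """Streaming shell: yield each token as a delta as soon as it exists."""

    def __init__(self) -> None:
        self.core = ToyCore()

    async def generate(self, request_id: str, tokens: list[str]) -> AsyncIterator[str]:
        self.core.add_request(request_id, tokens)
        finished = False
        while not finished:
            for rid, token, done in self.core.step():
                if rid == request_id:
                    yield token
                    finished = done
            await asyncio.sleep(0)  # hand control back to the event loop
```

ToySyncShell().generate({"a": ["x", "y"]}) returns {"a": ["x", "y"]} in one call, while iterating ToyAsyncShell().generate("a", ["x", "y"]) with async for yields "x" and then "y". Both shells call the same ToyCore.step; only how results leave the shell differs.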

Rapid-fire

  • Offline entry point? LLM.generate (llm.py:422).
  • Online entry point? AsyncLLM behind the OpenAI server.
  • The heartbeat? EngineCore.step (core.py:428).
  • Object the scheduler operates on? Request (with status + counters).
  • What update_from_output does? Advance num_computed_tokens, append tokens, reap finished.