Phase 01 — Interview Questions: Architecture & Request Lifecycle
Q1. Walk me through what happens between LLM.generate(prompt) and the first token.
Model answer
generate tokenizes the prompt and builds an EngineCoreRequest; add_request wraps it in a
Request and enqueues it in the scheduler. The engine then loops on EngineCore.step: the
scheduler picks the request and how many tokens to compute (the whole prompt, as prefill), the
executor runs the model on the assembled batch via a worker/model-runner, the sampler produces
the first token, and update_from_output advances num_computed_tokens and records the token.
The output processor detokenizes and returns/streams it. (llm.py:422 → core.py:428.)
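The prefill-then-decode accounting in that answer can be sketched with a toy request object. This is an illustration only, not vLLM's code: ToyRequest, toy_tokenize, tokens_to_schedule, and toy_step are hypothetical names, and the "model" is a fake arithmetic rule. Only num_computed_tokens mirrors a counter named in the answer.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    """Hypothetical stand-in for the engine's Request (not the vLLM class)."""
    prompt_token_ids: list
    output_token_ids: list = field(default_factory=list)
    num_computed_tokens: int = 0

    @property
    def num_tokens(self):
        return len(self.prompt_token_ids) + len(self.output_token_ids)


def toy_tokenize(prompt):
    # Assumption: one token per whitespace-separated word; the id is the word length.
    return [len(word) for word in prompt.split()]


def tokens_to_schedule(req):
    # Every token the request holds that has not been run through the model yet.
    return req.num_tokens - req.num_computed_tokens


def toy_step(req):
    n = tokens_to_schedule(req)                      # scheduler decision
    all_ids = req.prompt_token_ids + req.output_token_ids
    new_token = sum(all_ids) % 100                   # fake forward pass + sampler
    req.num_computed_tokens += n                     # advance the counter
    req.output_token_ids.append(new_token)           # record the sampled token
    return n, new_token


req = ToyRequest(toy_tokenize("walk me through the request lifecycle"))
prefill_tokens, first_token = toy_step(req)    # first step covers the whole prompt
decode_tokens, second_token = toy_step(req)    # later steps cover one token each
```

In this toy, the first step schedules all six prompt tokens (prefill) and every later step schedules exactly one (decode), which is the pattern the answer describes.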
Q2. What are the four stages of the engine step?
Model answer
schedule() (who runs, how many tokens — Phase 3), execute_model() (run the forward pass on a
worker/model-runner — Phases 4–14), sample_tokens() (pick the next token — Phase 9), and
update_from_output() (advance counters, reap finished requests — Phase 3). Everything in vLLM is
a deep dive into one of these. (core.py:428.)
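The four stages can be laid out as a toy loop. The method names follow the answer above, but ToyEngineCore, ToyRequest, and every method body are invented for illustration; the real scheduler, executor, and sampler are far more involved.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    request_id: str
    prompt_token_ids: list
    max_tokens: int
    output_token_ids: list = field(default_factory=list)
    num_computed_tokens: int = 0


class ToyEngineCore:
    """Hypothetical four-stage loop, not the vLLM EngineCore."""

    def __init__(self):
        self.running = deque()
        self.finished = []

    def add_request(self, req):
        self.running.append(req)

    def schedule(self):
        # Who runs and how many tokens each: all tokens not yet computed.
        return {
            r.request_id: len(r.prompt_token_ids) + len(r.output_token_ids)
            - r.num_computed_tokens
            for r in self.running
        }

    def execute_model(self, scheduled):
        # Fake forward pass: one number per scheduled request.
        return {
            r.request_id: sum(r.prompt_token_ids + r.output_token_ids)
            for r in self.running
            if r.request_id in scheduled
        }

    def sample_tokens(self, logits):
        # Fake sampler: deterministic map from that number to a token id.
        return {rid: value % 50 for rid, value in logits.items()}

    def update_from_output(self, scheduled, sampled):
        still_running = deque()
        for r in self.running:
            r.num_computed_tokens += scheduled[r.request_id]
            r.output_token_ids.append(sampled[r.request_id])
            if len(r.output_token_ids) >= r.max_tokens:
                self.finished.append(r)      # reap finished requests
            else:
                still_running.append(r)
        self.running = still_running

    def step(self):
        scheduled = self.schedule()
        logits = self.execute_model(scheduled)
        sampled = self.sample_tokens(logits)
        self.update_from_output(scheduled, sampled)
        return sampled


core = ToyEngineCore()
core.add_request(ToyRequest("a", [1, 2, 3], max_tokens=2))
core.add_request(ToyRequest("b", [10, 20], max_tokens=1))
first = core.step()    # both requests prefill and emit a token; "b" finishes
second = core.step()   # only "a" is left; it decodes one more token and finishes
```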
Q3. Executor vs Worker vs ModelRunner — who does what?
Model answer
The Executor (v1/executor/) is the engine's handle to compute; it owns one Worker for
single-GPU or N for tensor/pipeline parallel and fans execute_model out to them. A Worker
(gpu_worker.py) owns one GPU's device, model shard, and KV cache. The ModelRunner
(gpu_model_runner.py) turns a SchedulerOutput into input tensors + attention metadata, runs
the (CUDA-graphed) forward pass, and runs the sampler. This indirection is why the same engine
runs on 1 or 64 GPUs — only the Executor changes.
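A minimal sketch of that layering, under loud assumptions: ToyExecutor, ToyWorker, and ToyModelRunner are hypothetical classes, the "forward pass" is a sum, and work is split by striding over token ids purely for illustration (real tensor parallelism shards weights, not tokens). The point is only that the caller talks to one object regardless of worker count.

```python
class ToyModelRunner:
    """Turns a scheduler output into inputs and runs a fake forward pass."""

    def __init__(self, rank, world_size):
        self.rank = rank
        self.world_size = world_size

    def execute_model(self, scheduler_output):
        # Illustrative split only: this rank takes every world_size-th token id.
        token_ids = scheduler_output["token_ids"]
        shard = token_ids[self.rank::self.world_size]
        return sum(shard)   # partial result for this rank


class ToyWorker:
    """Owns one device's state; here that is just its model runner."""

    def __init__(self, rank, world_size):
        self.model_runner = ToyModelRunner(rank, world_size)

    def execute_model(self, scheduler_output):
        return self.model_runner.execute_model(scheduler_output)


class ToyExecutor:
    """The engine's single handle to compute: fans one call out to N workers."""

    def __init__(self, world_size):
        self.workers = [ToyWorker(rank, world_size) for rank in range(world_size)]

    def execute_model(self, scheduler_output):
        partials = [w.execute_model(scheduler_output) for w in self.workers]
        return sum(partials)   # stand-in for combining results across ranks


scheduler_output = {"token_ids": [3, 1, 4, 1, 5, 9, 2, 6]}
single_gpu = ToyExecutor(world_size=1).execute_model(scheduler_output)
four_gpu = ToyExecutor(world_size=4).execute_model(scheduler_output)
```

Both calls return the same value; only the executor's worker count changed, while the calling code stayed identical.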
Q4. Why does V1 isolate EngineCore in its own process?
Model answer
To keep the tight GPU scheduling loop off the API server's event loop and free of GIL contention
with HTTP handling and detokenization, and to cleanly coordinate multi-GPU worker processes.
Requests cross the boundary as serialized EngineCoreRequests and results as EngineCoreOutputs;
output processing/detokenization runs server-side so it never stalls the core. (EngineCoreProc,
core.py:835.)
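The message-passing shape of that boundary can be sketched as follows. Everything here is an assumption made for illustration: a thread stands in for the separate process so the example stays self-contained, JSON stands in for whatever serialization the real engine uses, and ToyCoreRequest, ToyCoreOutput, core_loop, and detokenize are hypothetical names.

```python
import json
import queue
import threading
from dataclasses import asdict, dataclass


@dataclass
class ToyCoreRequest:
    request_id: str
    prompt_token_ids: list


@dataclass
class ToyCoreOutput:
    request_id: str
    new_token_id: int


def core_loop(inbox, outbox):
    # Core side: only bytes come in and go out; no HTTP or detokenization here.
    while True:
        payload = inbox.get()
        if payload is None:                      # shutdown sentinel
            break
        req = ToyCoreRequest(**json.loads(payload))
        token = sum(req.prompt_token_ids) % 100  # fake schedule + forward + sample
        out = ToyCoreOutput(req.request_id, token)
        outbox.put(json.dumps(asdict(out)).encode())


def detokenize(token_id):
    # Server side: turning ids into text happens outside the core loop.
    return f"<tok{token_id}>"


inbox, outbox = queue.Queue(), queue.Queue()
core = threading.Thread(target=core_loop, args=(inbox, outbox))
core.start()

# Front end: serialize the request, send it, then process the reply locally.
request = ToyCoreRequest("req-0", [7, 8, 9])
inbox.put(json.dumps(asdict(request)).encode())
output = ToyCoreOutput(**json.loads(outbox.get(timeout=5)))
text = detokenize(output.new_token_id)

inbox.put(None)
core.join(timeout=5)
```

Because the core only exchanges serialized messages, slow work on the front-end side (here, detokenize) never runs inside the loop that produces tokens.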
Q5. How do offline batch and online serving share code?
Model answer
Both are thin shells over the same EngineCore. LLM/LLMEngine is the synchronous batch shell
(add_request, then pump step); AsyncLLM is the async/streaming shell for the OpenAI server. The
scheduling, execution, and sampling are identical; only the entry and exit differ (sync vs
async, full result vs streamed deltas).
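The "two shells, one core" idea can be sketched as below. ToyCore, ToySyncShell, and ToyAsyncShell are hypothetical classes with a fake token rule; they are not the LLM or AsyncLLM APIs, only an illustration of how a blocking wrapper and a streaming wrapper can drive the same step function.

```python
import asyncio


class ToyCore:
    """Shared core: one step() emits one new token per live request."""

    def __init__(self):
        self.requests = {}

    def add_request(self, request_id, prompt_token_ids, max_tokens):
        self.requests[request_id] = {"last": sum(prompt_token_ids), "left": max_tokens}

    def step(self):
        outputs = {}
        for rid in list(self.requests):
            state = self.requests[rid]
            state["last"] = (state["last"] * 3 + 1) % 17   # fake next token
            state["left"] -= 1
            outputs[rid] = state["last"]
            if state["left"] == 0:
                del self.requests[rid]                     # request finished
        return outputs


class ToySyncShell:
    """Batch style: add the request, pump step() until done, return everything."""

    def __init__(self):
        self.core = ToyCore()

    def generate(self, prompt_token_ids, max_tokens):
        self.core.add_request("r0", prompt_token_ids, max_tokens)
        tokens = []
        while self.core.requests:
            tokens.append(self.core.step()["r0"])
        return tokens


class ToyAsyncShell:
    """Streaming style: same core, but tokens are yielded as they appear."""

    def __init__(self):
        self.core = ToyCore()

    async def generate(self, prompt_token_ids, max_tokens):
        self.core.add_request("r0", prompt_token_ids, max_tokens)
        while self.core.requests:
            yield self.core.step()["r0"]
            await asyncio.sleep(0)   # hand control back to the event loop


async def collect_stream():
    shell = ToyAsyncShell()
    return [tok async for tok in shell.generate([1, 2, 3], max_tokens=4)]


full_result = ToySyncShell().generate([1, 2, 3], max_tokens=4)
streamed = asyncio.run(collect_stream())
```

Both shells produce the same token sequence; they differ only in whether the caller receives it all at once or one token at a time.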
Rapid-fire
- Offline entry point? LLM.generate (llm.py:422).
- Online entry point? AsyncLLM behind the OpenAI server.
- The heartbeat? EngineCore.step (core.py:428).
- Object the scheduler operates on? Request (with status + counters).
- What update_from_output does? Advance num_computed_tokens, append tokens, reap finished.