Lab 16-02 — The OpenAI Server, End to End [GPU-OPT]
The CPU labs built the two text-pipeline stages (detokenizer, tool parser); this lab
runs the whole front door: vllm serve, the OpenAI client, a streamed chat
completion, and a tool call — then traces one request through the server source
(serving_chat.py) so the HTTP layer stops being a fog between you and the engine
you know. The payoff observation: everything from Phase 1 onward sits behind one
async generator call — the server is a translator, not a second engine.
No GPU? Don't panic. The captured exchange below is annotated; the source trace is hardware-free.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, Qwen2.5-0.5B-Instruct, L4, vLLM 0.22.1, trimmed)
- Tracing the request through the source
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Production vLLM is touched through this server far more often than through LLM() —
and most operational questions ("why did this request 400?", "where do sampling
defaults come from?", "what adds the latency between client and first token?") are
server-layer questions. The trace this lab walks — FastAPI route → request
validation → chat-template rendering → AsyncLLM.generate → per-token streaming
through detokenizer/parsers → SSE chunks — is the request's actual itinerary, and
each leg is a place you'll someday debug. The lab's framing question for every leg:
is this translation (server's job) or inference (engine's job)? Keeping that line
sharp is what makes the 20k-line entrypoints directory navigable.
Requirements
uv pip install -e ".[vllm]" openai
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct # small instruct model with tool support
Steps
- Serve (note the parser flags — lab-01's convention pairing):
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 \
    --enable-auto-tool-choice --tool-call-parser hermes
- Stream a chat completion and watch the deltas arrive:
from openai import OpenAI

# api_key is a placeholder; base_url points the client at the local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Say hi in French, one word."}],
    stream=True,
)
for chunk in stream:
    # delta.content is None on the role chunk and on the finish chunk
    print(repr(chunk.choices[0].delta.content), end=" ")
- Force a tool call (define one tool, ask a matching question) and inspect the structured tool_calls in the response: lab-01's parser output, arriving over HTTP.
- Misbehave on purpose: an oversized max_tokens (read the 400's error body; validation is the server's first translation), a wrong model name, and a request with stream=true killed mid-stream (watch the server log the disconnect and the engine abort the request: Phase 1 lab-05's FINISHED_ABORTED, finally observed).
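Step 3 can be sketched with only the standard library, which also makes the JSON the server validates visible. This is a minimal sketch under assumptions: the get_weather tool and its city parameter are hypothetical (they mirror the captured output below), the body follows the OpenAI chat-completions tools schema, and post_chat assumes the server from step 1 is listening on localhost:8000.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: the server from step 1


def build_tool_request(question: str) -> dict:
    """Build a chat-completions body that offers one (hypothetical) tool."""
    return {
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": question}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool; nothing executes it
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
    }


def post_chat(body: dict) -> dict:
    """POST the body to /chat/completions and return the decoded JSON reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Print the request body only; no network call happens here.
body = build_tool_request("What's the weather in Paris?")
print(json.dumps(body, indent=2))
```

With the server running, post_chat(body)["choices"][0] is where to look for finish_reason and message.tool_calls; each call's function.arguments is a JSON string, so decode it with json.loads before use.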
Captured output (real run, Qwen2.5-0.5B-Instruct, L4, vLLM 0.22.1, trimmed)
INFO ... Started server process; Application startup complete. (Uvicorn + FastAPI)
INFO ... "POST /v1/chat/completions HTTP/1.1" 200 OK
None ' Bon' 'jour' ' !' None # deltas: first None = role chunk, last = finish chunk
# tool call response (non-streamed):
"tool_calls": [{"type": "function", "function":
{"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}],
"finish_reason": "tool_calls"
# the deliberate 400:
{"error": {"message": "max_tokens must be at most 32768 ...", "type": "BadRequestError"}}
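The delta line in the capture is what the client library prints after decoding server-sent events. The sketch below parses a hand-written sample in the usual OpenAI streaming shape (data: lines holding JSON, ended by a data: [DONE] sentinel). The sample is illustrative, not captured output, and real chunks carry more fields (id, model, created, and so on).

```python
import json

# Hand-written sample shaped like an OpenAI-style SSE stream; NOT captured output.
SAMPLE_SSE = """\
data: {"choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}

data: {"choices": [{"index": 0, "delta": {"content": " Bon"}, "finish_reason": null}]}

data: {"choices": [{"index": 0, "delta": {"content": "jour"}, "finish_reason": null}]}

data: {"choices": [{"index": 0, "delta": {"content": " !"}, "finish_reason": null}]}

data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

data: [DONE]
"""


def iter_deltas(sse_text):
    """Yield (content, finish_reason) for each JSON chunk in an SSE body."""
    for line in sse_text.splitlines():
        if not line.startswith("data: "):
            continue  # blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel, not JSON
        choice = json.loads(payload)["choices"][0]
        # A delta without a "content" key surfaces as None on the client side.
        yield choice["delta"].get("content"), choice["finish_reason"]


deltas = list(iter_deltas(SAMPLE_SSE))
print(" ".join(repr(content) for content, _ in deltas))
```

The first and last chunks have no content key, which is why the printed sequence starts and ends with None.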
Tracing the request through the source
Open these in order, one request in mind:
- upstream/vllm/entrypoints/openai/api_server.py: the FastAPI route; finds the handler per endpoint. (Translation: HTTP ↔ Python objects.)
- upstream/vllm/entrypoints/openai/serving_chat.py: the heart. create_chat_completion validates, renders the chat template (messages → the model's prompt format, the per-model convention lab-01's parser is the inverse of), builds SamplingParams from the request body (every Phase 9 knob, arriving as JSON), and calls AsyncLLM.generate, the only line where inference happens.
- The streaming loop just below: consumes engine outputs, runs the detokenizer-fed deltas (lab-03's output!) through the tool parser (lab-01!), and yields SSE chunks with finish_reason mapped per Phase 1 lab-05.
- upstream/vllm/v1/engine/async_llm.py: AsyncLLM, the async wrapper over the EngineCore you traced in Phase 1 lab-02. The circle closes.
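The shape of that itinerary fits in a toy: one async generator standing in for the engine, and a second one that only translates. Everything here is a stand-in for illustration (fake_engine_generate, the one-line template render, the chunk fields); it is not the code in serving_chat.py, only the same division of labour.

```python
import asyncio
import json


async def fake_engine_generate(prompt):
    """Stand-in for the engine's async generator; yields (text_delta, finished)."""
    for piece in (" Bon", "jour", " !"):
        await asyncio.sleep(0)  # yield to the event loop between tokens
        yield piece, False
    yield "", True


def sse(payload):
    """Frame one JSON payload as a server-sent event."""
    return f"data: {json.dumps(payload)}\n\n"


async def stream_chat(messages):
    """Translate messages to a prompt, then engine deltas to SSE chunks."""
    # Toy template render; a real chat template is model-specific.
    prompt = "".join(f"{m['role']}: {m['content']}\n" for m in messages)
    # The first chunk announces the role and carries no content.
    yield sse({"choices": [{"delta": {"role": "assistant"}, "finish_reason": None}]})
    # The only place "inference" happens: one async generator call.
    async for text, finished in fake_engine_generate(prompt):
        if finished:
            yield sse({"choices": [{"delta": {}, "finish_reason": "stop"}]})
        else:
            yield sse({"choices": [{"delta": {"content": text}, "finish_reason": None}]})
    yield "data: [DONE]\n\n"


async def collect():
    request = [{"role": "user", "content": "Say hi in French, one word."}]
    return [chunk async for chunk in stream_chat(request)]


chunks = asyncio.run(collect())
print("".join(chunks), end="")
```

Swapping fake_engine_generate for a real engine would leave stream_chat unchanged, which is the sense in which the server is a translator.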
Hitchhiker's notes
- The chat template is the most consequential invisible step: the same messages render differently per model (system-prompt placement, tool-schema injection, generation prompt), and template mismatches are the top cause of "model is dumb via API but fine in the playground." --chat-template overrides it; the template ships in the tokenizer config. The server's prompt is derived state: when debugging quality, print it (add_generation_prompt, the works) before blaming weights.
- finish_reason: "tool_calls" is a third value joining Phase 1 lab-05's "stop"/"length". It is set when the parser extracted calls, and it tells the client to execute them and continue the loop. The enum keeps earning.
- One server, many surfaces: the same process exposes /v1/completions, /v1/chat/completions, embeddings, and (version-dependent) Anthropic-style routes, all translating onto the same AsyncLLM. API multiplexing is cheap because the engine boundary is clean; that's the architectural moral of the whole phase.
- Disconnect handling is a correctness feature: when a client vanishes mid-stream, its request must be aborted (free KV! Phase 2's blocks don't free themselves), and the server's disconnect-watcher → abort_request path is what stands between you and a slow leak under flaky clients. Your step-4 experiment watched it work; know where it lives (api_server's disconnect checks).
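The disconnect note can be made concrete with a toy in plain asyncio. ToyEngine, its live set, and handle_request are hypothetical names for illustration; the sketch only shows the pattern (cancellation of the handler must reach the engine as an abort), not how vLLM implements its disconnect checks.

```python
import asyncio


class ToyEngine:
    """Stand-in engine: tracks live requests, which model held KV blocks."""

    def __init__(self):
        self.live = set()

    async def generate(self, request_id):
        self.live.add(request_id)
        while True:  # never finishes on its own, like a long generation
            await asyncio.sleep(0.01)
            yield "tok"

    def abort(self, request_id):
        self.live.discard(request_id)  # release the request's resources


async def handle_request(engine, request_id, sent):
    """Stream tokens to a client; on cancellation, abort the engine request."""
    try:
        async for token in engine.generate(request_id):
            sent.append(token)
    except asyncio.CancelledError:
        # The client went away: without this call the request stays live.
        engine.abort(request_id)
        raise


async def main():
    engine, sent = ToyEngine(), []
    task = asyncio.create_task(handle_request(engine, "req-1", sent))
    await asyncio.sleep(0.05)  # the client receives a few tokens...
    task.cancel()              # ...then disconnects mid-stream
    try:
        await task
    except asyncio.CancelledError:
        pass
    return sent, engine.live


sent, live = asyncio.run(main())
print(f"sent {len(sent)} tokens; still live: {live}")
```

Deleting the except branch leaves "req-1" in live after the client is gone, which is the slow leak in miniature.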
Reflect
- For each captured artifact, name the layer that produced it: the None role chunk (server's SSE framing), ' Bon' (engine token → lab-03 detokenizer → delta), the structured tool_calls (lab-01's parser), the 400 (validation; never reached the engine). If every artifact has an owner, the fog is gone.
- The OpenAI contract returns arguments as a JSON string, not an object, and your lab-01 parser emitted dicts. Where must the re-serialization live, and why there? (The server's translation layer: the contract is the client's, the dict is internal. Translation owns format debts.)
- What's the latency budget of the server layer itself? Measure: time-to-first-delta minus engine TTFT (from metrics) ≈ template render + validation + HTTP. If that gap grows under load, you're CPU-bound in the front door: a real failure mode (event-loop starvation) that no GPU dashboard will show you.
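For the second question, the format debt is a single json.dumps at the boundary. A minimal sketch, assuming the parser hands over a name plus a dict of arguments; to_openai_tool_call is a hypothetical helper, not a vLLM function.

```python
import json


def to_openai_tool_call(name: str, arguments: dict) -> dict:
    """Translate an internal (name, dict) pair into the wire-format tool call."""
    return {
        "type": "function",
        "function": {
            "name": name,
            # The contract wants a JSON *string* here, not a nested object.
            "arguments": json.dumps(arguments),
        },
    }


wire = to_openai_tool_call("get_weather", {"city": "Paris"})
print(wire["function"]["arguments"])

# The client undoes the translation before executing the tool.
decoded = json.loads(wire["function"]["arguments"])
print(decoded["city"])
```

Keeping the dict inside the server and serializing only at the edge means the parser never has to know which API surface its output is bound for.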
References
- upstream/vllm/entrypoints/openai/serving_chat.py: the file this lab makes readable.
- upstream/vllm/v1/engine/async_llm.py: the engine's async face.
- vLLM docs, OpenAI-Compatible Server (endpoints, flags, template overrides): https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
- Labs 01 and 03 — the two pipeline stages this server composes; Phase 1 lab-02 — the engine loop at the bottom of the stack.