Lab 16-02 — The OpenAI Server, End to End [GPU-OPT]

The CPU labs built the two text-pipeline stages (detokenizer, tool parser); this lab runs the whole front door: vllm serve, the OpenAI client, a streamed chat completion, and a tool call — then traces one request through the server source (serving_chat.py) so the HTTP layer stops being a fog between you and the engine you know. The payoff observation: everything from Phase 1 onward sits behind one async generator call — the server is a translator, not a second engine.

No GPU? Don't panic. The captured exchange below is annotated; the source trace is hardware-free.

Why this lab exists

Production vLLM is touched through this server far more often than through LLM() — and most operational questions ("why did this request 400?", "where do sampling defaults come from?", "what adds the latency between client and first token?") are server-layer questions. The trace this lab walks — FastAPI route → request validation → chat-template rendering → AsyncLLM.generate → per-token streaming through detokenizer/parsers → SSE chunks — is the request's actual itinerary, and each leg is a place you'll someday debug. The lab's framing question for every leg: is this translation (server's job) or inference (engine's job)? Keeping that line sharp is what makes the 20k-line entrypoints directory navigable.
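The itinerary above can be pictured with a toy handler. This is a minimal sketch, not vLLM's code: `fake_engine_generate`, `validate`, `render_prompt`, `handle_chat`, and the `MAX_MODEL_LEN` limit are all hypothetical stand-ins. The point it illustrates is that only one line touches the "engine"; everything around it is translation.

```python
import asyncio
import json
from typing import AsyncIterator

MAX_MODEL_LEN = 32768  # assumed limit for this toy, not read from any real config


async def fake_engine_generate(prompt: str, max_tokens: int) -> AsyncIterator[str]:
    """Stand-in for the engine: yields text pieces one at a time (inference)."""
    for piece in [" Bon", "jour", " !"][:max_tokens]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


def validate(body: dict) -> None:
    """Translation: reject bad requests before the engine ever sees them."""
    if body.get("max_tokens", 16) > MAX_MODEL_LEN:
        raise ValueError(f"max_tokens must be at most {MAX_MODEL_LEN}")


def render_prompt(messages: list[dict]) -> str:
    """Translation: messages -> one prompt string (toy format, not a real template)."""
    return "".join(f"[{m['role']}] {m['content']}\n" for m in messages) + "[assistant] "


async def handle_chat(body: dict) -> AsyncIterator[str]:
    """Toy handler: everything except the generate call is translation."""
    validate(body)
    prompt = render_prompt(body["messages"])
    yield "data: " + json.dumps({"delta": {"role": "assistant"}}) + "\n\n"
    # The only line where "inference" happens in this sketch:
    async for piece in fake_engine_generate(prompt, body.get("max_tokens", 16)):
        yield "data: " + json.dumps({"delta": {"content": piece}}) + "\n\n"
    yield "data: " + json.dumps({"delta": {}, "finish_reason": "stop"}) + "\n\n"
    yield "data: [DONE]\n\n"


async def collect(body: dict) -> list[str]:
    """Drain the handler into a list of SSE frames."""
    return [frame async for frame in handle_chat(body)]


frames = asyncio.run(collect(
    {"messages": [{"role": "user", "content": "Say hi in French, one word."}]}))
print(len(frames), "frames; last:", frames[-1].strip())
```

Passing an oversized `max_tokens` raises inside `validate` before `fake_engine_generate` is ever called, which mirrors the claim that a 400 never reaches the engine.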

Requirements

uv pip install -e ".[vllm]" openai
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct   # small instruct model with tool support

Steps

  1. Serve (note the parser flags — lab-01's convention pairing):
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 \
  --enable-auto-tool-choice --tool-call-parser hermes
  2. Stream a chat completion and watch the deltas arrive:
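from openai import OpenAI

# vLLM only checks the key if the server was started with --api-key; any placeholder works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Say hi in French, one word."}],
    stream=True,
)
for chunk in stream:
    # delta.content is None on the role chunk and on the final finish chunk
    print(repr(chunk.choices[0].delta.content), end=" ")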
  3. Force a tool call (define one tool, ask a matching question) and inspect the structured tool_calls in the response: lab-01's parser output, arriving over HTTP.

  4. Misbehave on purpose: send an oversized max_tokens (read the 400's error body; validation is the server's first translation), a wrong model name, and a request with stream=true killed mid-stream (watch the server log the disconnect and the engine abort the request: Phase 1 lab-05's FINISHED_ABORTED, finally observed).
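Step 3 leaves the tool definition to you. One possible shape is sketched below: the `get_weather` name matches the captured output further down, but its description and parameter schema are assumptions made up for illustration, as are the helper names `build_tool_request` and `extract_calls`. It uses `tool_choice="auto"` because that is the path the serve flags above enable.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(question: str) -> dict:
    """Keyword arguments for a non-streamed chat completion with one tool."""
    return {
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": question}],
        "tools": [GET_WEATHER_TOOL],
        "tool_choice": "auto",
    }


def extract_calls(message: dict) -> list[tuple[str, dict]]:
    """Decode (name, arguments) pairs from a response message given as a dict."""
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in message.get("tool_calls") or []
    ]


request = build_tool_request("What's the weather in Paris?")
print(json.dumps(request, indent=2))
```

With the server from step 1 running, the request would be sent as `client.chat.completions.create(**request)`, and the returned message (converted to a dict) handed to `extract_calls`. A 0.5B model may answer in plain text instead of calling the tool, in which case `extract_calls` returns an empty list.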

Captured output (real run, Qwen2.5-0.5B-Instruct, L4, vLLM 0.22.1, trimmed)

INFO ... Started server process; Application startup complete.    (Uvicorn + FastAPI)
INFO ... "POST /v1/chat/completions HTTP/1.1" 200 OK
None ' Bon' 'jour' ' !' None      # deltas: first None = role chunk, last = finish chunk
# tool call response (non-streamed):
"tool_calls": [{"type": "function", "function":
    {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}],
"finish_reason": "tool_calls"
# the deliberate 400:
{"error": {"message": "max_tokens must be at most 32768 ...", "type": "BadRequestError"}}
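To see where the leading and trailing None in the delta line come from, it helps to look at the wire format underneath the client. The sketch below parses hand-written server-sent-event frames in the OpenAI streaming shape; the frames are illustrative, not captured from the run above, and `iter_deltas` is a hypothetical helper.

```python
import json

# Hand-written chunks in the OpenAI streaming shape (illustrative, not captured).
CHUNKS = [
    {"delta": {"role": "assistant"}, "finish_reason": None},  # role chunk
    {"delta": {"content": " Bon"}, "finish_reason": None},
    {"delta": {"content": "jour"}, "finish_reason": None},
    {"delta": {"content": " !"}, "finish_reason": None},
    {"delta": {}, "finish_reason": "stop"},                   # finish chunk
]
RAW_STREAM = "".join(
    "data: " + json.dumps({"choices": [dict(index=0, **chunk)]}) + "\n\n"
    for chunk in CHUNKS
) + "data: [DONE]\n\n"


def iter_deltas(raw: str):
    """Yield (content, finish_reason) for each SSE data frame until [DONE]."""
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # blank separator lines between frames
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel, not JSON
        choice = json.loads(payload)["choices"][0]
        yield choice["delta"].get("content"), choice["finish_reason"]


deltas = list(iter_deltas(RAW_STREAM))
print(" ".join(repr(content) for content, _ in deltas))
# prints: None ' Bon' 'jour' ' !' None
```

The two None values are chunks whose delta carries no content key at all: the first announces the role, the last carries only the finish_reason.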

Tracing the request through the source

Open these in order, one request in mind:

  1. upstream/vllm/entrypoints/openai/api_server.py: the FastAPI routes; each one looks up the handler for its endpoint. (Translation: HTTP ↔ Python objects.)
  2. upstream/vllm/entrypoints/openai/serving_chat.py — the heart: create_chat_completion validates, renders the chat template (messages → the model's prompt format — the per-model convention lab-01's parser is the inverse of), builds SamplingParams from the request body (every Phase 9 knob, arriving as JSON), and calls AsyncLLM.generate — the only line where inference happens.
  3. The streaming loop just below — consumes engine outputs, runs the detokenizer-fed deltas (lab-03's output!) through the tool parser (lab-01!), and yields SSE chunks with finish_reason mapped per Phase 1 lab-05.
  4. upstream/vllm/v1/engine/async_llm.py: AsyncLLM, the async wrapper over the EngineCore you traced in Phase 1 lab-02. The circle closes.
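The two translations in item 2 (messages to prompt, JSON body to sampling settings) can be pictured with a toy. This is a simplified sketch that assumes ChatML-style markers; it is not Qwen's actual template (which also handles system prompts and tool schemas) and not vLLM's SamplingParams. `render_chatml` and `sampling_from_body` are hypothetical names.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Simplified ChatML-style rendering (assumption: not the model's real template)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Opens the assistant turn so the model continues as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def sampling_from_body(body: dict) -> dict:
    """Copy the sampling knobs the client actually sent; leave the rest unset."""
    knobs = ("temperature", "top_p", "max_tokens", "stop")
    return {k: body[k] for k in knobs if k in body}


body = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hi in French, one word."}],
    "temperature": 0.0,
    "max_tokens": 8,
}
prompt = render_chatml(body["messages"])
print(prompt)
print(sampling_from_body(body))
```

To inspect the real rendered prompt rather than this approximation, the Hugging Face tokenizer's `apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` returns the string produced by the template shipped with the model.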

Hitchhiker's notes

  • The chat template is the most consequential invisible step: the same messages render differently per model (system-prompt placement, tool-schema injection, generation prompt), and template mismatches are the top cause of "model is dumb via API but fine in the playground." --chat-template overrides it; the template ships in the tokenizer config. The server's prompt is derived state — when debugging quality, print it (add_generation_prompt, the works) before blaming weights.
  • finish_reason: "tool_calls" — a third value joining Phase 1 lab-05's "stop"/"length": set when the parser extracted calls, telling the client to execute and continue the loop. The enum keeps earning.
  • One server, many surfaces: the same process exposes /v1/completions, /v1/chat/completions, embeddings, and (version-dependent) Anthropic-style routes — all translating onto the same AsyncLLM. API multiplexing is cheap because the engine boundary is clean; that's the architectural moral of the whole phase.
  • Disconnect handling is a correctness feature: a client that vanishes mid-stream must abort its request (free KV! — Phase 2's blocks don't free themselves), and the server's disconnect-watcher → abort_request path is what stands between you and a slow leak under flaky clients. Your step-4 experiment watched it work; know where it lives (api_server's disconnect checks).
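The disconnect path in the last note can be sketched with plain asyncio. This is a toy under stated assumptions, not the server's implementation: `toy_generate`, `abort_request`, and `stream_to_client` are hypothetical, and a cancelled task stands in for a client that drops the connection.

```python
import asyncio

aborted: list[str] = []  # request ids the toy engine was told to abort


def abort_request(request_id: str) -> None:
    """Hypothetical stand-in for the engine-side abort that frees KV blocks."""
    aborted.append(request_id)


async def toy_generate(request_id: str):
    """Stand-in engine stream: keeps producing tokens until its consumer stops."""
    n = 0
    while True:
        await asyncio.sleep(0.01)
        yield f"tok{n}"
        n += 1


async def stream_to_client(request_id: str, sent: list, first_token: asyncio.Event):
    """Toy handler: forwards tokens; if cancelled (client gone), aborts the request."""
    try:
        async for tok in toy_generate(request_id):
            sent.append(tok)
            first_token.set()
    except asyncio.CancelledError:
        abort_request(request_id)  # without this, the request would keep its resources
        raise


async def main() -> list:
    sent: list = []
    first_token = asyncio.Event()
    task = asyncio.create_task(stream_to_client("req-1", sent, first_token))
    await first_token.wait()  # the client received at least one token...
    task.cancel()             # ...then vanished mid-stream
    try:
        await task
    except asyncio.CancelledError:
        pass
    return sent


sent = asyncio.run(main())
print("sent:", sent, "aborted:", aborted)
```

Deleting the `abort_request` call in the `except` branch is the toy version of the slow leak: the stream stops, but nothing tells the engine the request is dead.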

Reflect

  • For each captured artifact, name the layer that produced it: the None role chunk (server's SSE framing), ' Bon' (engine token → lab-03 detokenizer → delta), the structured tool_calls (lab-01's parser), the 400 (validation — never reached the engine). If every artifact has an owner, the fog is gone.
  • The OpenAI contract returns arguments as a JSON string, not an object — and your lab-01 parser emitted dicts. Where must the re-serialization live, and why there? (The server's translation layer: the contract is the client's, the dict is internal. Translation owns format debts.)
  • What's the latency budget of the server layer itself? Measure: time-to-first-delta minus engine TTFT (from metrics) ≈ template render + validation + HTTP. If that gap grows under load, you're CPU-bound in the front door, a real failure mode (event-loop starvation) that no GPU dashboard will show you.
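Two of the reflections above reduce to a few lines each. The sketch below is illustrative only: `to_openai_tool_call`, `time_to_first_item`, and `front_door_overhead` are hypothetical helpers, the `call_0` id is made up, and the slow generator stands in for a real streamed response.

```python
import json
import time


def to_openai_tool_call(name: str, arguments: dict, call_id: str) -> dict:
    """Translation-layer step: internal dict -> contract shape (arguments as a JSON string)."""
    return {
        "id": call_id,
        "type": "function",
        "function": {"name": name, "arguments": json.dumps(arguments)},
    }


def time_to_first_item(stream):
    """Seconds from the call until the stream yields its first item, or None if empty."""
    start = time.perf_counter()
    for _ in stream:
        return time.perf_counter() - start
    return None


def front_door_overhead(client_ttfd: float, engine_ttft: float) -> float:
    """Server-layer share: client time-to-first-delta minus engine TTFT."""
    return client_ttfd - engine_ttft


def slow_stream():
    """Stand-in for a streamed response: the first item arrives after a short delay."""
    time.sleep(0.02)
    yield " Bon"
    yield "jour"


wire = to_openai_tool_call("get_weather", {"city": "Paris"}, "call_0")
print(wire["function"]["arguments"])              # a str, ready for the wire
print(json.loads(wire["function"]["arguments"]))  # the client decodes it back to a dict

ttfd = time_to_first_item(slow_stream())
print(f"time to first delta: {ttfd:.3f}s")
```

Against a live server, `time_to_first_item` would wrap the `stream` object from step 2, and the engine TTFT fed to `front_door_overhead` would come from the server's metrics.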

References

  • upstream/vllm/entrypoints/openai/serving_chat.py — the file this lab makes readable.
  • upstream/vllm/v1/engine/async_llm.py — the engine's async face.
  • vLLM docs, OpenAI-Compatible Server — endpoints, flags, template overrides: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
  • Labs 01 and 03 — the two pipeline stages this server composes; Phase 1 lab-02 — the engine loop at the bottom of the stack.