Phase 16 — Serving APIs & Parsers
← Phase 15 · Course home · Phase 17 →
Contents
- Don't Panic
- Why this phase matters
- What you'll learn
- The map: where this lives in the real code
- Labs in this phase
- How to work this phase
- Where you are
Don't Panic
Almost no one calls vLLM in Python in production — they hit its HTTP server, which speaks the OpenAI API (and the Anthropic Messages API, and gRPC). On top of raw generation it adds chat templating, streaming (SSE), tool calling, and reasoning parsers. This phase is the front door everyone actually uses.
Why this phase matters
The API server is where correctness meets the real world: streaming semantics, tool-call extraction, error handling, and OpenAI compatibility quirks. Tool/reasoning parsers are a frequent contribution area and a place small bugs cause big incidents.
What you'll learn
- The OpenAI-compatible server: /v1/chat/completions, /v1/completions, /v1/embeddings
- Chat templates and how messages become a token prompt
- Streaming via Server-Sent Events; delta semantics
- Tool/function calling: schema in, tool_calls out; the tool-call parsers
- Reasoning parsers (separating chain-of-thought from the answer)
- Anthropic Messages API and gRPC front-ends
The map: where this lives in the real code
Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see
UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md)
walks through the important ones line by line.
vllm/entrypoints/openai/api_server.py— The FastAPI app + routes.vllm/entrypoints/openai/serving_chat.py— Chat completions: templating, streaming, tools.vllm/entrypoints/openai/tool_parsers/— Per-model tool-call parsers (the pluggable bit).vllm/entrypoints/openai/reasoning_parsers/— Reasoning/think-tag parsers.vllm/entrypoints/— Look for the Anthropic Messages + gRPC entrypoints.
Labs in this phase
- lab-01-tool-call-parser
[CPU-OK]— batch + streaming tool-call parsing with the hold-back discipline (half-tags never leak, false alarms release), proven chunking-invariant by fuzz. - lab-02-openai-server-smoke
[GPU-OPT]—vllm serve+ the OpenAI client end to end, then the source trace through serving_chat: every response artifact assigned to its layer. Captured output included. - lab-03-streaming-detokenizer
[CPU-OK]— the byte boundary: an incremental detokenizer that never emits broken UTF-8 (🚀 = three silences and a rocket), with the naive per-token decoder kept as a failing control.
See labs/README.md for the recommended order (03 → 01 → 02) and how to run them.
How to work this phase
- Read this guide for intuition.
- Read 01-deep-dive.md with the
upstream/files open. - Do 02-mini-build.md — build the
mini_vllmpiece yourself. - Run the labs, then attempt EXERCISES.md.
- Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.
Where you are
This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully-worked, line-by-line treatment (with starter/ solution/test code in every lab) follows the gold-standard set by the flagship phases — Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.
← Phase 15 · Course home · Phase 17 →