Phase 16 — Serving APIs & Parsers

← Phase 15 · Course home · Phase 17 →

Don't Panic
Why this phase matters
What you'll learn
The map: where this lives in the real code
Labs in this phase
How to work this phase
Where you are

Don't Panic

Almost no one calls vLLM in Python in production — they hit its HTTP server, which speaks the OpenAI API (and the Anthropic Messages API, and gRPC). On top of raw generation it adds chat templating, streaming (SSE), tool calling, and reasoning parsers. This phase is the front door everyone actually uses.

Why this phase matters

The API server is where correctness meets the real world: streaming semantics, tool-call extraction, error handling, and OpenAI compatibility quirks. Tool/reasoning parsers are a frequent contribution area and a place small bugs cause big incidents.

What you'll learn

The OpenAI-compatible server: /v1/chat/completions, /v1/completions, /v1/embeddings
Chat templates and how messages become a token prompt
Streaming via Server-Sent Events; delta semantics
Tool/function calling: schema in, tool_calls out; the tool-call parsers
Reasoning parsers (separating chain-of-thought from the answer)
Anthropic Messages API and gRPC front-ends

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

vllm/entrypoints/openai/api_server.py — The FastAPI app + routes.
vllm/entrypoints/openai/serving_chat.py — Chat completions: templating, streaming, tools.
vllm/entrypoints/openai/tool_parsers/ — Per-model tool-call parsers (the pluggable bit).
vllm/entrypoints/openai/reasoning_parsers/ — Reasoning/think-tag parsers.
vllm/entrypoints/ — Look for the Anthropic Messages + gRPC entrypoints.

Labs in this phase

lab-01-tool-call-parser [CPU-OK] — batch + streaming tool-call parsing with the hold-back discipline (half-tags never leak, false alarms release), proven chunking-invariant by fuzz.
lab-02-openai-server-smoke [GPU-OPT] — vllm serve + the OpenAI client end to end, then the source trace through serving_chat: every response artifact assigned to its layer. Captured output included.
lab-03-streaming-detokenizer [CPU-OK] — the byte boundary: an incremental detokenizer that never emits broken UTF-8 (🚀 = three silences and a rocket), with the naive per-token decoder kept as a failing control.

See labs/README.md for the recommended order (03 → 01 → 02) and how to run them.

How to work this phase

Read this guide for intuition.
Read 01-deep-dive.md with the upstream/ files open.
Do 02-mini-build.md — build the mini_vllm piece yourself.
Run the labs, then attempt EXERCISES.md.
Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Where you are

This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully-worked, line-by-line treatment (with starter/ solution/test code in every lab) follows the gold-standard set by the flagship phases — Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.