Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 16 — Serving APIs & Parsers

Phase 15 · Course home · Phase 17

Contents


Don't Panic

Almost no one calls vLLM in Python in production — they hit its HTTP server, which speaks the OpenAI API (and the Anthropic Messages API, and gRPC). On top of raw generation it adds chat templating, streaming (SSE), tool calling, and reasoning parsers. This phase is the front door everyone actually uses.

Why this phase matters

The API server is where correctness meets the real world: streaming semantics, tool-call extraction, error handling, and OpenAI compatibility quirks. Tool/reasoning parsers are a frequent contribution area and a place small bugs cause big incidents.

What you'll learn

  • The OpenAI-compatible server: /v1/chat/completions, /v1/completions, /v1/embeddings
  • Chat templates and how messages become a token prompt
  • Streaming via Server-Sent Events; delta semantics
  • Tool/function calling: schema in, tool_calls out; the tool-call parsers
  • Reasoning parsers (separating chain-of-thought from the answer)
  • Anthropic Messages API and gRPC front-ends

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

Labs in this phase

  • lab-01-tool-call-parser [CPU-OK] — batch + streaming tool-call parsing with the hold-back discipline (half-tags never leak, false alarms release), proven chunking-invariant by fuzz.
  • lab-02-openai-server-smoke [GPU-OPT]vllm serve + the OpenAI client end to end, then the source trace through serving_chat: every response artifact assigned to its layer. Captured output included.
  • lab-03-streaming-detokenizer [CPU-OK] — the byte boundary: an incremental detokenizer that never emits broken UTF-8 (🚀 = three silences and a rocket), with the naive per-token decoder kept as a failing control.

See labs/README.md for the recommended order (03 → 01 → 02) and how to run them.

How to work this phase

  1. Read this guide for intuition.
  2. Read 01-deep-dive.md with the upstream/ files open.
  3. Do 02-mini-build.md — build the mini_vllm piece yourself.
  4. Run the labs, then attempt EXERCISES.md.
  5. Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Where you are

This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully-worked, line-by-line treatment (with starter/ solution/test code in every lab) follows the gold-standard set by the flagship phases — Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.

Phase 15 · Course home · Phase 17