Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 16 Labs — Serving APIs & Parsers

Three labs on the front door: the layer that turns an inference engine into an API product. The arc: parse tool calls out of a token stream, streaming-safely (lab-01), go a level down to the byte boundary — the detokenizer that never emits broken UTF-8 (lab-03), then run the whole doorvllm serve, the OpenAI client, and a source trace that assigns every response artifact to its layer (lab-02).

Recommended order: 03 → 01 → 02 (bytes, then tags, then the server that composes both). CPU labs follow the standard contract — starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-16-serving-apis-and-parsers/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q

Contents


Labs

lab-01-tool-call-parser [CPU-OK]

Batch and streaming parsers for <tool_call> blocks, with the discipline that defines the streaming one: hold back any trailing text that might still become a tag, release it on false alarms, never leak half-tags to the user. Proven chunking-invariant by a 50-random-slicings fuzz. Skills: chunking invariance as the incremental parser's contract; hold-back buffers; per-model conventions as trained-in templates; loud failure for malformed calls.

lab-02-openai-server-smoke [GPU-OPT]

vllm serve + the OpenAI client: streamed deltas, a structured tool call, deliberate 400s, and a mid-stream disconnect (watch the abort free its KV). Then the source trace — route → validation → chat template → AsyncLLM.generate → the detokenizer/parser pipeline → SSE — with the framing question per leg: translation or inference? Annotated capture included. Skills: the server as translator; chat templates as derived state; finish_reason: "tool_calls"; front-door latency as its own budget.

lab-03-streaming-detokenizer [CPU-OK]

The byte boundary: 🚀 is four byte-tokens, and per-token decoding emits garbage three times — build the incremental detokenizer that emits only complete UTF-8 characters (lead-byte arithmetic, hold the tail, honest on real truncation), with the naive approach kept as a failing control. Skills: the emit-eagerly-but-never-emit-what- might-change pattern (third appearance); why English-only testing is a blind spot; where character responsibility ends and grapheme rendering begins.

What you can do after this phase

Trace any API response artifact to the layer that produced it; pair models with their tool parsers and chat templates deliberately; build streaming text pipelines out of composable hold-back buffers (detokenize → stop-match → tag-parse) and test them with chunking fuzzes; and read vllm/entrypoints/openai/ as a translation layer over the engine you already know down to its counters. Phase 17 goes the other direction from the front door — down to the hardware the engine runs on.