Phase 16 Labs — Serving APIs & Parsers
Three labs on the front door: the layer that turns an inference engine into an API
product. The arc: parse tool calls out of a token stream, streaming-safely
(lab-01), go a level down to the byte boundary — the detokenizer that never emits
broken UTF-8 (lab-03), then run the whole door — vllm serve, the OpenAI client,
and a source trace that assigns every response artifact to its layer (lab-02).
Recommended order: 03 → 01 → 02 (bytes, then tags, then the server that composes
both). CPU labs follow the standard contract — starter.py (your work),
solution.py (reference), test_lab.py (the spec); default runs the solution,
LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-16-serving-apis-and-parsers/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q
Contents
- lab-01-tool-call-parser
[CPU-OK] - lab-02-openai-server-smoke
[GPU-OPT] - lab-03-streaming-detokenizer
[CPU-OK] - What you can do after this phase
Labs
lab-01-tool-call-parser [CPU-OK]
Batch and streaming parsers for <tool_call> blocks, with the discipline that defines
the streaming one: hold back any trailing text that might still become a tag, release
it on false alarms, never leak half-tags to the user. Proven chunking-invariant by a
50-random-slicings fuzz. Skills: chunking invariance as the incremental parser's
contract; hold-back buffers; per-model conventions as trained-in templates; loud
failure for malformed calls.
lab-02-openai-server-smoke [GPU-OPT]
vllm serve + the OpenAI client: streamed deltas, a structured tool call,
deliberate 400s, and a mid-stream disconnect (watch the abort free its KV). Then the
source trace — route → validation → chat template → AsyncLLM.generate → the
detokenizer/parser pipeline → SSE — with the framing question per leg: translation or
inference? Annotated capture included. Skills: the server as translator; chat
templates as derived state; finish_reason: "tool_calls"; front-door latency as its
own budget.
lab-03-streaming-detokenizer [CPU-OK]
The byte boundary: 🚀 is four byte-tokens, and per-token decoding emits garbage three
times — build the incremental detokenizer that emits only complete UTF-8 characters
(lead-byte arithmetic, hold the tail, honest � on real truncation), with the naive
approach kept as a failing control. Skills: the emit-eagerly-but-never-emit-what-
might-change pattern (third appearance); why English-only testing is a blind spot;
where character responsibility ends and grapheme rendering begins.
What you can do after this phase
Trace any API response artifact to the layer that produced it; pair models with
their tool parsers and chat templates deliberately; build streaming text pipelines
out of composable hold-back buffers (detokenize → stop-match → tag-parse) and test
them with chunking fuzzes; and read vllm/entrypoints/openai/ as a translation
layer over the engine you already know down to its counters. Phase 17 goes the other
direction from the front door — down to the hardware the engine runs on.