Lab 16-01 — Tool-Call Parsing: Structure Out of a Token Stream `[CPU-OK]`

A tool-calling model doesn't emit function calls — it emits text that describes function calls (<tool_call>{"name": …}</tool_call> for Hermes-style models; [TOOL_CALLS] for Mistral; a dozen other conventions). The server's job is to turn that text into the OpenAI response's structured tool_calls field — and to do it while streaming, over chunks that can split the tag or the JSON anywhere. The batch parser is twenty easy lines; the streaming parser is where every real bug in vLLM's tool_parsers/ directory lives, and its central discipline is the lab's takeaway: hold back any text that might still become a tag — emit "Sure. " immediately, but keep "<tool" buffered until the next chunk says whether it's a tool call or the user's <today>.

Why this lab exists
Background: the two parsers
Files
Run
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

Tool calling is the load-bearing feature of the agent era, and its serving-side reality is unglamorous: per-model text conventions, parsed incrementally, under the OpenAI API's streaming contract (content deltas must flow immediately; tool calls must arrive structured). vLLM ships ~20 parser plugins (upstream/vllm/entrypoints/ openai/tool_parsers/) that all solve this lab with different tag conventions — and their bug tracker is a museum of exactly the cases this lab's tests pin: tags split across chunks leaking half-tags into chat UIs, held-back text swallowed forever on false alarms, malformed JSON crashing streams instead of failing requests.

The streaming-equals-batch fuzz test is the lab's methodological gift: 50 random chunkings of the same text, all required to reassemble to the batch parse. Chunking invariance is the property every incremental parser owes, and randomized chunk boundaries are how you test it — the same move as Phase 8 lab-03's distributional oracle, applied to parsing.

Background: the two parsers

Batch (parse_tool_calls): scan for OPEN…CLOSE blocks, JSON-parse each, return (remaining content, calls). Malformed JSON raises — a call the executor can't parse must 4xx at the server, not detonate downstream (the loud-failure habit from Phase 14 lab-03).

Streaming (StreamingToolParser): a buffer and one bit of state (in_block). Outside a block, emit text eagerly except the longest trailing proper-prefix of OPEN — the hold-back. Inside, buffer silently until CLOSE (partial JSON is never parseable, so nothing useful can be emitted early), then parse and emit the call. finish() flushes held text and makes an unterminated block loud — the finish_reason: "length" interaction from Phase 12 lab-02, parser edition: a stream truncated mid-call is an error, not a tool call.

Files

starter.py — parse_tool_calls and StreamingToolParser (feed/finish). Your work.
solution.py — reference (note _trailing_tag_prefix: the hold-back, isolated).
test_lab.py — batch semantics, the 50-chunking fuzz, the split-tag leak test, the false-alarm release, and the unterminated-block failure.

Run

LAB_IMPL=starter pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q
pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q   # reference

What the tests prove

Test	What it pins
`test_batch_parse` / `test_multiple_calls_in_order`	The structured extraction, content preserved around it, order kept
`test_malformed_json_is_loud`	Garbage in a block raises — the server's chance to fail the request instead of the agent loop
`test_streaming_equals_batch_for_any_chunking`	Chunking invariance, 50 random slicings — the incremental parser's defining property
`test_tag_split_across_chunks_is_not_leaked`	`"Sure. <tool"` emits `"Sure. "` and holds `"<tool"` — half-tags never reach the user (the chat-UI-shows-`<tool` bug, prevented)
`test_false_alarm_prefix_is_released`	`"<to"` + `"day>"` → `"<today>"` emitted intact — held-back is not swallowed (the opposite bug, equally real)
`test_unterminated_block_fails_at_finish`	Truncation inside a call is an error, matching the Phase 12 hygiene rule

Hitchhiker's notes

Why per-model parsers at all? The tag convention is trained into each model (Hermes, Mistral, Llama, Qwen each render tool calls differently in their chat templates), so the parser must match the template — --tool-call-parser hermes pairs with the model the same way Phase 14's mapping table pairs with a checkpoint. Mismatched parser ⇒ tool calls stream as visible text: instantly recognizable once you've done this lab.
The OpenAI streaming contract adds a layer your events map onto: tool-call deltas (tool_calls[i].function.arguments streamed as JSON fragments). Real parsers emit partial-argument deltas for responsiveness — which requires incremental JSON parsing too (is this string complete? is the brace balanced?). Your buffer-until-close design is the correctness-first version; the delta-streaming upgrade is the going-further.
Constrained decoding (Phase 12) and parsing are complements, not rivals: the grammar mask can guarantee the model emits well-formed <tool_call> JSON (vLLM's tool-choice enforcement does exactly this), and the parser still must extract it from the stream. Guarantee the syntax, then parse it — belt and suspenders, both load-bearing.
The hold-back has a latency cost: a trailing < waits one chunk before display. Imperceptible — but the general trade (display latency vs structural certainty) recurs in stop-string handling (Phase 1 lab-05's straddle problem) and reasoning-tag parsers. Same buffer discipline everywhere; vLLM's detokenizer and parsers share it.

Going further

Add streaming argument deltas: inside a block, emit ("args_delta", fragment) events for completed JSON string portions — you'll need a brace/quote tracker (a mini Phase 12 lab-03 machine), and you'll understand why upstream parsers carry exactly one.
Implement a second convention (Mistral's [TOOL_CALLS][{...}]) behind the same event interface, and a get_parser(name) registry — the plugin shape of tool_parsers/, reproduced.
Property-test with adversarial content: tool-call JSON whose string values contain </tool_call>. Your parser breaks (find -> escape-aware scanning). Upstream's do too, mostly — models are trained not to emit this, which is a contract worth knowing is social, not technical.

References

upstream/vllm/entrypoints/openai/tool_parsers/ — the plugin zoo; hermes_tool_parser.py is your lab with delta streaming.
vLLM docs, Tool Calling — parser selection and --enable-auto-tool-choice: https://docs.vllm.ai/en/latest/features/tool_calling/
OpenAI API reference, function calling & streaming — the contract being satisfied: https://platform.openai.com/docs/guides/function-calling
Phase 12 — the masks that can guarantee what this lab parses; lab-03 — the same buffering discipline one level down, at the byte boundary.

vLLM Mastery — From Zero to Maintainer