Lab 16-01 — Tool-Call Parsing: Structure Out of a Token Stream [CPU-OK]
A tool-calling model doesn't emit function calls — it emits text that describes
function calls (<tool_call>{"name": …}</tool_call> for Hermes-style models;
[TOOL_CALLS] for Mistral; a dozen other conventions). The server's job is to turn
that text into the OpenAI response's structured tool_calls field — and to do it
while streaming, over chunks that can split the tag or the JSON anywhere. The
batch parser is twenty easy lines; the streaming parser is where every real bug in
vLLM's tool_parsers/ directory lives, and its central discipline is the lab's
takeaway: hold back any text that might still become a tag — emit "Sure. "
immediately, but keep "<tool" buffered until the next chunk says whether it's a
tool call or the user's <today>.
Contents
- Why this lab exists
- Background: the two parsers
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Tool calling is the load-bearing feature of the agent era, and its serving-side
reality is unglamorous: per-model text conventions, parsed incrementally, under the
OpenAI API's streaming contract (content deltas must flow immediately; tool calls
must arrive structured). vLLM ships ~20 parser plugins
(upstream/vllm/entrypoints/openai/tool_parsers/) that all solve this lab with
different tag conventions, and their bug tracker is a museum of exactly the cases
this lab's tests pin: tags split across chunks leaking half-tags into chat UIs,
held-back text swallowed forever on false alarms, malformed JSON crashing streams
instead of failing requests.
The streaming-equals-batch fuzz test is the lab's methodological gift: 50 random chunkings of the same text, all required to reassemble to the batch parse. Chunking invariance is the property every incremental parser owes, and randomized chunk boundaries are how you test it — the same move as Phase 8 lab-03's distributional oracle, applied to parsing.
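The fuzz idea can be sketched independently of any tool-call parser. The harness below is a minimal illustration, not the lab's test_lab.py: random_chunks, check_chunking_invariance, and the toy word splitter are all hypothetical names chosen for this example. The toy splitter stands in for a parser because it has the same hazard, a token cut in half by a chunk boundary.

```python
import random

def random_chunks(text, rng):
    """Split text at random boundaries into non-empty chunks that rejoin to text."""
    if not text:
        return []
    n_cuts = rng.randint(0, len(text) - 1)
    cuts = sorted(rng.sample(range(1, len(text)), n_cuts))
    bounds = [0] + cuts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

def check_chunking_invariance(batch_fn, stream_fn, text, trials=50, seed=0):
    """Assert stream_fn(chunks) == batch_fn(text) for `trials` random chunkings."""
    rng = random.Random(seed)  # seeded so a failure is reproducible
    expected = batch_fn(text)
    for _ in range(trials):
        chunks = random_chunks(text, rng)
        got = stream_fn(chunks)
        assert got == expected, f"chunking {chunks!r} gave {got!r}"
    return True

# Toy stand-ins (hypothetical): split on single spaces, batch vs incremental.
def batch_words(text):
    return text.split(" ")

def stream_words(chunks):
    words, pending = [], ""
    for chunk in chunks:
        pending += chunk
        *done, pending = pending.split(" ")  # last piece may still grow
        words.extend(done)
    words.append(pending)  # flush at end of stream
    return words

check_chunking_invariance(batch_words, stream_words, "alpha beta  gamma delta")
```

The pending variable plays the role of the hold-back: the last piece of each chunk is withheld because the next chunk may extend it. Seeding the generator keeps a failing chunking reproducible.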
Background: the two parsers
Batch (parse_tool_calls): scan for OPEN…CLOSE blocks, JSON-parse each,
return (remaining content, calls). Malformed JSON raises — a call the executor can't
parse must 4xx at the server, not detonate downstream (the loud-failure habit from
Phase 14 lab-03).
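A minimal sketch of the batch scan described above, assuming the Hermes-style tags from the introduction. It is not the reference solution.py. Two details are assumptions made for this illustration: the "arguments" key in the example payload, and raising on an unterminated block in batch mode.

```python
import json

OPEN, CLOSE = "<tool_call>", "</tool_call>"

def parse_tool_calls(text):
    """Return (content with tool-call blocks removed, list of parsed calls)."""
    content, calls, pos = [], [], 0
    while True:
        start = text.find(OPEN, pos)
        if start == -1:
            content.append(text[pos:])  # no more blocks: the rest is content
            break
        end = text.find(CLOSE, start + len(OPEN))
        if end == -1:
            # Assumption: a block that never closes is treated as an error.
            raise ValueError("unterminated tool-call block")
        content.append(text[pos:start])
        # json.loads raises json.JSONDecodeError (a ValueError) on garbage.
        calls.append(json.loads(text[start + len(OPEN):end]))
        pos = end + len(CLOSE)
    return "".join(content), calls

text = 'Sure. <tool_call>{"name": "get_weather", "arguments": {"city": "Oslo"}}</tool_call> Done.'
content, calls = parse_tool_calls(text)
```

Here content keeps the text on both sides of the block and calls holds one dict per block, in order of appearance.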
Streaming (StreamingToolParser): a buffer and one bit of state (in_block).
Outside a block, emit text eagerly except the longest trailing proper-prefix of
OPEN — the hold-back. Inside, buffer silently until CLOSE (partial JSON is never
parseable, so nothing useful can be emitted early), then parse and emit the call.
finish() flushes held text and makes an unterminated block loud — the
finish_reason: "length" interaction from Phase 12 lab-02, parser edition: a stream
truncated mid-call is an error, not a tool call.
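The buffer-plus-one-bit design can be sketched as follows. This is an illustrative version under stated assumptions, not the reference solution: the event tuples ("content", str) and ("tool_call", dict), the helper name held_back_len, and the ValueError raised by finish() are all choices made for this example.

```python
import json

OPEN, CLOSE = "<tool_call>", "</tool_call>"

def held_back_len(buf, tag=OPEN):
    """Length of the longest suffix of buf that is a proper prefix of tag."""
    for k in range(min(len(buf), len(tag) - 1), 0, -1):
        if buf.endswith(tag[:k]):
            return k
    return 0

class StreamingToolParser:
    def __init__(self):
        self.buf = ""
        self.in_block = False

    def feed(self, chunk):
        """Consume one chunk and return the events it completes."""
        self.buf += chunk
        events = []
        while True:
            if self.in_block:
                end = self.buf.find(CLOSE)
                if end == -1:
                    break  # buffer silently until CLOSE arrives
                events.append(("tool_call", json.loads(self.buf[:end])))
                self.buf = self.buf[end + len(CLOSE):]
                self.in_block = False
            else:
                start = self.buf.find(OPEN)
                if start != -1:
                    if start:
                        events.append(("content", self.buf[:start]))
                    self.buf = self.buf[start + len(OPEN):]
                    self.in_block = True
                    continue
                keep = held_back_len(self.buf)  # the hold-back
                safe = self.buf[:len(self.buf) - keep]
                if safe:
                    events.append(("content", safe))
                self.buf = self.buf[len(self.buf) - keep:]
                break
        return events

    def finish(self):
        """Flush held text; an unterminated block is an error."""
        if self.in_block:
            raise ValueError("stream ended inside a tool-call block")
        held, self.buf = self.buf, ""
        return [("content", held)] if held else []
```

Content events may be fragmented differently from one chunking to the next, so a chunking-invariance check should compare the concatenated content and the list of calls rather than the raw event list.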
Files
- starter.py: parse_tool_calls and StreamingToolParser (feed/finish). Your work.
- solution.py: reference (note _trailing_tag_prefix: the hold-back, isolated).
- test_lab.py: batch semantics, the 50-chunking fuzz, the split-tag leak test, the false-alarm release, and the unterminated-block failure.
Run
LAB_IMPL=starter pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q
pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_batch_parse / test_multiple_calls_in_order | The structured extraction, content preserved around it, order kept |
| test_malformed_json_is_loud | Garbage in a block raises: the server's chance to fail the request instead of the agent loop |
| test_streaming_equals_batch_for_any_chunking | Chunking invariance, 50 random slicings: the incremental parser's defining property |
| test_tag_split_across_chunks_is_not_leaked | "Sure. <tool" emits "Sure. " and holds "<tool"; half-tags never reach the user (the chat-UI-shows-<tool bug, prevented) |
| test_false_alarm_prefix_is_released | "<to" + "day>" → "<today>" emitted intact; held-back text is not swallowed (the opposite bug, equally real) |
| test_unterminated_block_fails_at_finish | Truncation inside a call is an error, matching the Phase 12 hygiene rule |
Hitchhiker's notes
- Why per-model parsers at all? The tag convention is trained into each model
(Hermes, Mistral, Llama, Qwen each render tool calls differently in their chat
templates), so the parser must match the template: --tool-call-parser hermes
pairs with the model the same way Phase 14's mapping table pairs with a
checkpoint. Mismatched parser ⇒ tool calls stream as visible text: instantly
recognizable once you've done this lab.
- The OpenAI streaming contract adds a layer your events map onto: tool-call
deltas (tool_calls[i].function.arguments streamed as JSON fragments). Real
parsers emit partial-argument deltas for responsiveness, which requires
incremental JSON parsing too (is this string complete? is the brace balanced?).
Your buffer-until-close design is the correctness-first version; the
delta-streaming upgrade is the going-further.
- Constrained decoding (Phase 12) and parsing are complements, not rivals: the
grammar mask can guarantee the model emits well-formed <tool_call> JSON (vLLM's
tool-choice enforcement does exactly this), and the parser still must extract it
from the stream. Guarantee the syntax, then parse it: belt and suspenders, both
load-bearing.
- The hold-back has a latency cost: a trailing "<" waits one chunk before display.
Imperceptible, but the general trade (display latency vs structural certainty)
recurs in stop-string handling (Phase 1 lab-05's straddle problem) and
reasoning-tag parsers. Same buffer discipline everywhere; vLLM's detokenizer and
parsers share it.
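To make the mapping onto the OpenAI streaming contract concrete, here is a hedged sketch that turns parser events into the delta objects carried by streamed chat-completion chunks. The event tuples, the function name to_openai_deltas, the placeholder ids, and the "arguments" key in the parsed call are assumptions for illustration. The delta field names (index, id, type, function.name, function.arguments as a JSON string) follow the OpenAI API's tool-call shape.

```python
import json

def to_openai_deltas(events):
    """Map ("content", str) / ("tool_call", dict) events to chunk deltas."""
    deltas, index = [], 0
    for kind, payload in events:
        if kind == "content":
            deltas.append({"content": payload})
        elif kind == "tool_call":
            deltas.append({
                "tool_calls": [{
                    "index": index,
                    "id": f"call_{index}",  # placeholder; real servers generate unique ids
                    "type": "function",
                    "function": {
                        "name": payload["name"],
                        # The API carries arguments as a JSON-encoded string.
                        "arguments": json.dumps(payload.get("arguments", {})),
                    },
                }]
            })
            index += 1
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return deltas

deltas = to_openai_deltas([
    ("content", "Sure. "),
    ("tool_call", {"name": "get_weather", "arguments": {"city": "Oslo"}}),
])
```

With buffer-until-close parsing, each call arrives as a single delta carrying the whole arguments string. A delta-streaming parser would instead send the name first and then several deltas with the same index, each appending a fragment to arguments.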
Going further
- Add streaming argument deltas: inside a block, emit ("args_delta", fragment)
events for completed JSON string portions. You'll need a brace/quote tracker (a
mini Phase 12 lab-03 machine), and you'll understand why upstream parsers carry
exactly one.
- Implement a second convention (Mistral's [TOOL_CALLS][{...}]) behind the same
event interface, and a get_parser(name) registry: the plugin shape of
tool_parsers/, reproduced.
- Property-test with adversarial content: tool-call JSON whose string values
contain </tool_call>. Your parser breaks, because a plain find stops at the first
match; the fix is escape-aware scanning. Upstream's parsers break too, mostly.
Models are trained not to emit this, which is a contract worth knowing is social,
not technical.
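For the adversarial case in the last item, one possible shape for an escape-aware scan is sketched below. The function name find_close_outside_strings and the example payload are hypothetical. The scanner tracks only JSON string and backslash-escape state, which is enough to skip a closing tag that appears inside a string value.

```python
import json

CLOSE = "</tool_call>"

def find_close_outside_strings(buf, close=CLOSE):
    """Index of the first `close` that is not inside a JSON string, or -1."""
    in_string = escaped = False
    for i, ch in enumerate(buf):
        if in_string:
            if escaped:
                escaped = False      # this character was escaped; skip it
            elif ch == "\\":
                escaped = True       # next character is escaped
            elif ch == '"':
                in_string = False    # unescaped quote ends the string
        elif ch == '"':
            in_string = True
        elif buf.startswith(close, i):
            return i
    return -1

# A string value that contains the closing tag and escaped quotes.
body = json.dumps({"name": "echo", "arguments": {"text": 'say </tool_call> "now"'}})
buf = body + CLOSE

naive = buf.find(CLOSE)                   # stops inside the string value
aware = find_close_outside_strings(buf)   # stops at the real closing tag
call = json.loads(buf[:aware])
```

In a streaming parser the scan would either keep its state between chunks or rescan the buffered block from the start on each feed. Rescanning is simpler but costs quadratic time on long arguments.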
References
- upstream/vllm/entrypoints/openai/tool_parsers/: the plugin zoo; hermes_tool_parser.py is your lab with delta streaming.
- vLLM docs, Tool Calling (parser selection and --enable-auto-tool-choice): https://docs.vllm.ai/en/latest/features/tool_calling/
- OpenAI API reference, function calling & streaming (the contract being satisfied): https://platform.openai.com/docs/guides/function-calling
- Phase 12: the masks that can guarantee what this lab parses; lab-03: the same buffering discipline one level down, at the byte boundary.