Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 16-01 — Tool-Call Parsing: Structure Out of a Token Stream [CPU-OK]

A tool-calling model doesn't emit function calls — it emits text that describes function calls (<tool_call>{"name": …}</tool_call> for Hermes-style models; [TOOL_CALLS] for Mistral; a dozen other conventions). The server's job is to turn that text into the OpenAI response's structured tool_calls field — and to do it while streaming, over chunks that can split the tag or the JSON anywhere. The batch parser is twenty easy lines; the streaming parser is where every real bug in vLLM's tool_parsers/ directory lives, and its central discipline is the lab's takeaway: hold back any text that might still become a tag — emit "Sure. " immediately, but keep "<tool" buffered until the next chunk says whether it's a tool call or the user's <today>.

Contents


Why this lab exists

Tool calling is the load-bearing feature of the agent era, and its serving-side reality is unglamorous: per-model text conventions, parsed incrementally, under the OpenAI API's streaming contract (content deltas must flow immediately; tool calls must arrive structured). vLLM ships ~20 parser plugins (upstream/vllm/entrypoints/ openai/tool_parsers/) that all solve this lab with different tag conventions — and their bug tracker is a museum of exactly the cases this lab's tests pin: tags split across chunks leaking half-tags into chat UIs, held-back text swallowed forever on false alarms, malformed JSON crashing streams instead of failing requests.

The streaming-equals-batch fuzz test is the lab's methodological gift: 50 random chunkings of the same text, all required to reassemble to the batch parse. Chunking invariance is the property every incremental parser owes, and randomized chunk boundaries are how you test it — the same move as Phase 8 lab-03's distributional oracle, applied to parsing.

Background: the two parsers

Batch (parse_tool_calls): scan for OPEN…CLOSE blocks, JSON-parse each, return (remaining content, calls). Malformed JSON raises — a call the executor can't parse must 4xx at the server, not detonate downstream (the loud-failure habit from Phase 14 lab-03).

Streaming (StreamingToolParser): a buffer and one bit of state (in_block). Outside a block, emit text eagerly except the longest trailing proper-prefix of OPEN — the hold-back. Inside, buffer silently until CLOSE (partial JSON is never parseable, so nothing useful can be emitted early), then parse and emit the call. finish() flushes held text and makes an unterminated block loud — the finish_reason: "length" interaction from Phase 12 lab-02, parser edition: a stream truncated mid-call is an error, not a tool call.

Files

  • starter.pyparse_tool_calls and StreamingToolParser (feed/finish). Your work.
  • solution.py — reference (note _trailing_tag_prefix: the hold-back, isolated).
  • test_lab.py — batch semantics, the 50-chunking fuzz, the split-tag leak test, the false-alarm release, and the unterminated-block failure.

Run

LAB_IMPL=starter pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q
pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q   # reference

What the tests prove

TestWhat it pins
test_batch_parse / test_multiple_calls_in_orderThe structured extraction, content preserved around it, order kept
test_malformed_json_is_loudGarbage in a block raises — the server's chance to fail the request instead of the agent loop
test_streaming_equals_batch_for_any_chunkingChunking invariance, 50 random slicings — the incremental parser's defining property
test_tag_split_across_chunks_is_not_leaked"Sure. <tool" emits "Sure. " and holds "<tool" — half-tags never reach the user (the chat-UI-shows-<tool bug, prevented)
test_false_alarm_prefix_is_released"<to" + "day>""<today>" emitted intact — held-back is not swallowed (the opposite bug, equally real)
test_unterminated_block_fails_at_finishTruncation inside a call is an error, matching the Phase 12 hygiene rule

Hitchhiker's notes

  • Why per-model parsers at all? The tag convention is trained into each model (Hermes, Mistral, Llama, Qwen each render tool calls differently in their chat templates), so the parser must match the template — --tool-call-parser hermes pairs with the model the same way Phase 14's mapping table pairs with a checkpoint. Mismatched parser ⇒ tool calls stream as visible text: instantly recognizable once you've done this lab.
  • The OpenAI streaming contract adds a layer your events map onto: tool-call deltas (tool_calls[i].function.arguments streamed as JSON fragments). Real parsers emit partial-argument deltas for responsiveness — which requires incremental JSON parsing too (is this string complete? is the brace balanced?). Your buffer-until-close design is the correctness-first version; the delta-streaming upgrade is the going-further.
  • Constrained decoding (Phase 12) and parsing are complements, not rivals: the grammar mask can guarantee the model emits well-formed <tool_call> JSON (vLLM's tool-choice enforcement does exactly this), and the parser still must extract it from the stream. Guarantee the syntax, then parse it — belt and suspenders, both load-bearing.
  • The hold-back has a latency cost: a trailing < waits one chunk before display. Imperceptible — but the general trade (display latency vs structural certainty) recurs in stop-string handling (Phase 1 lab-05's straddle problem) and reasoning-tag parsers. Same buffer discipline everywhere; vLLM's detokenizer and parsers share it.

Going further

  • Add streaming argument deltas: inside a block, emit ("args_delta", fragment) events for completed JSON string portions — you'll need a brace/quote tracker (a mini Phase 12 lab-03 machine), and you'll understand why upstream parsers carry exactly one.
  • Implement a second convention (Mistral's [TOOL_CALLS][{...}]) behind the same event interface, and a get_parser(name) registry — the plugin shape of tool_parsers/, reproduced.
  • Property-test with adversarial content: tool-call JSON whose string values contain </tool_call>. Your parser breaks (find -> escape-aware scanning). Upstream's do too, mostly — models are trained not to emit this, which is a contract worth knowing is social, not technical.

References