Phase 12 — Cheatsheet: Structured Outputs
The one-liner
Per step: grammar state → allowed-token bitmask → illegal logits = −inf → sample → advance state. Valid by construction; softmax renormalizes over legal tokens.
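The per-step loop can be sketched in plain Python. This is a toy illustration, not vLLM code: the three-token vocabulary, the `TRANSITIONS` table, and the function names are all assumptions made for the example.

```python
import math
import random

def constrained_step(logits, allowed, rng):
    """Mask illegal tokens to -inf, softmax over the rest, sample one token id."""
    masked = [x if ok else -math.inf for x, ok in zip(logits, allowed)]
    peak = max(masked)  # finite as long as at least one token is legal
    weights = [math.exp(x - peak) for x in masked]  # exp(-inf) == 0.0
    total = sum(weights)
    probs = [w / total for w in weights]  # renormalized over legal tokens only
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, probs

# Hypothetical grammar accepting exactly "a b <eos>": state -> {token id: next state}.
VOCAB = ["a", "b", "<eos>"]
TRANSITIONS = {0: {0: 1}, 1: {1: 2}, 2: {2: 3}}  # state 3 means finished

def generate(logits_per_step, seed=0):
    rng = random.Random(seed)
    state, out = 0, []
    for logits in logits_per_step:
        legal = TRANSITIONS[state]
        allowed = [t in legal for t in range(len(VOCAB))]  # the "bitmask"
        token, _ = constrained_step(logits, allowed, rng)
        out.append(VOCAB[token])
        state = legal[token]  # advance the grammar state
        if state == 3:
            break
    return out
```

Even when the raw logits favour other tokens, `generate` can only return `["a", "b", "<eos>"]` here, because every illegal token ends up with probability exactly zero.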
The pipeline
StructuredOutputsParams (one of json/regex/choice/grammar/json_object/structural_tag) →
grammar_init compiles async (request unschedulable until Future resolves) →
scheduler get_grammar_bitmask per step → manager fills rows → numpy → runner
apply_grammar_bitmask reorders to batch order + fused −inf kernel → sample →
accept_tokens advances.
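The "unschedulable until the Future resolves" step can be pictured with the standard library alone. The sketch below is an assumption-laden toy: `compile_spec`, `ToyRequest`, and the `gate` event are hypothetical names, and the event merely stands in for a slow grammar compile.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def compile_spec(spec, gate):
    """Pretend to compile a grammar; blocks until the test opens the gate."""
    gate.wait()
    return f"compiled:{spec}"

class ToyRequest:
    def __init__(self, spec, executor, gate):
        # Compilation starts immediately but runs off the scheduling thread.
        self.future = executor.submit(compile_spec, spec, gate)

    @property
    def schedulable(self):
        # A scheduler would skip this request while the grammar is compiling.
        return self.future.done()

def demo():
    gate = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as pool:
        req = ToyRequest("regex:[0-9]+", pool, gate)
        before = req.schedulable          # still compiling
        gate.set()                        # let the compile finish
        grammar = req.future.result(timeout=5)
        after = req.schedulable
    return before, grammar, after
```

The point of the design is that other requests keep decoding while one request waits on its compile; only the waiting request pays the latency.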
Machines
- regex → FSM · JSON/EBNF → pushdown (stack) · JSON Schema → compiled to grammar.
- Char-rules lifted to tokens at compile time (xgrammar token classification) → packed bitmask, vocab/32 int32 words; runtime checks only context-dependent stragglers.
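The packed layout is easy to check by hand. A minimal sketch, assuming one bit per token with the least significant bit first inside each 32-bit word (the bit order is an assumption for illustration, as are the helper names):

```python
def pack_bitmask(allowed_ids, vocab_size):
    """Pack a set of allowed token ids into ceil(vocab_size / 32) 32-bit words."""
    n_words = (vocab_size + 31) // 32
    words = [0] * n_words
    for token_id in allowed_ids:
        words[token_id // 32] |= 1 << (token_id % 32)
    # Python ints are unbounded; a real int32 buffer would show bit 31 as a sign bit.
    return words

def is_allowed(words, token_id):
    """Test the single bit that belongs to token_id."""
    return (words[token_id // 32] >> (token_id % 32)) & 1 == 1
```

For a 128k vocabulary this is 4,000 words per row, which is why a whole batch of rows is cheap to hand from CPU to GPU.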
Performance model
- Compile: once per distinct (type, spec) key; 10s–100s ms for big schemas; hits first request's TTFT only.
- Per step: bitmask fill (CPU, parallelized above a batch threshold) + one fused mask kernel; low single-digit % overhead steady-state.
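Why compile cost lands only on the first request follows from caching on the (type, spec) key. A minimal sketch, with `GrammarCache` and `fake_compile` as hypothetical names rather than anything from vLLM or xgrammar:

```python
class GrammarCache:
    """Memoize compiled grammars by (kind, spec); spec must be hashable, e.g. a string."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._cache = {}
        self.compiles = 0  # counts cache misses, i.e. real compilations

    def get(self, kind, spec):
        key = (kind, spec)
        if key not in self._cache:
            self.compiles += 1
            self._cache[key] = self._compile(kind, spec)
        return self._cache[key]

def fake_compile(kind, spec):
    # Stand-in for an expensive schema-to-grammar compile.
    return f"{kind}<{spec}>"
```

Requests that reuse the same schema string hit the cache; a schema that differs by even one character is a new key and pays the compile again.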
Spec-decode composition
- Bitmask rows = max_num_seqs × (1 + num_spec_tokens): one row per sampled position.
- Fill row i after tentatively accepting drafts < i; then rollback(advancements).
- Grammar interface has validate_tokens (check, no advance) + rollback(n); xgrammar matcher built with max_rollback_tokens=num_spec_tokens.
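The fill-then-rollback pattern can be sketched with a toy matcher. `ToyMatcher` and `fill_rows` are hypothetical; the sketch also assumes it is acceptable to stop at the first draft the grammar rejects, which is a simplification for illustration.

```python
class ToyMatcher:
    """Hypothetical stand-in for a grammar matcher that supports rollback."""

    def __init__(self, transitions, start=0):
        self.transitions = transitions  # state -> {token: next_state}
        self.history = [start]          # visited states, newest last

    def allowed(self):
        return set(self.transitions.get(self.history[-1], {}))

    def accept_token(self, token):
        nxt = self.transitions.get(self.history[-1], {}).get(token)
        if nxt is None:
            return False
        self.history.append(nxt)
        return True

    def rollback(self, n):
        if n < 0 or n >= len(self.history):
            raise ValueError("cannot roll back past the start state")
        if n:
            del self.history[-n:]

def fill_rows(matcher, draft_tokens):
    """Return one allowed-set per sampled position, leaving the matcher unchanged."""
    rows, advanced = [], 0
    for token in list(draft_tokens) + [None]:  # 1 + num drafts positions
        rows.append(matcher.allowed())         # row i sees drafts < i as accepted
        if token is None or not matcher.accept_token(token):
            break
        advanced += 1
    matcher.rollback(advanced)  # undo the tentative advances
    return rows
```

The real state only moves forward later, once the verifier has decided which drafts survive.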
Gotchas
- One backend per engine (xgrammar default; guidance/outlines/lm-format-enforcer), not per request.
- Reasoning models: constraint suspended during thinking (should_advance gate; mask row set all-ones) until the reasoning parser signals end.
- Valid ≠ true; and finish_reason="length" still truncates mid-structure, so budget max_tokens for the schema's worst case.
- Constrained + unconstrained drafter can tank spec acceptance: drafts get vetoed.
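The reasoning-model gotcha can be illustrated with a toy gate. This is a sketch under stated assumptions: the `</think>` marker, the vocabulary, and `run_with_reasoning_gate` are invented for the example and do not mirror any particular reasoning parser.

```python
END_OF_THINKING = "</think>"  # hypothetical end-of-reasoning marker

def run_with_reasoning_gate(tokens, transitions, vocab):
    """Replay tokens, recording the mask row that would apply at each step."""
    state, reasoning, masks = 0, True, []
    for token in tokens:
        if reasoning:
            masks.append([True] * len(vocab))  # all-ones row: nothing is masked
            if token == END_OF_THINKING:
                reasoning = False              # constraint starts on the next token
            continue                           # grammar state does not advance
        legal = transitions[state]
        masks.append([t in legal for t in vocab])
        state = legal[token]                   # KeyError here means an illegal token
    return state, masks
```

While the gate is closed the grammar state stays at its start, so the structured answer is parsed from the first token after the thinking section.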
Key upstream
- vllm/sampling_params.py:41 StructuredOutputsParams
- v1/structured_output/backend_types.py:31 Grammar :99 Backend (the contract)
- v1/structured_output/__init__.py:36 Manager :115 grammar_init :204 grammar_bitmask :322 should_advance
- v1/structured_output/backend_xgrammar.py:77 compile_grammar :132 XgrammarGrammar
- v1/structured_output/utils.py:44 apply_grammar_bitmask (runner side, gpu_model_runner.py:4359)
- v1/core/sched/scheduler.py:1259 get_grammar_bitmask
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md