Phase 12 — Cheatsheet: Structured Outputs

The one-liner

Per step: grammar state → allowed-token bitmask → illegal logits = −inf → sample → advance state. Valid by construction; softmax renormalizes over legal tokens.
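As an illustration of the loop above, here is a minimal pure-Python sketch. The four-token vocabulary, the toy "digits then `;`" grammar, and the helper names (`allowed_tokens`, `advance`, `generate`) are all hypothetical and exist only for this example; a real backend computes the mask from a compiled grammar.

```python
import math
import random

VOCAB = ["0", "1", ";", "x"]  # hypothetical toy vocabulary

def masked_softmax(logits, allowed):
    """Softmax after forcing disallowed logits to -inf."""
    masked = [x if ok else -math.inf for x, ok in zip(logits, allowed)]
    peak = max(masked)  # finite as long as at least one token is allowed
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

def allowed_tokens(state):
    """Toy grammar: one or more digits, then ';'. State 2 means finished."""
    if state == 0:
        return [True, True, False, False]   # must start with a digit
    if state == 1:
        return [True, True, True, False]    # more digits, or terminate
    return [False, False, False, False]

def advance(state, token_id):
    return 2 if VOCAB[token_id] == ";" else 1

def generate(logits_fn, rng, max_steps=8):
    state, out = 0, []
    for _ in range(max_steps):
        if state == 2:
            break
        probs = masked_softmax(logits_fn(out), allowed_tokens(state))
        token_id = rng.choices(range(len(VOCAB)), weights=probs)[0]
        out.append(VOCAB[token_id])
        state = advance(state, token_id)  # advance only after sampling
    return "".join(out)
```

Even when the model's raw logits overwhelmingly favor the illegal token `x`, its probability after masking is exactly zero, so it can never be sampled. Note that hitting `max_steps` can still cut the output off before the closing `;`.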

The pipeline

StructuredOutputsParams (one of json/regex/choice/grammar/json_object/structural_tag) → grammar_init compiles async (request unschedulable until Future resolves) → scheduler get_grammar_bitmask per step → manager fills rows → numpy → runner apply_grammar_bitmask reorders to batch order + fused −inf kernel → sample → accept_tokens advances.
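The "reorders to batch order" step can be pictured with a small sketch. It assumes the scheduler hands over one packed row per structured request in its own order, and that unconstrained requests in the batch get an all-ones row. The function name and argument layout are hypothetical and are not the signature of `apply_grammar_bitmask`.

```python
ALL_ONES_WORD = 0xFFFFFFFF  # every bit set: all 32 tokens of this word allowed

def reorder_to_batch(rows, structured_req_ids, batch_req_ids, num_words):
    """Scatter per-request bitmask rows into the runner's batch order.

    rows[i] belongs to structured_req_ids[i]; requests in the batch that
    have no grammar receive an all-ones (no-op) row.
    """
    row_of = {req_id: i for i, req_id in enumerate(structured_req_ids)}
    out = []
    for req_id in batch_req_ids:
        if req_id in row_of:
            out.append(list(rows[row_of[req_id]]))
        else:
            out.append([ALL_ONES_WORD] * num_words)
    return out
```

In real storage the words are signed int32, where the all-ones pattern is `-1`; the sketch uses unsigned Python ints for readability.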

Machines

  • regex → FSM · JSON/EBNF → pushdown (stack) · JSON Schema → compiled to grammar.
  • Char-rules lifted to tokens at compile time (xgrammar token classification) → packed bitmask, vocab/32 int32 words; runtime checks only context-dependent stragglers.
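A rough sketch of the packed layout: one bit per token, 32 tokens per word, so a row takes `ceil(vocab/32)` words. The bit order used here (token `t` at bit `t % 32` of word `t // 32`) is an assumption of this example, and the helper names are hypothetical.

```python
def pack_bitmask(allowed_ids, vocab_size):
    """Pack a set of allowed token ids into 32-bit words, one bit per token."""
    num_words = (vocab_size + 31) // 32  # ceil(vocab_size / 32)
    words = [0] * num_words
    for token_id in allowed_ids:
        words[token_id // 32] |= 1 << (token_id % 32)
    return words

def is_allowed(words, token_id):
    """Test the bit for token_id in a packed row."""
    return bool((words[token_id // 32] >> (token_id % 32)) & 1)
```

For a vocabulary of 128,256 tokens this gives 4,008 words per row, which is why moving and applying the mask is cheap next to the logits themselves.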

Performance model

  • Compile: once per distinct (type, spec) key; 10s–100s ms for big schemas; hits first request's TTFT only.
  • Per step: bitmask fill (CPU, parallelized above a batch threshold) + one fused mask kernel; low single-digit % overhead steady-state.
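The "compile once per distinct key, asynchronously" behavior can be sketched with a cache of futures. This is an illustrative pattern only: `GrammarCache` and `compile_fn` are hypothetical names, and the real manager's caching and scheduling logic may differ.

```python
from concurrent.futures import ThreadPoolExecutor

class GrammarCache:
    """Cache compile futures by (type, spec) so each grammar compiles once."""

    def __init__(self, compile_fn, max_workers=2):
        self._compile_fn = compile_fn  # hypothetical: (kind, spec) -> compiled grammar
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def get(self, kind, spec):
        """Return the Future for this grammar, submitting a compile on a miss."""
        key = (kind, spec)
        future = self._futures.get(key)
        if future is None:
            future = self._pool.submit(self._compile_fn, kind, spec)
            self._futures[key] = future
        return future

    def close(self):
        self._pool.shutdown(wait=True)
```

A request would stay unschedulable while `future.done()` is false. Later requests with the same key get the already-resolved future, so only the first one pays the compile latency.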

Spec-decode composition

  • Bitmask rows = max_num_seqs × (1 + num_spec_tokens) — one row per sampled position.
  • Fill row i after tentatively accepting drafts < i; then rollback(n), where n is the number of tentative advances, so the matcher ends the step in the state it started in.
  • Grammar interface has validate_tokens (check, no advance) + rollback(n); xgrammar matcher built with max_rollback_tokens=num_spec_tokens.
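The fill-then-rollback pattern can be sketched with a toy matcher. `ToyMatcher` is a hypothetical stand-in that accepts strings of the form (ab)*; it is not the xgrammar API. Leaving the rows after a grammar-rejected draft as all-allowed is an assumption of this sketch, since those positions are discarded anyway.

```python
class ToyMatcher:
    """Hypothetical matcher for (ab)*: token 0 is 'a', token 1 is 'b'."""
    VOCAB = ["a", "b"]

    def __init__(self):
        self.history = []  # accepted token ids, used for rollback

    def allowed(self):
        expected = len(self.history) % 2  # alternate a, b, a, b, ...
        return [i == expected for i in range(len(self.VOCAB))]

    def accept(self, token_id):
        if not self.allowed()[token_id]:
            return False
        self.history.append(token_id)
        return True

    def rollback(self, n):
        if n:  # guard: history[-0:] would select the whole list
            del self.history[-n:]

def fill_spec_rows(matcher, draft_ids):
    """Build 1 + len(draft_ids) mask rows, then restore the matcher state."""
    num_rows = 1 + len(draft_ids)
    rows = [[True] * len(matcher.VOCAB) for _ in range(num_rows)]
    advanced = 0
    for i in range(num_rows):
        rows[i] = matcher.allowed()  # mask for position i, given drafts < i
        if i == len(draft_ids) or not matcher.accept(draft_ids[i]):
            break
        advanced += 1
    matcher.rollback(advanced)  # undo every tentative advance
    return rows
```

After the call the matcher is back in its original state; only the tokens that survive verification are later committed through the normal accept path.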

Gotchas

  • One backend per engine, not per request (xgrammar by default; alternatives: guidance, outlines, lm-format-enforcer).
  • Reasoning models: constraint suspended during thinking (should_advance gate; mask row set all-ones) until the reasoning parser signals end.
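  • Valid ≠ true: the grammar guarantees syntax, not correct content. finish_reason="length" still truncates mid-structure, so budget max_tokens for the schema's worst case.
  • A constrained target with an unconstrained drafter can tank the spec-decode acceptance rate: drafts that violate the grammar get vetoed by the mask.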

Key upstream

  • vllm/sampling_params.py:41 StructuredOutputsParams
  • v1/structured_output/backend_types.py:31 Grammar :99 Backend (the contract)
  • v1/structured_output/__init__.py:36 Manager :115 grammar_init :204 grammar_bitmask :322 should_advance
  • v1/structured_output/backend_xgrammar.py:77 compile_grammar :132 XgrammarGrammar
  • v1/structured_output/utils.py:44 apply_grammar_bitmask (runner side, gpu_model_runner.py:4359)
  • v1/core/sched/scheduler.py:1259 get_grammar_bitmask

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md