Phase 12 — The Hitchhiker's Guide to Structured Outputs
← Phase 11 · Course home · Phase 13 →
Contents
- Don't Panic
- Step 1: The problem — "please output JSON" is a prayer, not a guarantee
- Step 2: The idea — make illegal tokens impossible, not unlikely
- Step 3: From spec to automaton — why JSON needs a stack
- Step 4: Characters vs tokens — the lifting problem
- Step 5: The per-step pipeline in vLLM
- Step 6: The costs, and where they're hidden
- The invariants to memorize
- What you'll do
Don't Panic
Sometimes the output must be valid JSON, or match a regex, or follow a grammar — because a machine, not a human, is going to parse it. vLLM enforces this by computing, every decode step, the set of tokens that are legal next under the grammar, and setting every other token's logit to −inf before sampling. The model literally cannot emit an invalid token. No retries, no "I asked nicely," no post-hoc repair. The whole phase is that one move — mask, then sample — plus the machinery that makes it correct (a grammar automaton per request) and fast (a precompiled token bitmask per step).
every step, per constrained request
┌────────────────────────────────────┐
grammar ──►│ automaton state ──► allowed tokens │──► bitmask (vocab bits)
└────────────────────────────────────┘ │
model ────────────────► logits [vocab] ──── mask (−inf) ◄────┘
│
▼
sample → token ──► advance automaton state
Step 1: The problem — "please output JSON" is a prayer, not a guarantee
A sampler picks from a probability distribution over ~100k tokens. Even a model that is 99.9%
"JSON-reliable" per token will break a 1,000-token response about once per response. Tool
calls, agents, extraction pipelines — anything that feeds model output to json.loads or an
API schema — needs 0% failure, by construction. Prompting harder changes the probabilities;
constrained decoding changes the support of the distribution. That's the difference between
unlikely and impossible.
Step 2: The idea — make illegal tokens impossible, not unlikely
At any point mid-generation, the text so far puts the grammar in some state: "inside a
string", "after a {, expecting a key or }", "just closed the top-level object — done".
Each state defines exactly which next characters are legal. So:
- Track the grammar state as tokens are emitted.
- Before each sample, compute the allowed set for the current state.
- Mask:
logits[token] = −inffor every token not in the set. Softmax renormalizes the probability over the legal tokens — the model still chooses which legal token, with its own preferences intact. - After sampling, advance the state by the chosen token.
Note what this is not: it's not generate-then-validate (wasteful, unbounded retries), and it's not beam-searching for valid outputs (expensive). It's O(1 mask) per step, exact by construction.
Step 3: From spec to automaton — why JSON needs a stack
What machine tracks "the state"? Depends on the language class:
- Regex → a finite-state machine (FSM). Finitely many states, a transition table
next = δ[state][char]. Your lab-01 builds exactly this. - JSON / EBNF grammars → nesting is unbounded (
[[[[…]]]]), and an FSM cannot count. You need a pushdown automaton: a state plus a stack (push on{/[, pop on}/]). Your lab-03 builds exactly this, and it's what xgrammar implements for real. - JSON Schema → compiled into such a grammar first ("key
namethen a string, keyagethen an integer…"). Schema → grammar → pushdown automaton is the production pipeline.
regex ──compile──► FSM (states × chars table)
JSON-schema──compile──► CFG ──► pushdown automaton (state + stack)
Step 4: Characters vs tokens — the lifting problem
Grammars speak characters; the model emits tokens (multi-character chunks like
{"name). A token is legal iff feeding its characters one-by-one through the automaton
succeeds from the current state. Naively that's vocab_size × token_len automaton steps —
per decode step. The production answer (xgrammar's core trick) is to do the expensive
analysis at compile time: for each automaton context, precompute which tokens are
context-independent (always legal / always illegal) and store them in a compressed
token bitmask (vocab_size / 32 int32 words); only a small context-dependent remainder
(tokens that interact with the stack) is checked at runtime. That's why the per-step cost is
"fill a bitmask," not "simulate 100k tokens."
Step 5: The per-step pipeline in vLLM
The flow at the pinned commit (deep-dive walks every hop):
SamplingParams.structured_outputs {json=…, regex=…, choice=…, grammar=…}
│ (request arrives)
▼
StructuredOutputManager.grammar_init compile grammar — async, off the hot path
│ (request not schedulable until compiled — a new WAITING substate)
▼
Scheduler.get_grammar_bitmask each step: collect constrained requests
│ → manager fills one bitmask row per request
▼ GrammarOutput (numpy bitmask, serialized to workers)
gpu_model_runner: apply_grammar_bitmask reorder rows to batch order, −inf via xgr kernel
▼
sample → accepted tokens → grammar.accept_tokens() advances the automaton
Two production wrinkles worth noticing now (the deep-dive shows the code):
- Compilation is async (
ThreadPoolExecutor): a request whose grammar is still compiling simply isn't scheduled yet. Compile cost never blocks the engine loop. - Speculative decoding composes with this (Phase 8): the bitmask tensor holds one row per
position (each draft token + the bonus token), and the grammar exposes
rollback(n)so rejected drafts un-advance the automaton. Constraint + speculation, no special case.
Step 6: The costs, and where they're hidden
| Cost | When | Hidden how |
|---|---|---|
| Grammar compile (schema → automaton + token tables) | once per distinct grammar | async executor; cached by (type, spec) key |
| Bitmask fill (automaton state → vocab bits) | per request, per step | compile-time token classification; parallel fill above a batch threshold |
| Mask apply (−inf on logits) | per step | one fused GPU kernel over the batch |
| State advance | per accepted token | trivial (table/stack step) |
The first request with a new big schema pays a visible TTFT hit (lab-02 measures it on real vLLM). Steady-state per-step overhead is small single-digit percent.
The invariants to memorize
- Constrained decoding = mask then sample: illegal tokens get −inf; softmax renormalizes over the legal set. Output is valid by construction.
- The machine matching the language: regex → FSM; JSON/EBNF → pushdown (stack); JSON Schema compiles down to the latter.
- Grammars speak chars, models speak tokens — the token bitmask is the precompiled
lifting of char-rules to vocab entries (
vocab/32int32 words per position). - One grammar state per request, advanced on accept, rolled back on spec-decode rejection. Compile happens once per distinct grammar, off the hot path.
- Honest caveat: constraints guarantee validity, not truth — and
max_tokenscan still truncate mid-structure (finish_reason="length"), which no mask can save you from.
What you'll do
- Read: 01-deep-dive.md — the manager, the backend interface, xgrammar, and the scheduler/runner hops, all line-anchored.
- Build: 02-mini-build.md — a regex-FSM grammar mask as a
mini_vllm/grammar.pylogits processor (reference implementation + tests included). - Labs (see labs/README.md; recommended order 01 → 03 → 02):
lab-01-regex-fsm-mask[CPU-OK]— compile a regex to an FSM, lift it to per-step token masks, and prove an adversarial model still always matches the regex.lab-03-json-pushdown[CPU-OK]— why regex isn't enough: the stack-aware pushdown mask for a JSON subset; a brace-hating model still emits parseable JSON at depth 8.lab-02-json-schema-constrained[GPU-OPT]— xgrammar viaguided_jsonon real vLLM: 31/50 → 50/50 schema validity, first-request compile latency, thefinish_reason="length"trap. Captured output included.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
← Phase 11 · Course home · Phase 13 →