Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 12 — The Hitchhiker's Guide to Structured Outputs

Phase 11 · Course home · Phase 13

Contents


Don't Panic

Sometimes the output must be valid JSON, or match a regex, or follow a grammar — because a machine, not a human, is going to parse it. vLLM enforces this by computing, every decode step, the set of tokens that are legal next under the grammar, and setting every other token's logit to −inf before sampling. The model literally cannot emit an invalid token. No retries, no "I asked nicely," no post-hoc repair. The whole phase is that one move — mask, then sample — plus the machinery that makes it correct (a grammar automaton per request) and fast (a precompiled token bitmask per step).

            every step, per constrained request
            ┌────────────────────────────────────┐
 grammar ──►│ automaton state ──► allowed tokens │──► bitmask (vocab bits)
            └────────────────────────────────────┘            │
 model ────────────────► logits [vocab] ──── mask (−inf) ◄────┘
                                   │
                                   ▼
                               sample → token ──► advance automaton state

Step 1: The problem — "please output JSON" is a prayer, not a guarantee

A sampler picks from a probability distribution over ~100k tokens. Even a model that is 99.9% "JSON-reliable" per token will break a 1,000-token response about once per response. Tool calls, agents, extraction pipelines — anything that feeds model output to json.loads or an API schema — needs 0% failure, by construction. Prompting harder changes the probabilities; constrained decoding changes the support of the distribution. That's the difference between unlikely and impossible.

Step 2: The idea — make illegal tokens impossible, not unlikely

At any point mid-generation, the text so far puts the grammar in some state: "inside a string", "after a {, expecting a key or }", "just closed the top-level object — done". Each state defines exactly which next characters are legal. So:

  1. Track the grammar state as tokens are emitted.
  2. Before each sample, compute the allowed set for the current state.
  3. Mask: logits[token] = −inf for every token not in the set. Softmax renormalizes the probability over the legal tokens — the model still chooses which legal token, with its own preferences intact.
  4. After sampling, advance the state by the chosen token.

Note what this is not: it's not generate-then-validate (wasteful, unbounded retries), and it's not beam-searching for valid outputs (expensive). It's O(1 mask) per step, exact by construction.

Step 3: From spec to automaton — why JSON needs a stack

What machine tracks "the state"? Depends on the language class:

  • Regex → a finite-state machine (FSM). Finitely many states, a transition table next = δ[state][char]. Your lab-01 builds exactly this.
  • JSON / EBNF grammars → nesting is unbounded ([[[[…]]]]), and an FSM cannot count. You need a pushdown automaton: a state plus a stack (push on {/[, pop on }/]). Your lab-03 builds exactly this, and it's what xgrammar implements for real.
  • JSON Schema → compiled into such a grammar first ("key name then a string, key age then an integer…"). Schema → grammar → pushdown automaton is the production pipeline.
 regex      ──compile──►  FSM            (states × chars table)
 JSON-schema──compile──►  CFG ──►  pushdown automaton (state + stack)

Step 4: Characters vs tokens — the lifting problem

Grammars speak characters; the model emits tokens (multi-character chunks like {"name). A token is legal iff feeding its characters one-by-one through the automaton succeeds from the current state. Naively that's vocab_size × token_len automaton steps — per decode step. The production answer (xgrammar's core trick) is to do the expensive analysis at compile time: for each automaton context, precompute which tokens are context-independent (always legal / always illegal) and store them in a compressed token bitmask (vocab_size / 32 int32 words); only a small context-dependent remainder (tokens that interact with the stack) is checked at runtime. That's why the per-step cost is "fill a bitmask," not "simulate 100k tokens."

Step 5: The per-step pipeline in vLLM

The flow at the pinned commit (deep-dive walks every hop):

 SamplingParams.structured_outputs        {json=…, regex=…, choice=…, grammar=…}
        │ (request arrives)
        ▼
 StructuredOutputManager.grammar_init     compile grammar — async, off the hot path
        │  (request not schedulable until compiled — a new WAITING substate)
        ▼
 Scheduler.get_grammar_bitmask            each step: collect constrained requests
        │                                  → manager fills one bitmask row per request
        ▼  GrammarOutput (numpy bitmask, serialized to workers)
 gpu_model_runner: apply_grammar_bitmask  reorder rows to batch order, −inf via xgr kernel
        ▼
 sample → accepted tokens → grammar.accept_tokens() advances the automaton

Two production wrinkles worth noticing now (the deep-dive shows the code):

  • Compilation is async (ThreadPoolExecutor): a request whose grammar is still compiling simply isn't scheduled yet. Compile cost never blocks the engine loop.
  • Speculative decoding composes with this (Phase 8): the bitmask tensor holds one row per position (each draft token + the bonus token), and the grammar exposes rollback(n) so rejected drafts un-advance the automaton. Constraint + speculation, no special case.

Step 6: The costs, and where they're hidden

CostWhenHidden how
Grammar compile (schema → automaton + token tables)once per distinct grammarasync executor; cached by (type, spec) key
Bitmask fill (automaton state → vocab bits)per request, per stepcompile-time token classification; parallel fill above a batch threshold
Mask apply (−inf on logits)per stepone fused GPU kernel over the batch
State advanceper accepted tokentrivial (table/stack step)

The first request with a new big schema pays a visible TTFT hit (lab-02 measures it on real vLLM). Steady-state per-step overhead is small single-digit percent.

The invariants to memorize

  1. Constrained decoding = mask then sample: illegal tokens get −inf; softmax renormalizes over the legal set. Output is valid by construction.
  2. The machine matching the language: regex → FSM; JSON/EBNF → pushdown (stack); JSON Schema compiles down to the latter.
  3. Grammars speak chars, models speak tokens — the token bitmask is the precompiled lifting of char-rules to vocab entries (vocab/32 int32 words per position).
  4. One grammar state per request, advanced on accept, rolled back on spec-decode rejection. Compile happens once per distinct grammar, off the hot path.
  5. Honest caveat: constraints guarantee validity, not truth — and max_tokens can still truncate mid-structure (finish_reason="length"), which no mask can save you from.

What you'll do

  • Read: 01-deep-dive.md — the manager, the backend interface, xgrammar, and the scheduler/runner hops, all line-anchored.
  • Build: 02-mini-build.md — a regex-FSM grammar mask as a mini_vllm/grammar.py logits processor (reference implementation + tests included).
  • Labs (see labs/README.md; recommended order 01 → 03 → 02):
    • lab-01-regex-fsm-mask [CPU-OK] — compile a regex to an FSM, lift it to per-step token masks, and prove an adversarial model still always matches the regex.
    • lab-03-json-pushdown [CPU-OK] — why regex isn't enough: the stack-aware pushdown mask for a JSON subset; a brace-hating model still emits parseable JSON at depth 8.
    • lab-02-json-schema-constrained [GPU-OPT] — xgrammar via guided_json on real vLLM: 31/50 → 50/50 schema validity, first-request compile latency, the finish_reason="length" trap. Captured output included.
  • Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

Phase 11 · Course home · Phase 13