Phase 12 — Exercises: Structured Outputs
Contents
Warm-up (explain)
- Why mask logits per step instead of generating freely and rejecting invalid outputs?
- What machine do you need for a regex, and what more do you need for JSON? Why exactly can't an FSM handle JSON?
- Grammars constrain characters but models emit tokens — state the lifting rule for "token T is allowed in state S", and why it's too slow to evaluate naively per step.
Solution sketches
- Rejection is unbounded (a 1k-token output that's 99.9% reliable per token fails ~63% of the time; retries multiply cost and latency, and there's no guarantee of ever succeeding). Masking is O(1) per step and makes invalid output impossible while letting softmax renormalize over the legal set, preserving the model's preferences among valid continuations.
- Regex → FSM (finite states, transition table). JSON → pushdown automaton: nesting depth
is unbounded and an FSM has finite memory, so it cannot ensure every
{gets its}— you need a stack (push on open, pop on close). - T allowed in S iff running T's characters through the automaton from S never dies. Naive cost = vocab × token_len automaton steps each decode step (~hundreds of thousands). xgrammar precomputes per-context token verdicts at compile time into a packed bitmask, leaving only a small context-dependent set for runtime.
Core (trace the code)
StructuredOutputManager.grammar_init(__init__.py:115) — what is submitted to the executor, what does the request'sgrammarfield hold meanwhile, and what does the scheduler do with such a request?- In
grammar_bitmask(__init__.py:204), the serial path callsaccept_tokenson draft tokens and laterrollback. Why advance at all, and why must it rewind? apply_grammar_bitmask(utils.py:44) buildssorted_bitmaskfilled with-1. What does a-1word mean, and why does the function need to reorder rows at all?- Why does the bitmask allocation reserve
max_num_seqs × (1 + num_speculative_tokens)rows rather thanmax_num_seqs?
Solution sketches
_create_grammar(the backendcompile_grammarcall) goes to aThreadPoolExecutor; the field holds aFuture. Thegrammarproperty (request.py:60) returns None until resolved, and the scheduler skips the request — it waits, unscheduled, so compile latency hits only that request's TTFT, never the engine loop.- The mask for draft position i must reflect the state after drafts 1..i−1 — so the
grammar advances through the drafts to compute successive rows. But accept/reject
belongs to the rejection sampler after the forward pass; the grammar must rewind
(
rollback(state_advancements)) so the real outcome can be applied later. -1= all bits set = every token allowed (int32 of all 1s) — the rows for unconstrained requests. Reordering: the scheduler emitted rows in its own request order, but logits rows follow the runner's batch order with spec-token offsets;struct_out_req_batch_indicesmaps request → logit row.- With spec decode, each request samples at up to
1 + kpositions per step (k drafts + bonus/correction), and each position needs its own mask row, stored inline.
Build (your lab)
- In your lab-01 FSM, add a
choice(["yes", "no", "maybe"])constraint without writing a new engine — express it as a regex and confirm the adversarial model can only emit one of the three. - Measure your mask-cache hit rate: generate 200 tokens under a 5-state FSM and count distinct (state → mask) computations vs lookups. Relate the result to why xgrammar compiles ahead of time.
- In lab-03's pushdown, construct an input where the same automaton state has different allowed sets depending on the stack. Why does this kill any pure-FSM implementation?
Solution sketches
choiceis just alternation:yes|no|maybe— same FSM machinery (this is literally how CHOICE is lowered upstream). Adversarial run must end with output ∈ the set.- Distinct computations = number of reachable FSM states (≤ 5); everything after is a lookup, so hit rate → ~100% quickly. Ahead-of-time compilation is this cache computed eagerly for every state at compile time.
- State "expecting close bracket" with stack
[{must allow}not]; with[[it must allow]not}. Same control state, different stack top ⇒ allowed set is a function of (state, stack), which an FSM cannot represent for unbounded depth.
Design (staff-level)
- A tenant sends 10k requests/minute, each with a unique (uncacheable) 50 KB JSON schema; compiles take 200 ms of CPU. The engine also serves latency-sensitive unconstrained traffic. What breaks, and what do you do about it (at least three layers of defense)?
- Product asks: "constrained decoding makes outputs worse — the JSON is valid but the content got dumber." Is that plausible? Explain the mechanism and two mitigations.
- Spec decode (k=4) + structured output: derive the per-step grammar-work overhead vs non-spec, and explain why acceptance rate changes under constraints (which direction?).
- Design grammar-aware prefix caching: two requests share a long prompt but have different schemas. What can be shared (Phase 2/3 machinery), what cannot, and where would you hook the distinction in?
Solution sketches
- The compile executor saturates → constrained TTFT explodes; if compile work steals scheduler-process CPU, everyone's step time degrades. Defenses: bound + isolate the compile pool (separate cores/process); per-tenant compile-queue quotas and admission control (reject/429 on queue depth); schema canonicalization to recover cache hits; pre-registration API for schemas; cap schema size/complexity at the front door (upstream already rejects unsupported features in the processor — same layer).
- Plausible. The mask renormalizes over legal tokens, but the model's distribution was
conditioned on its own preferred phrasing; forcing rare continuations (e.g. token
boundaries that split words awkwardly) pushes it off-manifold and degrades content.
Mitigations: schema-aware prompting (show the schema, few-shot valid examples) so the
constrained path is also the high-probability path; looser grammars (whitespace
flexibility — note xgrammar's
any_whitespaceflag); two-pass (free-form reason, then constrained extraction — exactly what the reasoning-gate inshould_advanceenables). - Per step the grammar does k+1 fill_bitmask calls + k accept_tokens + one rollback(k)
vs 1 fill — so ~(k+1)× the grammar CPU, still small vs the forward pass. Acceptance
rises on structural tokens (mask forces draft and target into the same narrow set —
high agreement) and can fall if the drafter is unconstrained while the target is
masked (drafts get vetoed by structure the drafter didn't know about; vLLM mitigates by
validating drafts with
validate_tokens). - KV blocks for the shared prompt are shareable as ever — KV depends only on token ids (the mask touches logits, not hidden states). What's not shareable is grammar state and compiled grammars (different schemas). Hook: nothing to change in the block hash — grammar state only matters from the first generated token; just keep grammar identity out of prefix-cache keys for prompt blocks (and be careful only if constraint applies to prompt — it doesn't).
Self-grading
4–7 and 11–14 are interview-grade. Could you whiteboard the per-step pipeline (state → bits → −inf → sample → advance) and the spec-decode advance/rollback dance from memory? If not, re-read 01-deep-dive.md §4.