Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 12 — Exercises: Structured Outputs

Contents


Warm-up (explain)

  1. Why mask logits per step instead of generating freely and rejecting invalid outputs?
  2. What machine do you need for a regex, and what more do you need for JSON? Why exactly can't an FSM handle JSON?
  3. Grammars constrain characters but models emit tokens — state the lifting rule for "token T is allowed in state S", and why it's too slow to evaluate naively per step.
Solution sketches
  1. Rejection is unbounded (a 1k-token output that's 99.9% reliable per token fails ~63% of the time; retries multiply cost and latency, and there's no guarantee of ever succeeding). Masking is O(1) per step and makes invalid output impossible while letting softmax renormalize over the legal set, preserving the model's preferences among valid continuations.
  2. Regex → FSM (finite states, transition table). JSON → pushdown automaton: nesting depth is unbounded and an FSM has finite memory, so it cannot ensure every { gets its } — you need a stack (push on open, pop on close).
  3. T allowed in S iff running T's characters through the automaton from S never dies. Naive cost = vocab × token_len automaton steps each decode step (~hundreds of thousands). xgrammar precomputes per-context token verdicts at compile time into a packed bitmask, leaving only a small context-dependent set for runtime.

Core (trace the code)

  1. StructuredOutputManager.grammar_init (__init__.py:115) — what is submitted to the executor, what does the request's grammar field hold meanwhile, and what does the scheduler do with such a request?
  2. In grammar_bitmask (__init__.py:204), the serial path calls accept_tokens on draft tokens and later rollback. Why advance at all, and why must it rewind?
  3. apply_grammar_bitmask (utils.py:44) builds sorted_bitmask filled with -1. What does a -1 word mean, and why does the function need to reorder rows at all?
  4. Why does the bitmask allocation reserve max_num_seqs × (1 + num_speculative_tokens) rows rather than max_num_seqs?
Solution sketches
  1. _create_grammar (the backend compile_grammar call) goes to a ThreadPoolExecutor; the field holds a Future. The grammar property (request.py:60) returns None until resolved, and the scheduler skips the request — it waits, unscheduled, so compile latency hits only that request's TTFT, never the engine loop.
  2. The mask for draft position i must reflect the state after drafts 1..i−1 — so the grammar advances through the drafts to compute successive rows. But accept/reject belongs to the rejection sampler after the forward pass; the grammar must rewind (rollback(state_advancements)) so the real outcome can be applied later.
  3. -1 = all bits set = every token allowed (int32 of all 1s) — the rows for unconstrained requests. Reordering: the scheduler emitted rows in its own request order, but logits rows follow the runner's batch order with spec-token offsets; struct_out_req_batch_indices maps request → logit row.
  4. With spec decode, each request samples at up to 1 + k positions per step (k drafts + bonus/correction), and each position needs its own mask row, stored inline.

Build (your lab)

  1. In your lab-01 FSM, add a choice(["yes", "no", "maybe"]) constraint without writing a new engine — express it as a regex and confirm the adversarial model can only emit one of the three.
  2. Measure your mask-cache hit rate: generate 200 tokens under a 5-state FSM and count distinct (state → mask) computations vs lookups. Relate the result to why xgrammar compiles ahead of time.
  3. In lab-03's pushdown, construct an input where the same automaton state has different allowed sets depending on the stack. Why does this kill any pure-FSM implementation?
Solution sketches
  1. choice is just alternation: yes|no|maybe — same FSM machinery (this is literally how CHOICE is lowered upstream). Adversarial run must end with output ∈ the set.
  2. Distinct computations = number of reachable FSM states (≤ 5); everything after is a lookup, so hit rate → ~100% quickly. Ahead-of-time compilation is this cache computed eagerly for every state at compile time.
  3. State "expecting close bracket" with stack [{ must allow } not ]; with [[ it must allow ] not }. Same control state, different stack top ⇒ allowed set is a function of (state, stack), which an FSM cannot represent for unbounded depth.

Design (staff-level)

  1. A tenant sends 10k requests/minute, each with a unique (uncacheable) 50 KB JSON schema; compiles take 200 ms of CPU. The engine also serves latency-sensitive unconstrained traffic. What breaks, and what do you do about it (at least three layers of defense)?
  2. Product asks: "constrained decoding makes outputs worse — the JSON is valid but the content got dumber." Is that plausible? Explain the mechanism and two mitigations.
  3. Spec decode (k=4) + structured output: derive the per-step grammar-work overhead vs non-spec, and explain why acceptance rate changes under constraints (which direction?).
  4. Design grammar-aware prefix caching: two requests share a long prompt but have different schemas. What can be shared (Phase 2/3 machinery), what cannot, and where would you hook the distinction in?
Solution sketches
  1. The compile executor saturates → constrained TTFT explodes; if compile work steals scheduler-process CPU, everyone's step time degrades. Defenses: bound + isolate the compile pool (separate cores/process); per-tenant compile-queue quotas and admission control (reject/429 on queue depth); schema canonicalization to recover cache hits; pre-registration API for schemas; cap schema size/complexity at the front door (upstream already rejects unsupported features in the processor — same layer).
  2. Plausible. The mask renormalizes over legal tokens, but the model's distribution was conditioned on its own preferred phrasing; forcing rare continuations (e.g. token boundaries that split words awkwardly) pushes it off-manifold and degrades content. Mitigations: schema-aware prompting (show the schema, few-shot valid examples) so the constrained path is also the high-probability path; looser grammars (whitespace flexibility — note xgrammar's any_whitespace flag); two-pass (free-form reason, then constrained extraction — exactly what the reasoning-gate in should_advance enables).
  3. Per step the grammar does k+1 fill_bitmask calls + k accept_tokens + one rollback(k) vs 1 fill — so ~(k+1)× the grammar CPU, still small vs the forward pass. Acceptance rises on structural tokens (mask forces draft and target into the same narrow set — high agreement) and can fall if the drafter is unconstrained while the target is masked (drafts get vetoed by structure the drafter didn't know about; vLLM mitigates by validating drafts with validate_tokens).
  4. KV blocks for the shared prompt are shareable as ever — KV depends only on token ids (the mask touches logits, not hidden states). What's not shareable is grammar state and compiled grammars (different schemas). Hook: nothing to change in the block hash — grammar state only matters from the first generated token; just keep grammar identity out of prefix-cache keys for prompt blocks (and be careful only if constraint applies to prompt — it doesn't).

Self-grading

4–7 and 11–14 are interview-grade. Could you whiteboard the per-step pipeline (state → bits → −inf → sample → advance) and the spec-decode advance/rollback dance from memory? If not, re-read 01-deep-dive.md §4.