Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 12 — Interview Questions: Structured Outputs

Staff/principal-level questions. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. How does vLLM guarantee valid JSON output?

Model answer

It compiles the JSON schema into a grammar automaton (xgrammar: a pushdown automaton, since JSON nests), and at each decode step computes a bitmask of tokens that keep the output grammar-valid, applying −inf to all illegal logits before sampling. Softmax renormalizes over the legal set, so the model still expresses preference among valid tokens — but invalid output is impossible by construction, not just unlikely. One grammar state per request, advanced as tokens are accepted.

Q2. Grammars are over characters; models emit multi-character tokens. How is that bridged efficiently?

Model answer

A token is legal iff its character sequence keeps the automaton alive from the current state. Checking that naively is vocab × token-length automaton steps per decode step. The production trick (xgrammar) is compile-time token classification: for each automaton context, precompute which tokens are unconditionally legal/illegal and pack them into a bitmask (vocab/32 int32 words); only a small context-dependent remainder (e.g. tokens interacting with the stack) is checked at runtime. Per-step cost becomes "copy a precomputed row + check a few stragglers."

Q3. Where does grammar compilation run, and why does that placement matter?

Model answer

On the scheduler side, in a thread-pool executor (StructuredOutputManager.grammar_init). The request's grammar field holds a Future, and the scheduler won't schedule the request until it resolves — so a 200 ms schema compile costs that request TTFT but never stalls the engine loop or other requests' steps. Compiled grammars are keyed by (type, spec) so repeated schemas don't recompile. The bitmask is also filled scheduler-side and shipped to workers as numpy (cheap serialization); the GPU only applies it.

Q4. How do structured outputs compose with speculative decoding?

Model answer

The bitmask carries one row per sampled position: for k draft tokens you need masks for the state before each draft plus the bonus position, so allocation is max_num_seqs × (1+k) rows. To compute row i the grammar tentatively accept_tokens's drafts 1..i−1, and after filling all rows it rollback's — the true accept/reject verdict belongs to the rejection sampler post-forward, which then advances the grammar by only the accepted prefix. rollback is part of the base grammar interface and xgrammar's matcher is constructed with max_rollback_tokens=num_speculative_tokens — composition was designed in, not patched on.

Q5. A customer says constrained outputs are "valid but dumber." Diagnose.

Model answer

Real effect. Masking renormalizes but the model wasn't conditioned to follow this grammar; when its preferred continuation is illegal, probability mass shifts to tokens it would rarely choose, pushing generation off-distribution (worst at awkward token boundaries and rigid whitespace rules). Mitigations: prompt with the schema and examples so the high-probability path is also the legal path; relax the grammar where harmless (xgrammar's any_whitespace); generate reasoning unconstrained and only constrain the final answer — vLLM's reasoning gate (should_advance) does exactly this for thinking models. Also check finish_reason="length": truncation, not the mask, is the most common "broken JSON" report.

Q6. Engine design: would you support different grammar backends per request? What does vLLM do and why?

Model answer

vLLM V1 deliberately supports one backend per engine (first constrained request picks it; see the NOTE in grammar_init). Per-request backends would mean multiple compiled-grammar caches, multiple bitmask allocation schemes, and validating every request's spec against every backend's feature matrix — for little gain since backends are interchangeable behind the two-interface contract (StructuredOutputBackend / StructuredOutputGrammar). The right extension point is the contract, not per-request dispatch: that's also how attention and quantization backends are handled.

Rapid-fire

  • Mask applied where? Logits, −inf, pre-softmax (one fused kernel, apply_grammar_bitmask).
  • Regex needs? FSM. JSON needs? Pushdown (stack). Schema → ? compiled to grammar first.
  • Bitmask row size? vocab_size / 32 int32 words. All -1 row = ? unconstrained.
  • Compile blocking the engine loop? Never — async executor, request waits unscheduled.
  • Spec-decode hooks in the grammar interface? validate_tokens, rollback(n).
  • What constraints can't fix: truth of content, and max_tokens truncation.