Phase 16 — Interview Questions: Serving APIs & Parsers

Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. How does vLLM implement tool calling on top of plain text generation?

Model answer

The server injects the tool schemas into the prompt, typically through the model's chat template. A model-specific tool-call parser then extracts the function name and JSON arguments from the generated text (incrementally when streaming) and emits OpenAI-style tool_calls. Structured output can additionally hard-constrain the arguments to the tool's JSON schema during decoding.
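To make the parser step concrete, here is a minimal non-streaming sketch. It assumes a Hermes-style output format in which the model wraps each call in <tool_call>...</tool_call> tags; other model families use different delimiters, which is why the parser is model-specific. The name extract_tool_calls is hypothetical and is not vLLM's API.

```python
import json
import re
import uuid

# Assumption: Hermes-style markup. Other models use different delimiters.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Split generated text into (content, tool_calls).

    tool_calls follows the OpenAI response shape, where
    function.arguments is a JSON-encoded string, not a dict.
    """
    tool_calls = []

    def _replace(match):
        try:
            payload = json.loads(match.group(1))
            name = payload["name"]
            arguments = payload.get("arguments", {})
        except (json.JSONDecodeError, KeyError, TypeError):
            # Not a valid call: keep the raw text as ordinary content.
            return match.group(0)
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {"name": name, "arguments": json.dumps(arguments)},
        })
        return ""  # valid calls are removed from the visible content

    content = TOOL_CALL_RE.sub(_replace, text).strip()
    return content or None, tool_calls


generated = (
    'Let me check. <tool_call>{"name": "get_weather", '
    '"arguments": {"city": "Paris"}}</tool_call>'
)
content, calls = extract_tool_calls(generated)
```

A good follow-up point in an interview: this sketch only validates that the arguments are JSON, not that they match the tool's schema. Closing that gap is what constrained decoding is for.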

Q2. What's tricky about streaming responses?

Model answer

You must emit incremental deltas while keeping the chunk semantics correct (the role is sent in the first chunk, finish_reason only on the final one), and parse partial content (tool-call JSON, reasoning tags) that spans multiple chunks without committing to an interpretation too early.
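The "don't commit too early" problem can be sketched with reasoning tags. The example below assumes the model marks reasoning with <think>...</think>, and the class StreamingTagSplitter is a hypothetical illustration rather than vLLM's reasoning parser. The key move is holding back any tail of the buffer that could still turn out to be the start of a tag.

```python
class StreamingTagSplitter:
    """Route streamed text to 'reasoning' or 'content' around tag pairs."""

    def __init__(self, open_tag="<think>", close_tag="</think>"):
        self.open_tag = open_tag
        self.close_tag = close_tag
        self.in_reasoning = False
        self.buffer = ""  # text received but not yet safe to emit

    def feed(self, delta):
        """Consume one chunk and return a list of (kind, text) events."""
        self.buffer += delta
        events = []
        while self.buffer:
            tag = self.close_tag if self.in_reasoning else self.open_tag
            kind = "reasoning" if self.in_reasoning else "content"
            idx = self.buffer.find(tag)
            if idx != -1:
                # A complete tag arrived: emit what precedes it, switch mode.
                if idx:
                    events.append((kind, self.buffer[:idx]))
                self.buffer = self.buffer[idx + len(tag):]
                self.in_reasoning = not self.in_reasoning
                continue
            # No complete tag: hold back the longest tail that is a tag prefix.
            hold = 0
            for n in range(min(len(tag) - 1, len(self.buffer)), 0, -1):
                if tag.startswith(self.buffer[-n:]):
                    hold = n
                    break
            cut = len(self.buffer) - hold
            if cut:
                events.append((kind, self.buffer[:cut]))
            self.buffer = self.buffer[cut:]
            break
        return events

    def flush(self):
        """At end of stream, emit any held-back text as its current kind."""
        events = []
        if self.buffer:
            kind = "reasoning" if self.in_reasoning else "content"
            events.append((kind, self.buffer))
            self.buffer = ""
        return events


splitter = StreamingTagSplitter()
events = []
for chunk in ["<th", "ink>plan", " it</th", "ink>Hel", "lo"]:
    events.extend(splitter.feed(chunk))
events.extend(splitter.flush())

reasoning = "".join(text for kind, text in events if kind == "reasoning")
content = "".join(text for kind, text in events if kind == "content")
```

Here the tags are split across chunk boundaries, yet no fragment such as "<th" is ever emitted as content. The cost is latency on the held-back characters, which is the trade-off worth naming out loud.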

Going deeper

The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.
