Lab 18-01 — Tune the Knobs: an SLO-Constrained Grid Search [CPU-OK]

Performance engineering is one loop, run with discipline: define metrics, measure a workload under a config, search the config space under a constraint. This lab has you build the whole loop on mini_vllm — a simulator that runs an arrival schedule (requests landing at different times, the thing every previous lab simplified away) and emits three metrics: per-request TTFT, the worst step's token count (the ITL-spike proxy), and total steps (throughput). Then grid_search sweeps budget × chunk-threshold under a hard spike SLO and returns the best legal config — refusing loudly when no config qualifies, because a quietly violated SLO is the worst outcome in the trade. Along the way the tests teach two facts that surprise most tuners: the chunk threshold is per-request (two chunked prefills still stack in one step — only the budget caps globally), and chunking can cost zero throughput when a long decode stream's steps are already there to hide the chunks in.

Why this lab exists

Every knob in this course got its own lab; this one is where they meet a workload — and workloads, not knobs, are what you actually tune for. The arrival schedule is the lab's quiet upgrade over everything before it: queueing (TTFT now includes waiting — test_queueing_shows_up_in_ttft), interference between requests that arrive at different moments, and the SLO-vs-throughput tension that only exists when both matter at once. The simulator is deliberately the cheapest possible version of the loop (steps and token counts, no GPUs, milliseconds per run) because the methodology is the deliverable: lab-02 runs the identical loop with vllm bench and wall-clocks, and the only thing that changes is the cost of each measurement — which is exactly why you prototype the search cheap.
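
The queueing effect can be seen in a few lines. The following is a minimal sketch, not the lab's simulate: it assumes a hypothetical engine that serves one request at a time (in the spirit of max_num_seqs=1), that each request occupies the engine for a fixed number of steps, and that the first token appears in the step the request is admitted. The function name and its parameters are illustrative only.

```python
def ttft_with_queueing(arrivals, service_steps):
    """TTFT in steps, measured from arrival, for a one-at-a-time toy engine.

    arrivals: arrival step of each request, in non-decreasing order.
    service_steps: how many steps each request occupies the engine (assumed).
    """
    free_at = 0  # first step at which the engine is idle again
    ttfts = []
    for arrival, service in zip(arrivals, service_steps):
        start = max(arrival, free_at)   # wait if the engine is still busy
        ttfts.append(start - arrival)   # waiting time counts toward TTFT
        free_at = start + service
    return ttfts


print(ttft_with_queueing([0, 1, 2], [4, 4, 4]))  # [0, 3, 6]
print(ttft_with_queueing([0, 10], [4, 4]))       # [0, 0]
```

In the first call the later arrivals inherit the backlog, so their TTFT grows even though every request costs the same. In the second the engine is idle at each arrival and the wait disappears, which is the behaviour an all-at-once benchmark cannot show.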

The grid search's design choices are the staff-engineer content: the SLO is a constraint, not a weighted term (latency SLOs are promises, not preferences); ties break toward lower worst-TTFT (when throughput is equal, take the latency); and unsatisfiability raises (test_unsatisfiable_slo_is_loud) — the tuning loop's version of the course's loud-failure habit, because "best effort" on an impossible SLO ships a violation with extra steps.
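
Those three design choices fit in one short function. This is a sketch under stated assumptions, not the reference grid_search: ToyMetrics, constrained_grid_search, fake_measure, and the numbers in FAKE_RESULTS are all hypothetical, and the real lab's signatures may differ.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ToyMetrics:
    """Hypothetical stand-in for the lab's Metrics; field names are assumptions."""
    worst_ttft: int
    max_step_tokens: int
    total_steps: int


def constrained_grid_search(measure, budgets, thresholds, spike_slo):
    """Return ((budget, threshold), metrics) for the best config within the SLO."""
    legal = []
    for budget, threshold in product(budgets, thresholds):
        metrics = measure(budget, threshold)
        # Hard constraint: a violating config is dropped, never merely penalized.
        if metrics.max_step_tokens <= spike_slo:
            legal.append(((budget, threshold), metrics))
    if not legal:
        # Loud failure: never hand back the least-bad violation.
        raise ValueError(f"no config keeps the worst step within {spike_slo} tokens")
    # Fewest steps wins; equal step counts break toward lower worst TTFT.
    return min(legal, key=lambda item: (item[1].total_steps, item[1].worst_ttft))


# Made-up measurements, for illustration only (not results from the lab).
FAKE_RESULTS = {
    (64, 32): ToyMetrics(worst_ttft=2, max_step_tokens=65, total_steps=24),
    (64, 16): ToyMetrics(worst_ttft=3, max_step_tokens=49, total_steps=24),
    (40, 32): ToyMetrics(worst_ttft=4, max_step_tokens=40, total_steps=25),
    (40, 16): ToyMetrics(worst_ttft=3, max_step_tokens=40, total_steps=25),
}


def fake_measure(budget, threshold):
    return FAKE_RESULTS[(budget, threshold)]


best_config, best_metrics = constrained_grid_search(
    fake_measure, budgets=[64, 40], thresholds=[32, 16], spike_slo=40
)
print(best_config)  # (40, 16): both budget=40 configs tie on steps, lower TTFT wins
```

With spike_slo=40 the two faster budget=64 configs are excluded outright, and with spike_slo=10 the call raises ValueError instead of returning anything.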

The three metrics, and what each proxies:

  • ttft_steps — steps from arrival (not admission!) to first token. Queueing, scheduling, prefill: all of it. The user-facing wait.
  • max_step_tokens — the worst step's total scheduled tokens ≈ the worst inter-token stall any decoding user felt (Phase 3 lab-05's proxy, now a tunable's objective).
  • total_steps — the schedule's length ≈ inverse throughput at fixed step cost. (The proxy's known bias: real steps' wall-clock varies with their token count — total tokens would weight differently; lab-02's wall-clocks settle it.)
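
To make the three definitions concrete, here is a small sketch that derives them from a per-step log. The summarize function and its input shapes are assumptions for illustration, not the lab's Metrics API. It also assumes a zero-based convention in which a request served in its arrival step has a TTFT of 0; the lab may count that case differently.

```python
def summarize(arrival_step, first_token_step, tokens_per_step):
    """Compute the three proxies from a toy schedule log.

    arrival_step / first_token_step: dict of request id -> step index.
    tokens_per_step: total tokens scheduled in each step, in order.
    """
    # Measured from arrival, so any time spent queued is included.
    ttft_steps = {
        rid: first_token_step[rid] - arrival_step[rid] for rid in arrival_step
    }
    return {
        "ttft_steps": ttft_steps,
        "max_step_tokens": max(tokens_per_step),  # worst stall proxy
        "total_steps": len(tokens_per_step),      # inverse-throughput proxy
    }


# Made-up log: request "b" arrives at step 0 but is first served at step 2.
report = summarize(
    arrival_step={"a": 0, "b": 0, "c": 3},
    first_token_step={"a": 0, "b": 2, "c": 3},
    tokens_per_step=[33, 32, 9, 5],
)
print(report["ttft_steps"])       # {'a': 0, 'b': 2, 'c': 0}
print(report["max_step_tokens"])  # 33
print(report["total_steps"])      # 4
```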

The two knobs swept are the course's latency dial (threshold) and throughput dial (budget) — and the search space is tiny on purpose. Real tuning fails far more often from unclear objectives than from undersized grids; get the constraint and the tiebreak right first, then enlarge the grid.

Files

  • starter.py: Metrics, simulate (arrivals + the Phase 1 lab-04 probe + first-token tracking), grid_search. Your work.
  • solution.py: reference.
  • test_lab.py: monotonicity, the per-request-vs-global cap lesson, queueing in TTFT, arrival-relative measurement, SLO compliance, and loud unsatisfiability.

Run

LAB_IMPL=starter pytest phase-18-performance-engineering/labs/lab-01-tune-the-knobs -q
pytest phase-18-performance-engineering/labs/lab-01-tune-the-knobs -q   # reference

What the tests prove

| Test | What it pins |
|---|---|
| test_throughput_more_budget_never_more_steps | The sanity direction every tuning loop needs before it can be trusted with anything subtle |
| test_spike_threshold_is_per_request_budget_is_global | The two-cap structure: threshold=32 still allows a 65-token step (two chunks + a decode); only budget=40 forces ≤ 40. And the surprise: chunking cost zero steps here, because the 24-token decode stream's steps were already there to hide chunks in. This is Sarathi's piggybacking, measured from the throughput side. A lonely fat prompt, with nowhere to hide, pays the full chunking step-count bill |
| test_queueing_shows_up_in_ttft | max_num_seqs=1 makes later arrivals wait, and the metric sees it; TTFT without queueing is a benchmark fiction |
| test_ttft_is_measured_from_arrival | The zero-point check: an idle engine serves first tokens in the arrival step. Metrics need calibration tests too |
| test_grid_search_respects_the_slo | The constrained search refuses the throughput-optimal-but-violating config; constraints beat preferences |
| test_unsatisfiable_slo_is_loud | An impossible SLO raises; it does not return the least-bad violation |
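
The two-cap arithmetic behind the 65-token and 40-token steps can be reproduced with a toy packing rule. This sketch is not mini_vllm's scheduler: tokens_in_step is a hypothetical helper, the 100-token prompt lengths are invented, and it assumes decodes are scheduled first at one token each, with prefills then taking chunks in order.

```python
def tokens_in_step(prefill_remaining, num_decodes, threshold, budget):
    """Tokens scheduled in one step under a per-request cap and a global cap.

    prefill_remaining: tokens still to prefill for each waiting prompt.
    threshold: largest prefill chunk any single request may take per step.
    budget: largest total token count the whole step may carry.
    """
    used = min(num_decodes, budget)  # each decoding request adds one token
    for remaining in prefill_remaining:
        # The threshold limits this request only; the budget limits the sum.
        chunk = min(remaining, threshold, budget - used)
        used += chunk
    return used


# Two long prompts plus one decoding request, as in the test's description.
print(tokens_in_step([100, 100], 1, threshold=32, budget=1000))  # 65
print(tokens_in_step([100, 100], 1, threshold=32, budget=40))    # 40
```

The first call shows the chunks stacking: 1 + 32 + 32 = 65, even though no single request exceeded the threshold. Only the second call, where the budget binds, holds the step to 40.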

Hitchhiker's notes

  • The hide-the-chunks result generalizes and matters: chunked prefill's throughput cost is max(0, chunk_steps − coexisting_decode_steps)-shaped. Fleets with deep decode streams (chat) chunk nearly free; bursty prefill-only fleets (batch summarization) pay full price — and that's also the fleet that didn't need the latency protection. The knob's cost and its benefit anti-correlate across workloads, which is why per-deployment tuning beats global defaults.
  • Arrival schedules are the difference between benchmarks and reality: this lab's three-request workload already produces queueing, interference, and hiding effects no all-at-once batch shows. Real benchmark suites (vllm bench serve) generate Poisson arrivals at a target QPS for the same reason — lab-02 uses exactly that.
  • Grid search is the right first search: 6 configs here, exhaustive, done. At real scale (5+ knobs), the same loop wraps Bayesian/successive-halving optimizers — but the metrics, the constraint handling, and the loud unsatisfiability transfer unchanged. The loop is the asset; the optimizer is a plug-in.
  • One proxy limitation to carry consciously: step counts can't see fixed per-step overheads (launch costs, scheduler time), so this simulator systematically favors many-small-steps configs vs what wall-clocks will say — Phase 5's whole subject is that bias. Cheap models with known biases, again (Phase 8 lab-04, Phase 15 lab-03): use them to shrink the expensive search, never to replace the final measurement.
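
The hide-the-chunks shape from the first note can be written down directly. The function below is a literal transcription of max(0, chunk_steps − coexisting_decode_steps) and only a rough model: the function name, the 128-token prompt, and the threshold of 32 are illustrative assumptions, and it counts the steps the chunks need beyond those the decode stream was already going to run.

```python
import math


def extra_steps_from_chunking(prompt_tokens, threshold, coexisting_decode_steps):
    """Steps chunked prefill adds beyond the decode steps already scheduled."""
    chunk_steps = math.ceil(prompt_tokens / threshold)
    # Chunks ride along in existing decode steps; only the overflow costs steps.
    return max(0, chunk_steps - coexisting_decode_steps)


# A chat-like fleet: a deep decode stream absorbs all four chunks.
print(extra_steps_from_chunking(128, 32, coexisting_decode_steps=24))  # 0
# A lonely prompt: nothing to hide in, so every chunk step is paid for.
print(extra_steps_from_chunking(128, 32, coexisting_decode_steps=0))   # 4
```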

Going further

  • Add a worst_ttft SLO as a second constraint and find workloads where the two SLOs conflict (spike cap wants small budget; TTFT wants big) — the multi-objective frontier, met honestly.
  • Generate Poisson arrivals (rng.poisson) at increasing rates and plot worst-TTFT vs offered load for two configs: the hockey stick where queueing takes over is the capacity limit, found by simulation — Phase 3 lab-04's going-further, completed.
  • Port simulate's probe to count tokens per step and weight total_steps by a per-step cost model (fixed + per-token). Calibrate the two constants against one lab-02 measurement, then re-run the grid. You've built the cheap-model/expensive-measurement two-tier loop production tuning actually uses.
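
As a starting point for the last exercise, a fixed + per-token cost model might look like the sketch below. The function name, the constants, and the two schedules are hypothetical; real constants would come from a wall-clock measurement, not from this example.

```python
def weighted_cost(tokens_per_step, fixed_cost, per_token_cost):
    """Schedule cost when each step pays a fixed overhead plus a per-token cost."""
    return sum(fixed_cost + per_token_cost * tokens for tokens in tokens_per_step)


# Two made-up schedules that move the same 128 tokens.
few_big_steps = [64, 64]
many_small_steps = [32, 32, 32, 32]

# With fixed_cost=1 and per_token_cost=0 the model reduces to total_steps.
print(weighted_cost(few_big_steps, 1, 0))     # 2
print(weighted_cost(many_small_steps, 1, 0))  # 4

# Illustrative constants in arbitrary time units (assumed, not calibrated).
print(weighted_cost(few_big_steps, 5.0, 0.5))     # 74.0
print(weighted_cost(many_small_steps, 5.0, 0.5))  # 84.0
```

Feeding this cost to the grid search in place of total_steps leaves the rest of the loop (constraint, tiebreak, loud failure) unchanged.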

References

  • Phase 1 lab-04 (the probe), Phase 3 labs 01/05 (the two caps and the spike) — the parts this lab assembles.
  • upstream/vllm/benchmarks/ and vllm bench — the production version of this loop (lab-02).
  • vLLM docs, Optimization and Tuning — the knobs' official guidance, now checkable against your own search: https://docs.vllm.ai/en/latest/configuration/optimization.html
  • Agrawal et al., Sarathi-Serve (OSDI 2024) — the piggybacking result your zero-cost-chunking test measured: https://arxiv.org/abs/2403.02310