Phase 18 Labs — Performance Engineering

Two labs, one loop: define metrics → measure a workload under a config → search under an SLO constraint. First built cheap — a simulator over mini_vllm with arrival schedules, spike proxies, and a grid search that refuses impossible SLOs (lab-01) — then run for real with vllm bench serve, wall-clocks, percentile distributions, and the tuning report as the deliverable artifact (lab-02).

CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). By default the tests run against the solution; set LAB_IMPL=starter to grade your own work.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-18-performance-engineering/labs -m "not gpu"

# Grade yourself:
LAB_IMPL=starter pytest phase-18-performance-engineering/labs/lab-01-tune-the-knobs -q
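
As a rough illustration of the contract, a test file could choose its implementation from the environment along these lines. This is a minimal sketch, not the actual contents of test_lab.py: the helper names `impl_module_name` and `load_impl` are hypothetical, and it assumes starter.py and solution.py sit next to the test file on the import path.

```python
import importlib
import os


def impl_module_name(env=None):
    """Return the module the tests should import: 'solution' unless LAB_IMPL says otherwise."""
    env = os.environ if env is None else env
    name = env.get("LAB_IMPL", "solution")
    if name not in ("solution", "starter"):
        raise ValueError(f"LAB_IMPL must be 'solution' or 'starter', got {name!r}")
    return name


def load_impl(env=None):
    # Imports starter.py or solution.py; assumes the lab directory is on sys.path.
    return importlib.import_module(impl_module_name(env))
```

Rejecting unknown values means a typo such as LAB_IMPL=strater fails loudly instead of silently grading the reference solution.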

Labs

lab-01-tune-the-knobs [CPU-OK]

The loop, built cheap: a simulator running arrival schedules (queueing finally enters the course) with three metrics (TTFT-from-arrival, worst-step tokens as the spike proxy, and total steps) and an SLO-constrained grid search that breaks ties toward latency and raises on unsatisfiable SLOs. Two measured surprises: the chunk threshold applies per request (only the budget caps a step globally), and chunking can cost zero throughput when there are already decode steps for the chunks to hide in. Skills: constraints beat preferences; metric calibration tests; cheap models with known biases shrink expensive searches.
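
The shape of an SLO-constrained grid search can be sketched in a few lines. This is an illustration under stated assumptions, not the lab's reference solution: the names `grid_search`, `evaluate`, `ttft_slo`, and `spike_cap` are hypothetical, the objective (fewest total steps) is assumed, and the toy table holds made-up numbers rather than measurements.

```python
from itertools import product


def grid_search(evaluate, grid, ttft_slo, spike_cap):
    """Return (config, metrics) with the fewest total steps among configs that meet the SLO.

    evaluate(config) -> dict with keys 'ttft', 'worst_step_tokens', 'total_steps'
    grid: dict mapping knob name -> list of candidate values
    """
    names = sorted(grid)
    feasible = []
    for values in product(*(grid[n] for n in names)):
        config = dict(zip(names, values))
        m = evaluate(config)
        # The SLO is a constraint: configs that violate it are dropped, not penalized.
        if m["ttft"] <= ttft_slo and m["worst_step_tokens"] <= spike_cap:
            feasible.append((config, m))
    if not feasible:
        raise ValueError("no config satisfies the SLO")
    # Primary objective: fewest total steps. Ties break toward lower TTFT.
    return min(feasible, key=lambda r: (r[1]["total_steps"], r[1]["ttft"]))


# Made-up toy numbers keyed by (chunk, budget), purely for illustration.
TOY = {
    (64, 256): {"ttft": 8, "worst_step_tokens": 256, "total_steps": 40},
    (64, 512): {"ttft": 6, "worst_step_tokens": 512, "total_steps": 40},
    (128, 256): {"ttft": 9, "worst_step_tokens": 256, "total_steps": 38},
    (128, 512): {"ttft": 12, "worst_step_tokens": 512, "total_steps": 36},
}


def toy_evaluate(config):
    return TOY[(config["chunk"], config["budget"])]


GRID = {"chunk": [64, 128], "budget": [256, 512]}
best_config, best_metrics = grid_search(toy_evaluate, GRID, ttft_slo=8, spike_cap=512)
```

With `ttft_slo=8`, two configs tie on total steps and the one with the lower TTFT wins. Tightening the SLO until nothing qualifies raises instead of returning the least-bad config, which is the "constraints beat preferences" point.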

lab-02-benchmark-real-vllm [GPU-OPT]

The loop, run for real: vllm bench serve sweeps with warm servers, two runs per config, percentiles everywhere, the rate sweep that finds the knee first — and the tuning report as the artifact (workload, table, distributions, recommendation with its trade named). The captured sweep reconciles every row against the CPU labs that predicted it. Skills: the four methodology checks; conservation of suffering (admission knobs relocate latency); macro before micro; benchmark at the knee.
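
To make "percentiles everywhere" and "find the knee" concrete, here is a small stdlib-only sketch. It does not parse real vllm bench serve output. The functions `percentile` and `find_knee` are hypothetical, and the knee definition used here (the highest rate whose tail latency stays within a factor of the lowest-rate baseline) is one assumed convention among several.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


def find_knee(sweep, q=99, factor=2.0):
    """Return the highest rate before the q-th percentile latency exceeds factor x baseline.

    sweep: dict mapping request rate -> list of latency samples at that rate.
    The baseline is the q-th percentile at the lowest rate (an assumed convention).
    """
    rates = sorted(sweep)
    baseline = percentile(sweep[rates[0]], q)
    knee = rates[0]
    for rate in rates[1:]:
        if percentile(sweep[rate], q) > factor * baseline:
            break  # latency has left the flat region; stop at the first violation
        knee = rate
    return knee
```

Benchmarking a knob well below the knee tends to hide its effect, and benchmarking far above it mostly measures queueing, so locating the knee first tells you where the later sweeps should run.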

What you can do after this phase

Run a tuning engagement end to end: state the workload, find the knee, sweep one knob at a time with honest variance, report distributions, and recommend with the trade named — having prototyped the search cheaply enough to afford it. You can also audit anyone else's benchmark in about a minute (workload? warm? percentiles? one knob?), which is its own kind of superpower. Phase 19 sends everything upstream.