Phase 18 — Interview Questions: Performance Engineering
Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. Throughput is low and GPU utilization is ~30% at batch size 1–2. What's happening?
Model answer
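Almost certainly CPU-launch-bound decode: each step issues many tiny kernels, and the CPU cannot launch them fast enough to keep the GPU fed. Make sure CUDA graphs are enabled (in vLLM, that means not running with enforce_eager), increase batch size (raise max_num_seqs / accept more concurrency) so each launch does more work, and check for Python overhead on the hot path. Confirm with a profile showing idle gaps between kernels.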
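To make "gaps between kernels" concrete, here is a minimal sketch that takes kernel start/end timestamps and reports how much of the window the GPU was actually busy. The function name `gpu_busy_fraction` and the sample timeline are hypothetical; in practice the intervals would come from whatever your profiler exports, and this is not any profiler's API.

```python
def gpu_busy_fraction(kernels):
    """Return (busy_fraction, largest_gap) for (start, end) kernel intervals.

    Timestamps can be in any consistent unit (e.g. microseconds).
    Overlapping kernels (multiple streams) are merged so they are not
    double-counted as busy time.
    """
    if not kernels:
        raise ValueError("need at least one kernel interval")

    intervals = sorted(kernels)
    window_start = intervals[0][0]
    cur_start, cur_end = intervals[0]
    busy = 0.0
    largest_gap = 0.0

    for start, end in intervals[1:]:
        if start > cur_end:
            # GPU was idle between cur_end and start: close the busy run.
            busy += cur_end - cur_start
            largest_gap = max(largest_gap, start - cur_end)
            cur_start, cur_end = start, end
        else:
            # Overlapping or back-to-back kernel: extend the busy run.
            cur_end = max(cur_end, end)

    busy += cur_end - cur_start
    span = cur_end - window_start
    if span == 0:
        return 1.0, largest_gap
    return busy / span, largest_gap


# Hypothetical decode step: 3 us kernels separated by ~7 us of launch overhead.
timeline = [(0, 3), (10, 13), (20, 23), (30, 33), (37, 40)]
fraction, gap = gpu_busy_fraction(timeline)
print(f"busy {fraction:.1%}, largest gap {gap} us")
```

A busy fraction well below 100% with gaps of similar size between short kernels is the launch-bound signature; a compute-bound or memory-bound step shows kernels packed back to back.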
Q2. How do you decide max_num_batched_tokens and gpu_memory_utilization?
Model answer
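max_num_batched_tokens is the per-step token budget, so it trades prefill chunk size against decode latency: a larger budget improves prefill throughput, but a big prefill chunk in a step delays the decodes batched with it; tune to your prompt/output mix. gpu_memory_utilization sets the fraction of GPU memory the engine may use in total (weights, activations, and KV cache), and the KV cache gets what is left after weights and peak activations. Raise it to fit more concurrent sequences, but leave headroom for CUDA-graph buffers and anything else on the device to avoid OOM.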
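The memory side of this answer is easy to sanity-check with arithmetic. The sketch below is a simplified accounting model, not how any engine actually measures memory (a real engine typically profiles a forward pass); the function names and the example model shape (32 layers, 8 KV heads, head dimension 128, 2-byte KV dtype, 16 GiB of weights, 4 GiB activation peak) are assumptions chosen for illustration.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies: K and V for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_capacity(total_bytes, utilization, weight_bytes,
                            activation_peak_bytes, bytes_per_token):
    """Estimate how many tokens of KV cache fit in the memory budget.

    budget = total * utilization - weights - peak activations
    Returns 0 if the budget is not positive.
    """
    budget = int(total_bytes * utilization) - weight_bytes - activation_peak_bytes
    if budget <= 0:
        return 0
    return budget // bytes_per_token


# Assumed example: 80 GiB GPU, hypothetical 8B-class model with GQA.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)
for util in (0.80, 0.90, 0.95):
    tokens = kv_cache_token_capacity(80 * GIB, util, 16 * GIB, 4 * GIB, per_token)
    print(f"utilization {util:.2f}: ~{tokens:,} KV tokens")
```

Under these assumptions each token costs 128 KiB of KV cache, so every extra 0.05 of utilization on an 80 GiB device buys roughly 4 GiB, or about 32k more cached tokens. Dividing the token capacity by your typical sequence length gives a rough ceiling on concurrent sequences.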
Going deeper
The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.