vLLM & Performance Tuning
Existing Slides
- PSAP LLM Performance Benchmarking - July 11, 2025: https://docs.google.com/presentation/d/1IXReNsWRUcy1C9nGsnnhkG_H-OG5UQ2nYS2KmrXr340/edit?usp=sharing
Existing Lab Resources
- Training: vLLM Master Class: https://redhat-ai-services.github.io/vllm-showroom/modules/index.html
- Training: Optimizing vLLM for RHEL AI and OpenShift AI: https://rhpds.github.io/showroom-summit2025-lb2959-neural-magic/modules/index.html
- RH Inference Server docs - key vLLM serving arguments: https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/3.1/html-single/vllm_server_arguments/index#key-server-arguments-server-arguments
- vLLM: Optimizing and Serving Models on OpenShift AI: https://redhatquickcourses.github.io/genai-vllm/genai-vllm/1/index.html
Potential Topics to Cover in the Lab
vLLM Configuration
- Sizing KV Cache for GPUs - https://redhatquickcourses.github.io/genai-vllm/genai-vllm/1/model_sizing/index.html (worked sizing sketch below)
- Configuring --max-model-len
- KV Cache Quantization
  - --kv-cache-dtype
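
A minimal worked sketch of the KV cache sizing arithmetic in Python; the layer count, KV-head count, head dimension, weight footprint, and GPU size are illustrative assumptions for an 8B-class model on an 80 GB accelerator, not values taken from the linked course:

    # Back-of-the-envelope KV cache sizing (illustrative assumptions:
    # 32 layers, 8 KV heads, head_dim 128 -- roughly an 8B-class model).
    num_layers = 32
    num_kv_heads = 8
    head_dim = 128
    dtype_bytes = 2            # fp16/bf16 KV cache; fp8 (--kv-cache-dtype) would be 1

    # Per-token KV cache = 2 (K and V) * layers * KV heads * head_dim * bytes per element
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")

    # Tokens that fit in the memory vLLM leaves for KV cache on one GPU
    gpu_mem_gib = 80                   # assumed 80 GB accelerator
    weights_gib = 16                   # ~8B parameters at 2 bytes each
    gpu_memory_utilization = 0.90      # fraction of GPU memory vLLM may claim
    kv_budget = (gpu_mem_gib * gpu_memory_utilization - weights_gib) * 1024**3
    print(f"Approx. tokens of KV cache: {int(kv_budget // kv_bytes_per_token):,}")

Halving dtype_bytes via KV cache quantization (--kv-cache-dtype fp8) roughly doubles the token capacity, which ties this item to the quantization bullet above; --max-model-len then bounds how much of that capacity a single request may claim.
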
- vLLM configuration/optimization best practices (example configuration sketched after this list)
  - --served-model-name
  - --tensor-parallel-size
  - --enable-expert-parallel
  - --gpu-memory-utilization
  - --max-num-batched-tokens
  - --enforce-eager
  - --limit-mm-per-prompt
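
A minimal sketch tying these flags together, using vLLM's offline LLM entry point, whose keyword arguments mirror the CLI flags above (verify the flag-to-kwarg mapping against the Red Hat serving-arguments docs linked earlier); the model name and all values are illustrative assumptions for a 2-GPU setup, not tuned recommendations:

    from vllm import LLM, SamplingParams

    # Illustrative configuration only; values are assumptions, not recommendations.
    # The same options exist as `vllm serve` command-line flags.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",   # assumed example model
        served_model_name="llama-3-1-8b-instruct",  # name clients see (--served-model-name)
        tensor_parallel_size=2,                     # shard weights across 2 GPUs (--tensor-parallel-size)
        gpu_memory_utilization=0.90,                # fraction of GPU memory vLLM may use
        max_model_len=8192,                         # context length to reserve KV cache for (--max-model-len)
        max_num_batched_tokens=8192,                # per-step token budget for the scheduler
        kv_cache_dtype="fp8",                       # KV cache quantization (--kv-cache-dtype)
        enforce_eager=False,                        # True disables CUDA graphs (--enforce-eager)
        # enable_expert_parallel=True,              # MoE models only (--enable-expert-parallel)
        # limit_mm_per_prompt={"image": 1},         # multimodal models only (--limit-mm-per-prompt)
    )

    out = llm.generate(["What does --gpu-memory-utilization control?"],
                       SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)
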
- Configuring tool calling
- Configuring speculative decoding
- Prefill
- TTFT (time to first token; measurement sketch after this list)
- Inter-token latency (ITL)
- Accuracy vs. latency
- Integer vs. floating-point precision
- Model architecture and GPU architecture
- Tuning/configuring vLLM
- Performance analysis
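
For the TTFT, inter-token latency, and performance analysis items, a small measurement sketch against an already running OpenAI-compatible vLLM endpoint; the base URL and model name are placeholders, and each streamed chunk is treated as one token, which is approximately how vLLM streams:

    import time
    from openai import OpenAI

    # Assumes a vLLM server is already listening on an OpenAI-compatible endpoint;
    # base_url and model are placeholders to replace with the lab's values.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    start = time.perf_counter()
    arrival_times = []
    stream = client.chat.completions.create(
        model="llama-3-1-8b-instruct",
        messages=[{"role": "user", "content": "Explain the KV cache in one paragraph."}],
        max_tokens=128,
        stream=True,
    )
    for chunk in stream:
        # Count each content-bearing chunk as one token (close enough for vLLM).
        if chunk.choices and chunk.choices[0].delta.content:
            arrival_times.append(time.perf_counter())

    ttft = arrival_times[0] - start                                   # time to first token
    itl = [b - a for a, b in zip(arrival_times, arrival_times[1:])]   # inter-token gaps
    print(f"TTFT: {ttft * 1000:.1f} ms")
    print(f"Mean ITL: {sum(itl) / len(itl) * 1000:.1f} ms over {len(itl)} gaps")

Prefill cost shows up mostly in TTFT, while decode behavior shows up in ITL, which is why the two appear as separate topics above.
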