vLLM & Performance Tuning
Existing Slides
- PSAP LLM Performance Benchmarking - July 11, 2025: https://docs.google.com/presentation/d/1IXReNsWRUcy1C9nGsnnhkG_H-OG5UQ2nYS2KmrXr340/edit?usp=sharing
Existing Lab Resources
- Training: vLLM Master Class: https://redhat-ai-services.github.io/vllm-showroom/modules/index.html
- Training: Optimizing vLLM for RHEL AI and OpenShift AI: https://rhpds.github.io/showroom-summit2025-lb2959-neural-magic/modules/index.html
- RH Inference Server docs - key vLLM serving arguments: https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/3.1/html-single/vllm_server_arguments/index#key-server-arguments-server-arguments
- vLLM: Optimizing and Serving Models on OpenShift AI: https://redhatquickcourses.github.io/genai-vllm/genai-vllm/1/index.html
Potential Topics to Cover in the Lab
vLLM Configuration
- Sizing KV Cache for GPUs - https://redhatquickcourses.github.io/genai-vllm/genai-vllm/1/model_sizing/index.html (worked sizing sketch below)
- Configuring --max-model-len
- KV Cache Quantization
  - --kv-cache-dtype
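
A minimal worked sketch of the KV cache sizing arithmetic in Python; the layer count, KV-head count, head dimension, weight footprint, and GPU size are illustrative assumptions for an 8B-class model on an 80 GB accelerator, not values taken from the linked course:

    # Back-of-the-envelope KV cache sizing (illustrative assumptions:
    # 32 layers, 8 KV heads, head_dim 128 -- roughly an 8B-class model).
    num_layers = 32
    num_kv_heads = 8
    head_dim = 128
    dtype_bytes = 2            # fp16/bf16 KV cache; fp8 (--kv-cache-dtype) would be 1

    # Per-token KV cache = 2 (K and V) * layers * KV heads * head_dim * bytes per element
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")

    # Tokens that fit in the memory vLLM leaves for KV cache on one GPU
    gpu_mem_gib = 80                   # assumed 80 GB accelerator
    weights_gib = 16                   # ~8B parameters at 2 bytes each
    gpu_memory_utilization = 0.90      # fraction of GPU memory vLLM may claim
    kv_budget = (gpu_mem_gib * gpu_memory_utilization - weights_gib) * 1024**3
    print(f"Approx. tokens of KV cache: {int(kv_budget // kv_bytes_per_token):,}")

Halving dtype_bytes via KV cache quantization (--kv-cache-dtype fp8) roughly doubles the token capacity, which ties this item to the quantization bullet above; --max-model-len then bounds how much of that capacity a single request may claim.
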
- vLLM configuration/optimization best practices (example configuration sketched after this list)
  - --served-model-name
  - --tensor-parallel-size
  - --enable-expert-parallel
  - --gpu-memory-utilization
  - --max-num-batched-tokens
  - --enforce-eager
  - --limit-mm-per-prompt
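
A minimal sketch tying these flags together, using vLLM's offline LLM entry point, whose keyword arguments mirror the CLI flags above (verify the flag-to-kwarg mapping against the Red Hat serving-arguments docs linked earlier); the model name and all values are illustrative assumptions for a 2-GPU setup, not tuned recommendations:

    from vllm import LLM, SamplingParams

    # Illustrative configuration only; values are assumptions, not recommendations.
    # The same options exist as `vllm serve` command-line flags.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",   # assumed example model
        served_model_name="llama-3-1-8b-instruct",  # name clients see (--served-model-name)
        tensor_parallel_size=2,                     # shard weights across 2 GPUs (--tensor-parallel-size)
        gpu_memory_utilization=0.90,                # fraction of GPU memory vLLM may use
        max_model_len=8192,                         # context length to reserve KV cache for (--max-model-len)
        max_num_batched_tokens=8192,                # per-step token budget for the scheduler
        kv_cache_dtype="fp8",                       # KV cache quantization (--kv-cache-dtype)
        enforce_eager=False,                        # True disables CUDA graphs (--enforce-eager)
        # enable_expert_parallel=True,              # MoE models only (--enable-expert-parallel)
        # limit_mm_per_prompt={"image": 1},         # multimodal models only (--limit-mm-per-prompt)
    )

    out = llm.generate(["What does --gpu-memory-utilization control?"],
                       SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)
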
- Configuring tool calling
- Configuring speculative decoding
- Prefill
- TTFT (time to first token; measurement sketch after this list)
- Inter-token latency (ITL)
- Accuracy vs. latency
- Integer vs. floating-point precision
- Model architecture and GPU architecture
- Tuning/configuring vLLM
- Performance analysis
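
For the TTFT, inter-token latency, and performance analysis items, a small measurement sketch against an already running OpenAI-compatible vLLM endpoint; the base URL and model name are placeholders, and each streamed chunk is treated as one token, which is approximately how vLLM streams:

    import time
    from openai import OpenAI

    # Assumes a vLLM server is already listening on an OpenAI-compatible endpoint;
    # base_url and model are placeholders to replace with the lab's values.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    start = time.perf_counter()
    arrival_times = []
    stream = client.chat.completions.create(
        model="llama-3-1-8b-instruct",
        messages=[{"role": "user", "content": "Explain the KV cache in one paragraph."}],
        max_tokens=128,
        stream=True,
    )
    for chunk in stream:
        # Count each content-bearing chunk as one token (close enough for vLLM).
        if chunk.choices and chunk.choices[0].delta.content:
            arrival_times.append(time.perf_counter())

    ttft = arrival_times[0] - start                                   # time to first token
    itl = [b - a for a, b in zip(arrival_times, arrival_times[1:])]   # inter-token gaps
    print(f"TTFT: {ttft * 1000:.1f} ms")
    print(f"Mean ITL: {sum(itl) / len(itl) * 1000:.1f} ms over {len(itl)} gaps")

Prefill cost shows up mostly in TTFT, while decode behavior shows up in ITL, which is why the two appear as separate topics above.
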