vLLM & Performance Tuning

Existing Slides

Potential Topics to Cover in the Lab

Securing vLLM Endpoints

  • Managing service accounts for other apps

Troubleshooting vLLM instances

  • Where to find events/logs

vLLM Configuration

  • Sizing KV Cache for GPUs (worked example after this list) - https://redhatquickcourses.github.io/genai-vllm/genai-vllm/1/model_sizing/index.html

    • Configuring --max-model-len

    • KV Cache Quantization

      • --kv-cache-dtype
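
    A rough back-of-the-envelope calculation is useful here. The sketch below computes per-token KV cache size from a model's attention geometry and estimates how many tokens fit in GPU memory; the Llama-3-8B-style numbers (32 layers, 8 KV heads, head dim 128) and the 16 GiB weight footprint are illustrative assumptions, not figures from the course:

      # Back-of-the-envelope KV cache sizing (assumed Llama-3-8B-like geometry).
      num_layers   = 32   # transformer layers
      num_kv_heads = 8    # KV heads (GQA), not query heads
      head_dim     = 128  # dimension per head
      dtype_bytes  = 2    # fp16/bf16; 1 with --kv-cache-dtype fp8

      # Both K and V are cached, hence the factor of 2.
      kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
      print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")  # 128 KiB at fp16

      # Tokens that fit in what's left on an 80 GiB GPU after ~16 GiB of fp16
      # weights (assumed), at --gpu-memory-utilization 0.9.
      usable = 0.9 * 80 * 1024**3 - 16 * 1024**3
      print(f"~{int(usable // kv_bytes_per_token):,} tokens of KV cache")

    Halving dtype_bytes via --kv-cache-dtype fp8 doubles the token budget, which is the point of KV cache quantization.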

  • vLLM configuration/optimization best practices (launch sketch after this list)

    • --served-model-name

    • --tensor-parallel-size

    • --enable-expert-parallel

    • --gpu-memory-utilization

    • --max-num-batched-tokens

    • --enforce-eager

    • --limit-mm-per-prompt
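
    As a concrete reference point, the same knobs appear as keyword arguments on vLLM's offline Python API (the serve CLI flags map onto vllm.LLM / EngineArgs). A minimal sketch with illustrative values; --enable-expert-parallel and --limit-mm-per-prompt are omitted because they apply only to MoE and multimodal models respectively:

      from vllm import LLM, SamplingParams

      llm = LLM(
          model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
          served_model_name="llama-3-8b",  # --served-model-name: alias clients request
          tensor_parallel_size=2,          # --tensor-parallel-size: shard across 2 GPUs
          gpu_memory_utilization=0.90,     # --gpu-memory-utilization: fraction vLLM may claim
          max_model_len=8192,              # --max-model-len: context cap, bounds KV cache
          max_num_batched_tokens=8192,     # --max-num-batched-tokens: scheduler token budget
          kv_cache_dtype="fp8",            # --kv-cache-dtype: quantized KV cache
          enforce_eager=True,              # --enforce-eager: skip CUDA graphs (debug/startup)
      )

      out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
      print(out[0].outputs[0].text)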

  • Configuring tool calling
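
    Tool calling is enabled server-side and exercised through the OpenAI-compatible API. A client-side sketch, assuming the server was launched with --enable-auto-tool-choice and a --tool-call-parser matching the model family (the endpoint URL, model alias, and get_weather tool are hypothetical):

      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

      tools = [{
          "type": "function",
          "function": {
              "name": "get_weather",  # hypothetical tool
              "description": "Get the current weather for a city",
              "parameters": {
                  "type": "object",
                  "properties": {"city": {"type": "string"}},
                  "required": ["city"],
              },
          },
      }]

      resp = client.chat.completions.create(
          model="llama-3-8b",  # matches --served-model-name
          messages=[{"role": "user", "content": "What's the weather in Boston?"}],
          tools=tools,
      )
      print(resp.choices[0].message.tool_calls)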

  • Configuring speculative decoding
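
    Speculative decoding configuration has shifted across vLLM releases, so treat the sketch below as version-dependent. It uses draft-free n-gram (prompt lookup) speculation, which needs no separate draft model; the knob names assume a recent vLLM:

      from vllm import LLM, SamplingParams

      llm = LLM(
          model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
          speculative_config={
              "method": "ngram",            # propose tokens by matching the prompt
              "num_speculative_tokens": 5,  # draft tokens verified per step
              "prompt_lookup_max": 4,       # longest n-gram to match
          },
      )

      out = llm.generate(["The quick brown fox"], SamplingParams(max_tokens=32))
      print(out[0].outputs[0].text)

    Speculation is lossless (outputs match the base model); the tuning question is whether the acceptance rate pays for the verification overhead on a given workload.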

  • Prefill vs. decode phases

  • Time to First Token (TTFT)

  • Inter-token Latency (ITL) - measured in the sketch at the end of this section

  • Accuracy vs. latency trade-offs

  • Integer vs. floating-point data types

  • Model Architecture and GPU Architecture

  • Tuning/configuring vLLM

  • Performance analysis
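
    Tying the metrics together: TTFT is dominated by prefill, while inter-token latency reflects the decode loop, so the two respond to different knobs. A minimal measurement sketch against a running vLLM endpoint (the URL and model alias are assumptions):

      import time
      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

      start = time.perf_counter()
      stamps = []  # arrival time of each streamed token chunk
      stream = client.chat.completions.create(
          model="llama-3-8b",  # matches --served-model-name
          messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
          stream=True,
      )
      for chunk in stream:
          if chunk.choices and chunk.choices[0].delta.content:
              stamps.append(time.perf_counter())

      ttft = stamps[0] - start                                  # time to first token
      itl = (stamps[-1] - stamps[0]) / max(len(stamps) - 1, 1)  # mean inter-token latency
      print(f"TTFT: {ttft * 1000:.0f} ms, mean ITL: {itl * 1000:.1f} ms")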