LLM Optimization and Inferencing

Welcome to our LLM Optimization and Inferencing hands-on workshop, where you will gain practical experience serving models with vLLM and optimizing them for performance and accuracy through benchmarking, configuration tuning, and quantization.

Module Overview

This workshop provides hands-on experience with enterprise vLLM deployment, benchmarking, and optimization. You’ll learn to deploy models efficiently, evaluate their performance and accuracy, and apply advanced techniques that reduce costs while maintaining quality.

🚀 Module 1: LLM Deployment

Deploy the Red Hat Inference Server across multiple platforms

  • 1.1 RHEL Deployment: Set up the inference server on Red Hat Enterprise Linux with GPU support, container toolkit configuration, and model serving; a minimal client smoke test is sketched after this module's skills list

  • 1.2 OpenShift Deployment: Deploy using Helm charts and container orchestration for scalable inference

  • 1.3 OpenShift AI Deployment: Leverage Red Hat OpenShift AI platform for managed LLM serving with enterprise features

  • 1.4 Platform Comparison: Understand deployment trade-offs and choose the right platform for your use case

Key Skills: Infrastructure setup, containerization, GPU configuration, cloud-native deployment
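
Whichever platform you choose, the served model exposes an OpenAI-compatible API. The sketch below is a minimal client-side smoke test; the endpoint URL, API key, and model name are placeholder assumptions to replace with your own deployment's values.

```python
# Minimal smoke test against an OpenAI-compatible vLLM endpoint.
# The base_url, api_key, and model name are placeholders; substitute
# the values from your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="not-needed",                 # vLLM ignores the key unless auth is configured
)

response = client.chat.completions.create(
    model="ibm-granite/granite-3.3-8b-instruct",  # example model used later in Module 3
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```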

📊 Module 2: Performance & Accuracy Evaluation

Measure and benchmark LLM systems for production readiness

  • 2.1 Performance Evaluation: Use GuideLLM to measure latency, throughput, and resource utilization under realistic workloads; a hand-rolled sketch of these measurements follows this module

  • 2.2 Accuracy Assessment: Evaluate model quality, response relevance, and task-specific performance metrics

  • 2.3 Evaluation Best Practices: Establish benchmarking workflows and continuous performance monitoring

Key Skills: Performance testing, quality assessment, benchmarking methodologies, production readiness validation
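
GuideLLM automates workload generation and reporting; purely to illustrate the kind of numbers it produces, the hedged sketch below measures per-request latency and output-token throughput by hand against an assumed local endpoint and model.

```python
# Hand-rolled latency/throughput probe illustrating what GuideLLM measures
# automatically. Endpoint, model name, and prompt set are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
prompts = ["Summarize vLLM in one sentence."] * 8  # tiny synthetic workload

latencies, generated = [], 0
start = time.perf_counter()
for prompt in prompts:
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="ibm-granite/granite-3.3-8b-instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    latencies.append(time.perf_counter() - t0)
    generated += resp.usage.completion_tokens
elapsed = time.perf_counter() - start

print(f"mean request latency: {sum(latencies) / len(latencies):.2f}s")
print(f"output throughput:    {generated / elapsed:.1f} tokens/s")
```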

⚡ Module 3: vLLM Optimization

Maximize inference performance through tuning and configuration

  • 3.1 Performance Tuning: Hands-on optimization of granite-3.3-8b-instruct for minimal latency in chat applications

  • 3.2 Configuration Strategies: Master vLLM parameters, memory management, and batching for optimal performance; an illustrative configuration is sketched after this module

  • 3.3 Scaling Techniques: Implement strategies for high-throughput serving and resource efficiency

Key Skills: Performance optimization, parameter tuning, inference scaling, latency reduction
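
As a taste of the parameters covered in 3.2, here is an illustrative configuration using vLLM's offline Python API. The values shown are assumed starting points to experiment with, not tuned recommendations for granite-3.3-8b-instruct.

```python
# Illustrative vLLM configuration; the parameter values are starting
# points to experiment with, not tuned recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",
    gpu_memory_utilization=0.90,   # fraction of GPU memory reserved for weights and KV cache
    max_model_len=4096,            # cap context length to shrink the KV cache
    max_num_seqs=64,               # upper bound on concurrently batched sequences
    enable_prefix_caching=True,    # reuse KV cache across shared prompt prefixes
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain continuous batching in two sentences."], params)
print(outputs[0].outputs[0].text)
```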

🔬 Module 4: Model Quantization

Reduce model size and memory requirements with minimal impact on quality

  • 4.1 Quantization Fundamentals: Understand the W4A16 and W8A8 schemes and their impact on performance and accuracy

  • 4.2 Implementation Labs: Hands-on quantization using LLM Compressor with SmoothQuant and GPTQ techniques; a one-shot recipe is sketched after this module

  • 4.3 Production Pipelines: Build automated quantization workflows using OpenShift AI and evaluate results

Key Skills: Model compression, quantization techniques, memory optimization, automated ML pipelines
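
The labs in 4.2 center on LLM Compressor's one-shot workflow. The sketch below shows an assumed SmoothQuant plus GPTQ recipe producing a W8A8 checkpoint; the model, calibration dataset, and sample counts are illustrative, and exact import paths may differ between llm-compressor releases.

```python
# One-shot W8A8 quantization sketch with LLM Compressor (SmoothQuant + GPTQ).
# Model, calibration dataset, and sample counts are illustrative choices;
# check the llm-compressor docs for the exact API of your installed version.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),  # shift activation outliers into weights
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),  # 8-bit weights and activations
]

oneshot(
    model="ibm-granite/granite-3.3-8b-instruct",
    dataset="open_platypus",              # calibration dataset (example choice)
    recipe=recipe,
    output_dir="granite-3.3-8b-instruct-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```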

📚 Reference Materials

Business and technical guides for real-world application

  • Enterprise Qualification Guide: Framework for identifying and qualifying LLM optimization opportunities with enterprise clients

  • Technical Deep Dives: Comprehensive technical documentation on quantization methods and optimization strategies

  • Model Comparison Examples: Pre-compressed model performance comparisons and selection criteria

🎯 Learning Outcomes

By the end of this workshop, you will be able to:

  • Deploy production-ready LLM inference servers across multiple platforms

  • Evaluate and benchmark LLM systems for performance and accuracy

  • Optimize vLLM configurations for specific use cases and constraints

  • Implement quantization techniques to reduce costs by 50-75%

  • Build automated optimization pipelines for enterprise deployment

  • Qualify and position LLM optimization opportunities with technical confidence

⏱️ Workshop Format

  • Duration: Full-day technical workshop

  • Format: Mix of theory, hands-on labs, and real-world scenarios

  • Prerequisites: Basic familiarity with containers, Kubernetes, and machine learning concepts

  • Environment: Access to OpenShift cluster with GPU resources