LLM Optimization and Inferencing
Welcome to the LLM Optimization and Inferencing hands-on workshop, where you will gain practical experience serving models with vLLM and improving their performance and accuracy through benchmarking, configuration tuning, and quantization.
Module Overview
This workshop provides hands-on experience with enterprise vLLM deployment, benchmarking, and optimization. You will learn to deploy models efficiently, evaluate their performance and accuracy, and apply advanced optimization techniques that reduce serving costs while maintaining output quality.
🚀 Module 1: LLM Deployment
Deploy the Red Hat Inference Server across multiple platforms
- 1.1 RHEL Deployment: Set up the inference server on Red Hat Enterprise Linux with GPU support, container toolkit configuration, and model serving (a minimal client sketch follows this module)
- 1.2 OpenShift Deployment: Deploy using Helm charts and container orchestration for scalable inference
- 1.3 OpenShift AI Deployment: Leverage the Red Hat OpenShift AI platform for managed LLM serving with enterprise features
- 1.4 Platform Comparison: Understand deployment trade-offs and choose the right platform for your use case
Key Skills: Infrastructure setup, containerization, GPU configuration, cloud-native deployment
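
Whichever platform you choose, the deployed inference server exposes an OpenAI-compatible API. The sketch below is a minimal smoke test using the `openai` Python client; the base URL, API key, and model name are placeholders to replace with your deployment's actual values.

```python
# Minimal smoke test against a vLLM-based inference server deployment.
# Assumption: the server exposes an OpenAI-compatible API at BASE_URL and
# serves a model registered under MODEL_NAME (both are placeholders).
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"                 # replace with your server's route
MODEL_NAME = "ibm-granite/granite-3.3-8b-instruct"    # replace with your served model id

client = OpenAI(base_url=BASE_URL, api_key="EMPTY")   # vLLM accepts any key by default

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    max_tokens=128,
    temperature=0.0,
)
print(response.choices[0].message.content)
```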
📊 Module 2: Performance & Accuracy Evaluation
Measure and benchmark LLM systems for production readiness
- 2.1 Performance Evaluation: Use GuideLLM to measure latency, throughput, and resource utilization under realistic workloads (a simple measurement sketch follows this module)
- 2.2 Accuracy Assessment: Evaluate model quality, response relevance, and task-specific performance metrics
- 2.3 Evaluation Best Practices: Establish benchmarking workflows and continuous performance monitoring
Key Skills: Performance testing, quality assessment, benchmarking methodologies, production readiness validation
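
GuideLLM drives the full benchmark workloads in this module. The sketch below is not GuideLLM itself, just a minimal illustration of two of the headline metrics it reports, time to first token and generation throughput, measured with the `openai` streaming client against an assumed local endpoint.

```python
# Minimal latency/throughput probe (illustrative only; the module itself uses
# GuideLLM for proper load generation and reporting). Endpoint and model names
# are assumptions; replace them with your deployment's values.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL_NAME = "ibm-granite/granite-3.3-8b-instruct"  # placeholder

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "List three uses of model quantization."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # streamed chunks are a rough proxy for generated tokens

elapsed = time.perf_counter() - start
if first_token_at is not None:
    ttft = first_token_at - start
    print(f"time to first token: {ttft:.2f}s")
    print(f"approx. generation rate: {chunks / (elapsed - ttft):.1f} chunks/s")
```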
⚡ Module 3: vLLM Optimization
Maximize inference performance through tuning and configuration
- 3.1 Performance Tuning: Hands-on optimization of granite-3.3-8b-instruct for minimal latency in chat applications
- 3.2 Configuration Strategies: Master vLLM parameters, memory management, and batching for optimal performance (see the engine-configuration sketch after this module)
- 3.3 Scaling Techniques: Implement strategies for high-throughput serving and resource efficiency
Key Skills: Performance optimization, parameter tuning, inference scaling, latency reduction
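
As a taste of the parameters covered in 3.2, the sketch below starts an offline vLLM engine with a few commonly tuned knobs. The values shown are illustrative rather than recommendations, and flag names can shift between vLLM releases.

```python
# Illustrative vLLM engine configuration (offline/batch mode).
# Values are examples only; tune them against your own GPU and workload.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",
    gpu_memory_utilization=0.90,   # fraction of GPU memory for weights + KV cache
    max_model_len=4096,            # cap context length to shrink the KV cache
    max_num_seqs=64,               # upper bound on concurrently batched sequences
)

sampling = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain continuous batching in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```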
🔬 Module 4: Model Quantization
Reduce model size and memory requirements without sacrificing quality
- 4.1 Quantization Fundamentals: Understand W4A16 and W8A8 schemes and their impact on performance and accuracy
- 4.2 Implementation Labs: Hands-on quantization using LLM Compressor with SmoothQuant and GPTQ techniques (a minimal recipe sketch follows this module)
- 4.3 Production Pipelines: Build automated quantization workflows using OpenShift AI and evaluate results
Key Skills: Model compression, quantization techniques, memory optimization, automated ML pipelines
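
For orientation before the labs, the sketch below outlines a one-shot W8A8 quantization recipe in the style of the LLM Compressor examples, combining SmoothQuant with GPTQ. Treat it as an assumption-laden outline: module paths, argument names, and the calibration dataset shown here vary between llmcompressor releases, so follow the lab instructions for the authoritative version.

```python
# Sketch of a one-shot SmoothQuant + GPTQ (W8A8) recipe with LLM Compressor.
# Import paths and arguments follow upstream examples and may differ in your
# installed llmcompressor version; a small calibration dataset is assumed.
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),  # shift activation outliers into weights
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="ibm-granite/granite-3.3-8b-instruct",
    dataset="open_platypus",            # example calibration dataset name
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="granite-3.3-8b-instruct-W8A8",
)
```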
📚 Reference Materials
Business and technical guides for real-world application
- Enterprise Qualification Guide: Framework for identifying and qualifying LLM optimization opportunities with enterprise clients
- Technical Deep Dives: Comprehensive technical documentation on quantization methods and optimization strategies
- Model Comparison Examples: Pre-compressed model performance comparisons and selection criteria
🎯 Learning Outcomes
By the end of this workshop, you will be able to:
- Deploy production-ready LLM inference servers across multiple platforms
- Evaluate and benchmark LLM systems for performance and accuracy
- Optimize vLLM configurations for specific use cases and constraints
- Implement quantization techniques to reduce costs by 50-75%
- Build automated optimization pipelines for enterprise deployment
- Qualify and position LLM optimization opportunities with technical confidence