LLM Optimization and Inferencing

Welcome to our LLM Optimization and Inferencing hands-on workshop, where you will gain practical experience serving models with vLLM and optimizing them for performance and accuracy through benchmarking, configuration tuning, and quantization.

Module Overview

This workshop provides hands-on experience with enterprise vLLM deployment, benchmarking, and optimization. You’ll learn to deploy models efficiently, evaluate their performance and accuracy, and apply advanced techniques that reduce costs while maintaining quality.

🚀 Module 1: LLM Deployment

Deploy the Red Hat Inference Server across multiple platforms

  • 1.1 RHEL Deployment: Set up the inference server on Red Hat Enterprise Linux with GPU support, container toolkit configuration, and model serving; a minimal client smoke test is sketched after this module's skills list

  • 1.2 OpenShift Deployment: Deploy using Helm charts and container orchestration for scalable inference

  • 1.3 OpenShift AI Deployment: Leverage Red Hat OpenShift AI platform for managed LLM serving with enterprise features

  • 1.4 Platform Comparison: Understand deployment trade-offs and choose the right platform for your use case

Key Skills: Infrastructure setup, containerization, GPU configuration, cloud-native deployment
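
Whichever platform you choose, the served model exposes an OpenAI-compatible API. The sketch below is a minimal client-side smoke test; the endpoint URL, API key, and model name are placeholder assumptions to replace with your own deployment's values.

```python
# Minimal smoke test against an OpenAI-compatible vLLM endpoint.
# The base_url, api_key, and model name are placeholders; substitute
# the values from your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="not-needed",                 # vLLM ignores the key unless auth is configured
)

response = client.chat.completions.create(
    model="ibm-granite/granite-3.3-8b-instruct",  # example model used later in Module 3
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```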

📊 Module 2: Performance & Accuracy Evaluation

Measure and benchmark LLM systems for production readiness

  • 2.1 Performance Evaluation: Use GuideLLM to measure latency, throughput, and resource utilization under realistic workloads; a hand-rolled sketch of these measurements follows this module

  • 2.2 Accuracy Assessment: Evaluate model quality, response relevance, and task-specific performance metrics

  • 2.3 Evaluation Best Practices: Establish benchmarking workflows and continuous performance monitoring

Key Skills: Performance testing, quality assessment, benchmarking methodologies, production readiness validation
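
GuideLLM automates workload generation and reporting; purely to illustrate the kind of numbers it produces, the hedged sketch below measures per-request latency and output-token throughput by hand against an assumed local endpoint and model.

```python
# Hand-rolled latency/throughput probe illustrating what GuideLLM measures
# automatically. Endpoint, model name, and prompt set are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
prompts = ["Summarize vLLM in one sentence."] * 8  # tiny synthetic workload

latencies, generated = [], 0
start = time.perf_counter()
for prompt in prompts:
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="ibm-granite/granite-3.3-8b-instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    latencies.append(time.perf_counter() - t0)
    generated += resp.usage.completion_tokens
elapsed = time.perf_counter() - start

print(f"mean request latency: {sum(latencies) / len(latencies):.2f}s")
print(f"output throughput:    {generated / elapsed:.1f} tokens/s")
```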

⚡ Module 3: vLLM Optimization

Maximize inference performance through tuning and configuration

  • 3.1 Performance Tuning: Hands-on optimization of granite-3.3-8b-instruct for minimal latency in chat applications

  • 3.2 Configuration Strategies: Master vLLM parameters, memory management, and batching for optimal performance; an illustrative configuration is sketched after this module

  • 3.3 Scaling Techniques: Implement strategies for high-throughput serving and resource efficiency

Key Skills: Performance optimization, parameter tuning, inference scaling, latency reduction
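
As a taste of the parameters covered in 3.2, here is an illustrative configuration using vLLM's offline Python API. The values shown are assumed starting points to experiment with, not tuned recommendations for granite-3.3-8b-instruct.

```python
# Illustrative vLLM configuration; the parameter values are starting
# points to experiment with, not tuned recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",
    gpu_memory_utilization=0.90,   # fraction of GPU memory reserved for weights and KV cache
    max_model_len=4096,            # cap context length to shrink the KV cache
    max_num_seqs=64,               # upper bound on concurrently batched sequences
    enable_prefix_caching=True,    # reuse KV cache across shared prompt prefixes
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain continuous batching in two sentences."], params)
print(outputs[0].outputs[0].text)
```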

🔬 Module 4: Model Quantization

Reduce model size and memory requirements with minimal impact on quality

  • 4.1 Quantization Fundamentals: Understand the W4A16 and W8A8 schemes and their impact on performance and accuracy

  • 4.2 Implementation Labs: Hands-on quantization using LLM Compressor with SmoothQuant and GPTQ techniques; a one-shot recipe is sketched after this module

  • 4.3 Production Pipelines: Build automated quantization workflows using OpenShift AI and evaluate results

Key Skills: Model compression, quantization techniques, memory optimization, automated ML pipelines
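
The labs in 4.2 center on LLM Compressor's one-shot workflow. The sketch below shows an assumed SmoothQuant plus GPTQ recipe producing a W8A8 checkpoint; the model, calibration dataset, and sample counts are illustrative, and exact import paths may differ between llm-compressor releases.

```python
# One-shot W8A8 quantization sketch with LLM Compressor (SmoothQuant + GPTQ).
# Model, calibration dataset, and sample counts are illustrative choices;
# check the llm-compressor docs for the exact API of your installed version.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),  # shift activation outliers into weights
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),  # 8-bit weights and activations
]

oneshot(
    model="ibm-granite/granite-3.3-8b-instruct",
    dataset="open_platypus",              # calibration dataset (example choice)
    recipe=recipe,
    output_dir="granite-3.3-8b-instruct-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```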

📚 Reference Materials

Business and technical guides for real-world application

  • Enterprise Qualification Guide: Framework for identifying and qualifying LLM optimization opportunities with enterprise clients

  • Technical Deep Dives: Comprehensive technical documentation on quantization methods and optimization strategies

  • Model Comparison Examples: Pre-compressed model performance comparisons and selection criteria

🎯 Learning Outcomes

By the end of this workshop, you will be able to:

  • Deploy production-ready LLM inference servers across multiple platforms

  • Evaluate and benchmark LLM systems for performance and accuracy

  • Optimize vLLM configurations for specific use cases and constraints

  • Implement quantization techniques to reduce costs by 50-75%

  • Build automated optimization pipelines for enterprise deployment

  • Qualify and position LLM optimization opportunities with technical confidence

⏱️ Workshop Format

  • Duration: Full-day technical workshop

  • Format: Mix of theory, hands-on labs, and real-world scenarios

  • Prerequisites: Basic familiarity with containers, Kubernetes, and machine learning concepts

  • Environment: Access to OpenShift cluster with GPU resources