Inference Server on Multiple Platforms
Existing lab resources
- RH Inference server on multiple platforms: https://github.com/redhat-ai-services/inference-service-on-multiple-platforms
- RH Inference server tutorial: https://docs.google.com/document/d/11-Oiomiih78dBjIfClISSQBKqb0Ij4UJg31g0dO5XIc/edit?usp=sharing
Host Verification
Before proceeding, it is critical to verify that the host environment is correctly configured. Check driver status: after the system reboots, run the nvidia-smi (NVIDIA System Management Interface) command. A successful configuration displays a table listing all detected NVIDIA GPUs, their driver version, and the CUDA version.
nvidia-smi
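If you prefer a compact, scriptable check, nvidia-smi can also report just the fields of interest (optional; the plain command above is sufficient for this lab).
# optional: print only the GPU name, driver version, and total memory as CSV
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv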

Install nvidia-container-toolkit
The NVIDIA Container Toolkit is the crucial bridge that allows container runtimes like Podman or Docker to securely access the host’s GPUs.
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
# optional: enable the experimental packages in the repository
sudo dnf config-manager --enable nvidia-container-toolkit-experimental
sudo dnf install -y nvidia-container-toolkit
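Before generating the CDI specification, you can sanity-check that the toolkit CLI installed correctly and is on the PATH:
nvidia-ctk --version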
Configure CDI
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# check the config
nvidia-ctk cdi list

Test Container-GPU Access: To confirm that Podman can access the GPUs, run a simple test workload using a standard NVIDIA CUDA sample image. This step definitively validates the entire stack, from the driver to the container runtime.
sudo podman run --rm --device nvidia.com/gpu=all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
podman run --rm -it \
--security-opt=label=disable \
--device nvidia.com/gpu=all \
nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
nvidia-smi

Logging Into Red Hat Container Registry
Log in to registry.redhat.io with your Red Hat account credentials.
sudo podman login registry.redhat.io
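To confirm the credentials were saved, ask Podman which account is logged in to the registry (optional check):
sudo podman login --get-login registry.redhat.io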
Running vLLM on RHEL
Clone the repository containing the RH Inference Server assets for RHEL.
git clone https://github.com/redhat-ai-services/etx-llm-optimization-and-inference-leveraging.git
Run the vllm pod
sudo podman kube play etx-llm-optimization-and-inference-leveraging/optimization_lab/rhel/vllm-pod.yaml
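Pulling the image and loading the model can take several minutes. You can confirm the pod and its containers are up with the usual Podman commands (optional):
# list pods and their status
sudo podman pod ps
# list the containers inside the pod
sudo podman ps -a --pod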
Open a new terminal to follow the logs.
sudo podman logs --follow vllm-vllm
curl -X POST -H "Content-Type: application/json" -d '{
"prompt": "What is the capital of France?",
"max_tokens": 100
}' http://127.0.0.1:80/v1/completions | jq
Switch back to the other terminal and view the logs. You should see a log entry for the completed request.
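If vLLM rejects the request because no "model" field was supplied, list the models the server exposes and repeat the request with the reported id (it should match the model configured in vllm-pod.yaml; the benchmark below uses granite-3.0-2b-instruct):
curl http://127.0.0.1:80/v1/models | jq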

Run a quick benchmark test
We’ll use GuideLLM to run a quick benchmark and show how the RH Inference Server performs.
pip install guidellm
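If you prefer not to install guidellm into the system Python, the same install works inside a virtual environment (optional):
python3 -m venv ~/guidellm-venv
source ~/guidellm-venv/bin/activate
pip install guidellm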
Run a default benchmark
guidellm benchmark \
--target "http://127.0.0.1:80/v1" \
--model "granite-3.0-2b-instruct" \
--rate-type sweep \
--max-seconds 30 \
--data "prompt_tokens=256,output_tokens=128"
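The sweep rate type probes a range of request rates automatically. For a steady, repeatable load you can instead run at a constant rate; a sketch, assuming guidellm's constant rate type and --rate option are available in your version (check guidellm benchmark --help), with 5 requests per second as an arbitrary example:
guidellm benchmark \
--target "http://127.0.0.1:80/v1" \
--model "granite-3.0-2b-instruct" \
--rate-type constant \
--rate 5 \
--max-seconds 30 \
--data "prompt_tokens=256,output_tokens=128"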
Remove and clean up the vllm pod
sudo podman pod stop vllm && sudo podman pod rm vllm
Follow logs
sudo podman logs --follow vllm-vllm
curl http://localhost/version
# this is accessible from the internet
curl http://<public-ip-address>/version
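After the stop and rm commands above, listing the pods again should confirm that the vllm pod is gone (optional check):
sudo podman pod ps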