Work in progress!
We have successfully deployed vLLM on a single node with multiple GPUs; the natural next step is to deploy a vLLM instance across multiple nodes, each with multiple GPUs.
In order to do that, we'll need to process and deploy a template from the redhat-ods-applications project:

oc process vllm-multinode-runtime-template -n redhat-ods-applications | oc apply -n vllm -f -
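Before moving on, it's worth a quick sanity check that the processed object actually landed in the vllm namespace (the resource name comes from the template above):

oc get servingruntimes -n vllm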
In order for this architecture to work, the model needs to be stored in an RWX (ReadWriteMany) PVC, which we have already prepared for your convenience in the project:
oc get pvc llama-model
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                VOLUMEATTRIBUTESCLASS   AGE
llama-model   Bound    pvc-3d02084b-66d8-4e49-9eac-1bf9ab4f05ed   60Gi       RWX            ocs-storagecluster-cephfs   <unset>                 25h
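If you are reproducing this on another cluster, the claim has to exist (and be populated with the model files) before the InferenceService is created. Here is a minimal sketch matching the output above; the storage class is the one on our cluster, so substitute any RWX-capable class of your own:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-model
  namespace: vllm
spec:
  accessModes:
  - ReadWriteMany   # RWX: the head and worker pods all mount the same volume
  resources:
    requests:
      storage: 60Gi
  storageClassName: ocs-storagecluster-cephfs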
Anyone deploying the template as-is is going to run into issues, so here is the working ServingRuntime object:
---
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/accelerator-name: migrated-gpu
    opendatahub.io/apiProtocol: REST
    opendatahub.io/hardware-profile-name: ""
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
    opendatahub.io/template-display-name: vLLM Multi-Node ServingRuntime for KServe
    openshift.io/display-name: vllm-multinode-runtime
  labels:
    opendatahub.io/dashboard: "true"
  name: vllm-multinode-runtime
  namespace: vllm
spec:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
  containers:
  - args:
    - |
      ray start --head --disable-usage-stats --include-dashboard false
      # Wait for the other nodes to join
      until [[ $(ray status --address ${RAY_ADDRESS} | grep -c node_) -eq ${PIPELINE_PARALLEL_SIZE} ]]; do
        echo "Waiting..."
        sleep 1
      done
      ray status --address ${RAY_ADDRESS}

      export SERVED_MODEL_NAME=${MODEL_NAME}
      export MODEL_NAME=${MODEL_DIR}

      exec python3 -m vllm.entrypoints.openai.api_server --port=8080 --max-model-len=100000 --distributed-executor-backend ray --model=${MODEL_NAME} --served-model-name=${SERVED_MODEL_NAME} --tensor-parallel-size=${TENSOR_PARALLEL_SIZE} --pipeline-parallel-size=${PIPELINE_PARALLEL_SIZE} --disable_custom_all_reduce
    command:
    - bash
    - -c
    env:
    - name: RAY_USE_TLS
      value: "1"
    - name: RAY_TLS_SERVER_CERT
      value: /etc/ray/tls/tls.pem
    - name: RAY_TLS_SERVER_KEY
      value: /etc/ray/tls/tls.pem
    - name: RAY_TLS_CA_CERT
      value: /etc/ray/tls/ca.crt
    - name: RAY_PORT
      value: "6379"
    - name: RAY_ADDRESS
      value: 127.0.0.1:6379
    - name: POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: VLLM_NO_USAGE_STATS
      value: "1"
    - name: HOME
      value: /tmp
    - name: HF_HOME
      value: /tmp/hf_home
    image: quay.io/modh/vllm@sha256:79e1f24bba1d3e694f47f66ba9f8184e70310a10b77bf11c0febd0c926234950
    livenessProbe:
      exec:
        command:
        - bash
        - -c
        - |
          # Check that the registered Ray node count is greater than or equal to PIPELINE_PARALLEL_SIZE
          registered_node_count=$(ray status --address ${RAY_ADDRESS} | grep -c node_)
          if [[ ! ${registered_node_count} -ge "${PIPELINE_PARALLEL_SIZE}" ]]; then
            echo "Unhealthy - Registered node count (${registered_node_count}) must not be less than PIPELINE_PARALLEL_SIZE (${PIPELINE_PARALLEL_SIZE})."
            exit 1
          fi

          # Check model health
          health_check=$(curl -o /dev/null -s -w "%{http_code}\n" http://localhost:8080/health)
          if [[ ${health_check} != 200 ]]; then
            echo "Unhealthy - vLLM Runtime Health Check failed."
            exit 1
          fi
      failureThreshold: 2
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 15
    name: kserve-container
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - bash
        - -c
        - |
          # Check model health
          health_check=$(curl -o /dev/null -s -w "%{http_code}\n" http://localhost:8080/health)
          if [[ ${health_check} != 200 ]]; then
            echo "Unhealthy - vLLM Runtime Health Check failed."
            exit 1
          fi
      failureThreshold: 2
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 15
    startupProbe:
      exec:
        command:
        - bash
        - -c
        - |
          # This is needed when the head node has had issues and was restarted.
          # It will wait for the new worker nodes.
          registered_node_count=$(ray status --address ${RAY_ADDRESS} | grep -c node_)
          if [[ ! ${registered_node_count} -ge "${PIPELINE_PARALLEL_SIZE}" ]]; then
            echo "Unhealthy - Registered node count (${registered_node_count}) must not be less than PIPELINE_PARALLEL_SIZE (${PIPELINE_PARALLEL_SIZE})."
            exit 1
          fi

          # Double-check to make sure the model is ready to serve.
          for i in 1 2; do
            # Check model health
            health_check=$(curl -o /dev/null -s -w "%{http_code}\n" http://localhost:8080/health)
            if [[ ${health_check} != 200 ]]; then
              echo "Unhealthy - vLLM Runtime Health Check failed."
              exit 1
            fi
          done
      failureThreshold: 40
      initialDelaySeconds: 20
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 30
    volumeMounts:
    - mountPath: /dev/shm
      name: shm
    - mountPath: /etc/ray/tls
      name: ray-tls
  multiModel: false
  supportedModelFormats:
  - autoSelect: true
    name: vLLM
    priority: 2
  volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 12Gi
    name: shm
  - emptyDir: {}
    name: ray-tls
  - name: ray-tls-secret
    secret:
      secretName: ray-tls
  workerSpec:
    containers:
    - args:
      - |
        SECONDS=0

        while true; do
          if (( SECONDS <= 240 )); then
            if ray health-check --address "${HEAD_SVC}.${POD_NAMESPACE}.svc.cluster.local:6379" > /dev/null 2>&1; then
              echo "Global Control Service(GCS) is ready."
              break
            fi
            echo "$SECONDS seconds elapsed: Waiting for Global Control Service(GCS) to be ready."
          else
            if ray health-check --address "${HEAD_SVC}.${POD_NAMESPACE}.svc.cluster.local:6379"; then
              echo "Global Control Service(GCS) is ready. Any error messages above can be safely ignored."
              break
            fi
            echo "$SECONDS seconds elapsed: Still waiting for Global Control Service(GCS) to be ready."
            echo "For troubleshooting, refer to the FAQ at https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting/troubleshooting.html#kuberay-troubleshootin-guides"
          fi

          sleep 5
        done

        export RAY_HEAD_ADDRESS="${HEAD_SVC}.${POD_NAMESPACE}.svc.cluster.local:6379"
        echo "Attempting to connect to Ray cluster at $RAY_HEAD_ADDRESS ..."
        ray start --address="${RAY_HEAD_ADDRESS}" --block
      command:
      - bash
      - -c
      env:
      - name: RAY_USE_TLS
        value: "1"
      - name: RAY_TLS_SERVER_CERT
        value: /etc/ray/tls/tls.pem
      - name: RAY_TLS_SERVER_KEY
        value: /etc/ray/tls/tls.pem
      - name: RAY_TLS_CA_CERT
        value: /etc/ray/tls/ca.crt
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: POD_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      image: quay.io/modh/vllm@sha256:79e1f24bba1d3e694f47f66ba9f8184e70310a10b77bf11c0febd0c926234950
      livenessProbe:
        exec:
          command:
          - bash
          - -c
          - |
            # Check that the registered node count matches PIPELINE_PARALLEL_SIZE
            registered_node_count=$(ray status --address ${HEAD_SVC}.${POD_NAMESPACE}.svc.cluster.local:6379 | grep -c node_)
            if [[ ! ${registered_node_count} -ge "${PIPELINE_PARALLEL_SIZE}" ]]; then
              echo "Unhealthy - Registered node count (${registered_node_count}) must not be less than PIPELINE_PARALLEL_SIZE (${PIPELINE_PARALLEL_SIZE})."
              exit 1
            fi
        failureThreshold: 2
        periodSeconds: 5
        successThreshold: 1
        timeoutSeconds: 15
      name: worker-container
      resources:
        limits:
          cpu: "16"
          memory: 48Gi
        requests:
          cpu: "8"
          memory: 24Gi
      startupProbe:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            registered_node_count=$(ray status --address ${HEAD_SVC}.${POD_NAMESPACE}.svc.cluster.local:6379 | grep -c node_)
            if [[ ! ${registered_node_count} -ge "${PIPELINE_PARALLEL_SIZE}" ]]; then
              echo "Unhealthy - Registered node count (${registered_node_count}) must not be less than PIPELINE_PARALLEL_SIZE (${PIPELINE_PARALLEL_SIZE})."
              exit 1
            fi

            # Double-check to make sure the model is ready to serve.
            for i in 1 2; do
              # Check model health
              model_health_check=$(curl -s ${HEAD_SVC}.${POD_NAMESPACE}.svc.cluster.local:8080/v1/models | grep -o ${ISVC_NAME})
              if [[ ${model_health_check} != "${ISVC_NAME}" ]]; then
                echo "Unhealthy - vLLM Runtime Health Check failed."
                exit 1
              fi
              sleep 10
            done
        failureThreshold: 40
        initialDelaySeconds: 20
        periodSeconds: 30
        successThreshold: 1
        timeoutSeconds: 30
      volumeMounts:
      - mountPath: /dev/shm
        name: shm
      - mountPath: /etc/ray/tls
        name: ray-tls
    pipelineParallelSize: 2
    tensorParallelSize: 2
    volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 12Gi
      name: shm
    - emptyDir: {}
      name: ray-tls
    - name: ray-tls-secret
      secret:
        secretName: ray-tls
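A quick word on how the two parallelism knobs interact, as we understand the KServe multi-node convention: tensorParallelSize is the number of GPUs each pod gets, while pipelineParallelSize is the total number of Ray nodes, head included (it is the value the probes above compare the registered node count against). Back-of-the-envelope for our settings:

# tensorParallelSize   = GPUs per pod               = 2
# pipelineParallelSize = Ray nodes (head + workers) = 2
# Total GPUs      = tensorParallelSize * pipelineParallelSize = 2 * 2 = 4
# Worker replicas should be pipelineParallelSize - 1 = 1 (the head is the predictor pod)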
Once that's done, we need to create the InferenceService via the CLI, as this kind of deployment is unavailable through the UI:
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: vllm-multi-node-llama
    serving.kserve.io/autoscalerClass: external
    serving.kserve.io/deploymentMode: RawDeployment
  labels:
    networking.kserve.io/visibility: exposed
    opendatahub.io/dashboard: "true"
  name: vllm-multi-node-llama
  namespace: vllm
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      args:
      - --max-model-len=100000
      modelFormat:
        name: vLLM
      name: ""
      resources:
        limits:
          cpu: "16"
          memory: 48Gi
          nvidia.com/gpu: "2"
        requests:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "2"
      runtime: vllm-multinode-runtime
      storageUri: pvc://llama-model/Llama-3.3-70B-Instruct-quantized.w4a16
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
    workerSpec:
      tensorParallelSize: 2
      pipelineParallelSize: 2
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
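Save the manifest (we'll call the file vllm-multi-node-llama.yaml, the name is ours), apply it, and watch the rollout; the label selector below is the one KServe puts on predictor pods:

oc apply -f vllm-multi-node-llama.yaml
oc get pods -n vllm -l serving.kserve.io/inferenceservice=vllm-multi-node-llama -w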
After that, we'll have our two pods starting a Ray cluster; once they both join it, vLLM will start and the model will be served.
vllm-multi-node-llama-predictor-5484cb9cfc-lgzd9 (head node):
2025-07-16 13:47:37,113 INFO usage_lib.py:441 -- Usage stats collection is disabled.
2025-07-16 13:47:37,114 INFO scripts.py:971 -- Local node IP: 10.130.2.44
2025-07-16 13:47:39,012 SUCC scripts.py:1007 -- --------------------
2025-07-16 13:47:39,012 SUCC scripts.py:1008 -- Ray runtime started.
2025-07-16 13:47:39,012 SUCC scripts.py:1009 -- --------------------
2025-07-16 13:47:39,012 INFO scripts.py:1011 -- Next steps
2025-07-16 13:47:39,012 INFO scripts.py:1014 -- To add another node to this Ray cluster, run
2025-07-16 13:47:39,012 INFO scripts.py:1017 -- ray start --address='10.130.2.44:6379'
2025-07-16 13:47:39,012 INFO scripts.py:1026 -- To connect to this Ray cluster:
2025-07-16 13:47:39,012 INFO scripts.py:1028 -- import ray
2025-07-16 13:47:39,012 INFO scripts.py:1029 -- ray.init()
2025-07-16 13:47:39,012 INFO scripts.py:1060 -- To terminate the Ray runtime, run
2025-07-16 13:47:39,012 INFO scripts.py:1061 -- ray stop
2025-07-16 13:47:39,012 INFO scripts.py:1064 -- To view the status of the cluster, use
2025-07-16 13:47:39,012 INFO scripts.py:1065 -- ray status
Waiting...
======== Autoscaler status: 2025-07-16 13:47:43.159225 ========
Node status
---------------------------------------------------------------
Active:
1 node_47c67b04c64a3da5738a91410e3ed893c7c0792104a52da130564287
1 node_f266b870619c7f307db957e9c57d74934e5f7a2a4fe9b381b9357e68
[...]
INFO 07-16 13:47:56 [__init__.py:239] Automatically detected platform cuda.
INFO 07-16 13:47:58 [api_server.py:1034] vLLM API server version 0.8.5.dev411+g7ad990749
vllm-multi-node-llama-predictor-worker-8b97c66f9-7m29v (worker node):
Global Control Service(GCS) is ready.
Attempting to connect to Ray cluster at vllm-multi-node-llama-head-1.vllm.svc.cluster.local:6379 ...
2025-07-16 13:47:40,785 INFO scripts.py:1152 -- Local node IP: 10.131.2.61
2025-07-16 13:47:40,957 SUCC scripts.py:1168 -- --------------------
2025-07-16 13:47:40,957 SUCC scripts.py:1169 -- Ray runtime started.
2025-07-16 13:47:40,957 SUCC scripts.py:1170 -- --------------------
2025-07-16 13:47:40,957 INFO scripts.py:1172 -- To terminate the Ray runtime, run
2025-07-16 13:47:40,957 INFO scripts.py:1173 -- ray stop
2025-07-16 13:47:40,958 INFO scripts.py:1181 -- --block
2025-07-16 13:47:40,958 INFO scripts.py:1182 -- This command will now block forever until terminated by a signal.
2025-07-16 13:47:40,958 INFO scripts.py:1185 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
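With both nodes registered and vLLM up, the OpenAI-compatible API is listening on port 8080. A minimal smoke test, exec'ing into the head deployment (curl is already in the image, since the probes use it); the served model name should match the InferenceService name, which is what the worker startup probe greps for:

# List the served models (same endpoint the worker startup probe checks)
oc exec -n vllm deploy/vllm-multi-node-llama-predictor -- \
  curl -s http://localhost:8080/v1/models

# Run a single completion against it
oc exec -n vllm deploy/vllm-multi-node-llama-predictor -- \
  curl -s http://localhost:8080/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "vllm-multi-node-llama", "prompt": "Hello", "max_tokens": 20}'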