Red Hat build of Kueue Operator Setup
This lab guides you through the installation and basic configuration of the Red Hat build of Kueue, a Kubernetes-native job queueing system. Kueue is essential for advanced GPU quota management, enabling fair resource sharing and workload prioritization, including preemption, within your OpenShift cluster.
1. Prerequisites
- OpenShift AI Operator: Ensure the OpenShift AI Operator is installed on your cluster.
- GPU Worker Node: You need at least one worker node with an NVIDIA A10G GPU. On AWS, a g5.2xlarge instance is suitable.
- GPU Node Taint: The GPU node must be tainted to ensure only GPU-tolerant workloads are scheduled on it.
This taint was applied during the bootstrap process in the previous lab. If you need to reapply it, taint the node again with oc adm taint nodes, using your GPU node's name and the same taint as in that lab.
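Before continuing, it can help to confirm which nodes actually carry a taint. The helper below is a minimal sketch: the function name list_node_taints and the column headers are illustrative assumptions, and it only reads node objects without changing them.

```shell
# List every node with the keys and effects of its taints.
# Nodes without taints show <none> in both columns.
list_node_taints() {
  oc get nodes \
    -o custom-columns='NAME:.metadata.name,TAINT_KEYS:.spec.taints[*].key,EFFECTS:.spec.taints[*].effect'
}
```

Call list_node_taints and look for your GPU node in the output; its taint key should match the one applied in the previous lab.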
2. Install Red Hat build of Kueue Operator
This section details the steps to install the Red Hat build of Kueue Operator.
Applying YAML Snippets
The following snippets use a shell heredoc (cat <<EOF | oc apply -f -) to apply each manifest directly from your terminal.
- Create the openshift-kueue-operator namespace with cluster monitoring enabled.
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-kueue-operator
  annotations:
    openshift.io/description: "openshift-kueue-operator"
    openshift.io/display-name: "openshift-kueue-operator"
    openshift.io/requester: ""
  labels:
    openshift.io/cluster-monitoring: "true"
spec:
  finalizers:
  - kubernetes
EOF
- Create the Subscription for the Kueue operator in the newly created namespace.
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/kueue-operator.openshift-kueue-operator: ""
  name: kueue-operator
  namespace: openshift-kueue-operator
spec:
  channel: stable-v1.0
  installPlanApproval: Automatic
  name: kueue-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: kueue-operator.v1.0.1
EOF
- Create the OperatorGroup to install the Kueue operator from redhat-operators.
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  annotations:
    olm.providedAPIs: Kueue.v1.kueue.openshift.io
  name: openshift-kueue-operator
  namespace: openshift-kueue-operator
spec:
  upgradeStrategy: Default
status:
  namespaces:
  - ""
EOF
- Wait for the Kueue operator pod to be fully deployed and running.
oc get pods -n openshift-kueue-operator -w
You should see output similar to this, indicating the controller pod is Running:
NAME                            READY   STATUS    RESTARTS   AGE
kueue-controller-xxxxxx-yyyyy   1/1     Running   0          2m
- Create the global Kueue custom resource (CR) named cluster. This configures Kueue's overall behavior, including enabling preemption with a Classical policy.
cat <<EOF | oc apply -f -
apiVersion: kueue.openshift.io/v1
kind: Kueue
metadata:
  labels:
    app.kubernetes.io/name: kueue-operator
    app.kubernetes.io/managed-by: kustomize
  name: cluster
  namespace: openshift-kueue-operator
spec:
  managementState: Managed
  config:
    integrations:
      frameworks:
      - BatchJob
      - MPIJob
      - RayJob
      - RayCluster
      - JobSet
      - Pod
      - PaddleJob
      - PyTorchJob
      - TFJob
      - XGBoostJob
      - Deployment
      - AppWrapper
    preemption:
      preemptionPolicy: Classical
EOF
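Watching pods with -w works interactively but never exits on its own. For scripted runs, a polling helper along these lines may be more convenient. It is a minimal sketch: the function name wait_for_operator_pods, the default of 30 attempts, and the 10s per-attempt timeout are illustrative assumptions; oc wait and its flags are standard CLI features.

```shell
# Poll until every pod in the namespace reports the Ready condition.
# Usage: wait_for_operator_pods [namespace] [attempts]
wait_for_operator_pods() {
  ns="${1:-openshift-kueue-operator}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # oc wait exits non-zero on timeout or when no pods exist yet, so we retry.
    if oc wait pod --all --for=condition=Ready -n "$ns" --timeout=10s >/dev/null 2>&1; then
      echo "pods in $ns are Ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out waiting for pods in $ns" >&2
  return 1
}
```

Run wait_for_operator_pods with no arguments to use the defaults. Note that completed pods never become Ready, so this check suits namespaces that contain only long-running pods.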
After these steps, the Red Hat build of Kueue is installed and running in your cluster. You can verify its status in the OpenShift Web Console by navigating to Operators → Installed Operators → Red Hat build of Kueue → Kueue → Cluster.
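The console check has a rough command-line counterpart: once Kueue is running, the cluster should serve resource types under the kueue.x-k8s.io API group. The sketch below assumes that group name (it matches the RBAC rules used later in this lab); the function names list_kueue_resources and has_kueue_resource are hypothetical helpers.

```shell
# Print the resource types served under the kueue.x-k8s.io API group,
# one per line in the form <plural>.kueue.x-k8s.io.
list_kueue_resources() {
  oc api-resources --api-group=kueue.x-k8s.io -o name
}

# Succeed only if the given plural resource name is served by the cluster.
# Usage: has_kueue_resource clusterqueues
has_kueue_resource() {
  list_kueue_resources | grep -qxF "$1.kueue.x-k8s.io"
}
```

For example, has_kueue_resource clusterqueues && echo "ClusterQueue API available" gives a quick yes/no answer that can be used in a script.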

3. Install Kueue Visualization
The Operator does not have a dashboard yet, so this section deploys the KueueViz backend and frontend manually.
Note: Some users might experience WebSocket issues.
First apply the following configuration:
cat <<EOF | oc apply -f -
kind: Project
apiVersion: project.openshift.io/v1
metadata:
  name: kueue-system
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
---
# Source: kueue/templates/kueueviz/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: 'kueue-kueueviz-backend-read-access'
  namespace: 'kueue-system'
rules:
  - apiGroups: ["kueue.x-k8s.io"]
    resources: ["workloads", "clusterqueues", "localqueues", "resourceflavors"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods", "events", "nodes"]
    verbs: ["get", "list", "watch"]
  # RBAC rules match on the plural resource name.
  - apiGroups: ["kueue.x-k8s.io"]
    resources: ["workloadpriorityclasses"]
    verbs: ["get", "list", "watch"]
---
# Source: kueue/templates/kueueviz/cluster-role-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: 'kueue-kueueviz-backend-read-access-binding'
  namespace: 'kueue-system'
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: 'kueue-kueueviz-backend-read-access'
subjects:
  - kind: ServiceAccount
    name: default
    namespace: 'kueue-system'
---
# Source: kueue/templates/kueueviz/backend-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: 'kueue-kueueviz-backend'
  namespace: 'kueue-system'
spec:
  type: ClusterIP
  ports:
    - port: 8080
      targetPort: 8080
  selector:
    app: kueueviz-backend
---
# Source: kueue/templates/kueueviz/frontend-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: 'kueue-kueueviz-frontend'
  namespace: 'kueue-system'
spec:
  type: ClusterIP
  ports:
    - port: 8080
      targetPort: 8080
  selector:
    app: kueueviz-frontend
---
# Source: kueue/templates/kueueviz/backend-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: 'kueue-kueueviz-backend'
  namespace: 'kueue-system'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kueueviz-backend
  template:
    metadata:
      labels:
        app: kueueviz-backend
    spec:
      containers:
        - name: backend
          image: 'registry.k8s.io/kueue/kueueviz-backend:v0.13.4'
          imagePullPolicy: 'IfNotPresent'
          ports:
            - containerPort: 8080
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 500m
              memory: 512Mi
---
# Source: kueue/templates/kueueviz/frontend-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: 'kueue-kueueviz-frontend'
  namespace: 'kueue-system'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kueueviz-frontend
  template:
    metadata:
      labels:
        app: kueueviz-frontend
    spec:
      containers:
        - name: frontend
          image: 'registry.k8s.io/kueue/kueueviz-frontend:v0.13.4'
          imagePullPolicy: 'IfNotPresent'
          ports:
            - containerPort: 8080
          env:
            - name: REACT_APP_WEBSOCKET_URL
              value: 'wss://backend.kueueviz.local'
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 500m
              memory: 512Mi
EOF
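The ClusterRoleBinding in this configuration grants read access to the default service account in kueue-system. One way to sanity-check that binding is to ask the API server what that service account may do. The helper below is a sketch under two assumptions: the function name check_viz_rbac is hypothetical, and the user running it is allowed to impersonate service accounts (cluster administrators typically are).

```shell
# Ask the API server whether the kueue-system default service account
# can list each resource type across all namespaces.
# Prints one "<resource>: yes|no" line per resource.
check_viz_rbac() {
  sa="system:serviceaccount:kueue-system:default"
  for resource in \
    workloads.kueue.x-k8s.io \
    clusterqueues.kueue.x-k8s.io \
    localqueues.kueue.x-k8s.io \
    resourceflavors.kueue.x-k8s.io \
    pods events nodes
  do
    # Warnings go to stderr; only the yes/no answer is captured.
    answer="$(oc auth can-i list "$resource" --all-namespaces --as="$sa" 2>/dev/null)"
    echo "$resource: ${answer:-unknown}"
  done
}
```

Any line ending in no or unknown suggests the role or binding did not apply as intended and is worth inspecting before starting the pods.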
Wait for the pods in the kueue-system namespace to reach the Running state. You can check the status with:
oc get pods -n kueue-system
You should see the kueue-kueueviz-backend and kueue-kueueviz-frontend pods reported as Running.
If the pods cannot be scheduled, try increasing the number of worker nodes in your cluster.
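As an alternative to checking pod status by hand, the rollout of both Deployments can be awaited from a script. This is a minimal sketch: the function name wait_for_kueueviz and the 180s default timeout are illustrative assumptions, while the Deployment names come from the configuration applied earlier in this section.

```shell
# Wait for the KueueViz backend and frontend Deployments to finish rolling out.
# Usage: wait_for_kueueviz [timeout]   (for example: wait_for_kueueviz 300s)
wait_for_kueueviz() {
  timeout="${1:-180s}"
  for deploy in kueue-kueueviz-backend kueue-kueueviz-frontend; do
    # rollout status exits non-zero if the rollout fails or times out.
    oc -n kueue-system rollout status "deployment/$deploy" --timeout="$timeout" || return 1
  done
}
```

If the function returns a non-zero status, oc -n kueue-system describe pod usually shows whether scheduling or image pulls are the cause.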
Then, you can access the Kueue Visualization UI by port-forwarding the backend and frontend services to your local machine. Run the following commands in your terminal:
oc -n kueue-system port-forward svc/kueue-kueueviz-backend 8080:8080 &
oc -n kueue-system set env deployment kueue-kueueviz-frontend REACT_APP_WEBSOCKET_URL=ws://localhost:8080
oc -n kueue-system port-forward svc/kueue-kueueviz-frontend 3000:8080
Open http://localhost:3000/ in the browser.