Lab Guide: Advanced GPU Quota Management and Preemption with Kueue

This lab guide will walk you through setting up a sophisticated resource management scenario on OpenShift using Kueue. You will configure quotas for different teams sharing a common pool of resources and demonstrate how a high-priority team can preempt a lower-priority team’s workload to guarantee access to critical GPUs.

This lab uses a realistic setup with two teams: team-a (high-priority, requires GPUs) and team-b (low-priority, CPU-only), which are part of the same resource-sharing cohort.

1. Problem Statement

In a multi-tenant cluster, managing shared resources presents two major challenges:

  1. Resource Contention: When the cluster is under heavy load, critical, high-priority jobs (e.g., production training for Team A) might get stuck waiting for resources consumed by lower-priority jobs (e.g., development experiments for Team B).

  2. Inefficient Resource Sharing: Different teams have varying needs. A mechanism is required to allow teams to borrow idle resources without disrupting another team’s ability to reclaim them when needed.

2. Solution Overview

This lab demonstrates a solution using Kueue to implement a robust system for quota management and preemption.

  1. Cohort-Based Sharing: Both team-a and team-b will be placed into a single cohort, allowing them to draw from a common pool of resources defined in a shared ClusterQueue (shared-cq).

  2. Dedicated and Borrowable Quotas: team-a-cq will have a guaranteed quota (nominalQuota) for nvidia.com/gpu resources, while team-b-cq will not. Both will borrow CPU and memory from the shared-cq pool.

  3. Priority and Preemption: team-a-cq will be configured with a preemption policy. If Team A submits a job and the cohort lacks sufficient CPU resources, Kueue will find and preempt a lower-priority workload from Team B to free up capacity.

By the end of this lab, you will have deployed a lower-priority RayCluster, watched it run, and then deployed a higher-priority RayCluster that successfully preempts it.
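
The preemption you will trigger comes down to simple arithmetic over the cohort’s shared CPU pool; the figures below come directly from the manifests applied later in this lab:

    shared-cq CPU pool (cohort team-ab):           6 cores
    Team B raycluster-dev  (head 2 + worker 2):    4 cores -> admitted, 2 cores left
    Team A raycluster-prod (head 2 + worker 2):    4 cores -> 2 cores short
                                                   -> Kueue preempts raycluster-dev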

3. Prerequisites

  1. OpenShift AI Operator: Ensure the OpenShift AI Operator is installed.

  2. GPU Worker Node: You need at least one GPU-enabled worker node in your cluster.

  3. GPU Node Taint: The GPU node must be tainted to reserve it for GPU workloads.

This was done during the bootstrap process. If you need to reapply the taint, use this command:

oc adm taint nodes <your-gpu-node-name> nvidia.com/gpu=Exists:NoSchedule --overwrite
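
To double-check the node setup, you can list the GPU nodes and their taints. This assumes your GPU nodes carry the nvidia.com/gpu.present=true label, which the gpu-flavor defined later in this lab also relies on:

    oc get nodes -l nvidia.com/gpu.present=true \
      -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints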

4. Lab Steps

4.1. 1. Configure the Multi-Team Environment

First, apply all the necessary configuration objects. This includes namespaces, resource flavors, and the Kueue queues with the correct quotas and preemption policies.

  1. Run the following command in the terminal to create the namespaces and resources needed for this lab.

    cat <<EOF | oc create -f -
    apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        kubernetes.io/metadata.name: team-a
        opendatahub.io/dashboard: "true"
        kueue.openshift.io/managed: "true"
      name: team-a
    ---
    apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        kubernetes.io/metadata.name: team-b
        opendatahub.io/dashboard: "true"
        kueue.openshift.io/managed: "true"
      name: team-b
    ---
    kind: RoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: edit
      namespace: team-a
    subjects:
      - kind: ServiceAccount
        name: default
        namespace: team-a
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: edit
    ---
    kind: RoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: edit
      namespace: team-b
    subjects:
      - kind: ServiceAccount
        name: default
        namespace: team-b
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: edit
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: default-flavor
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: gpu-flavor
    spec:
      nodeLabels:
        nvidia.com/gpu.present: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: "shared-cq"
    spec:
      preemption:
        reclaimWithinCohort: Any
        borrowWithinCohort:
          policy: LowerPriority
          maxPriorityThreshold: 100
        withinClusterQueue: Never
      namespaceSelector: {} # match all.
      cohort: "team-ab"
      resourceGroups:
      - coveredResources:
        - cpu
        - memory
        flavors:
        - name: "default-flavor"
          resources:
          - name: "cpu"
            nominalQuota: 6 # This is the shared pool for the cohort
          - name: "memory"
            nominalQuota: 16Gi
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: team-a-cq
    spec:
      preemption:
        reclaimWithinCohort: Any
        borrowWithinCohort:
          policy: LowerPriority # Preempt lower-priority workloads in the cohort
          maxPriorityThreshold: 100
        withinClusterQueue: LowerPriority # Preempt lower-priority workloads within this ClusterQueue
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: team-a
      queueingStrategy: BestEffortFIFO
      cohort: team-ab
      resourceGroups:
      - coveredResources:
        - cpu
        - memory
        flavors:
        - name: default-flavor
          resources:
          - name: cpu
            nominalQuota: 0 # Must borrow CPU from the cohort
          - name: memory
            nominalQuota: 0
      - coveredResources:
        - nvidia.com/gpu
        flavors:
        - name: gpu-flavor
          resources:
          - name: nvidia.com/gpu
            nominalQuota: "1"  # Guaranteed GPU quota for Team A
    
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: team-b-cq
    spec:
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: team-b
      queueingStrategy: BestEffortFIFO
      cohort: team-ab
      resourceGroups:
      - coveredResources:
        - nvidia.com/gpu
        flavors:
        - name: gpu-flavor
          resources:
          - name: nvidia.com/gpu
            nominalQuota: "0" # No GPU quota for Team B
            borrowingLimit: "0"
      - coveredResources:
        - cpu
        - memory
        flavors:
        - name: default-flavor
          resources:
          - name: cpu
            nominalQuota: 0 # Must borrow CPU from the cohort
          - name: memory
            nominalQuota: 0
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      name: local-queue
      namespace: team-a
    spec:
      clusterQueue: team-a-cq
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      name: local-queue
      namespace: team-b
    spec:
      clusterQueue: team-b-cq
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: WorkloadPriorityClass
    metadata:
      name: prod-priority
    value: 1000
    description: "Priority class for prod jobs"
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: WorkloadPriorityClass
    metadata:
      name: dev-priority
    value: 100
    description: "Priority class for development jobs"
    EOF
  2. Verify the setup by checking the ClusterQueue objects.

    oc get cq

    You should see team-a-cq, team-b-cq, and shared-cq listed with a status of Active.
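
    Optionally, inspect one of the queues in more detail to confirm the cohort, flavors, and quotas match the manifest, and check that both LocalQueues were created:

    oc describe clusterqueue team-a-cq
    oc get localqueue -A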

4.2. 2. Deploy the Low-Priority Workload (Team B)

Now, acting as Team B, submit a RayCluster. It requests 4 CPU cores in total (2 for the head node and 2 for the worker), which leaves only 2 of the cohort’s 6 shared CPUs free.

  1. Create the Team B RayCluster by running the following command in the terminal.

    cat <<EOF | oc create -f -
    # Team B is using dev-priority
    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: local-queue
        kueue.x-k8s.io/priority-class: dev-priority # Lower priority
      name: raycluster-dev
      namespace: team-b
    spec:
      rayVersion: 2.7.0
      headGroupSpec:
        template:
          spec:
            containers:
            - name: ray-head
              image: quay.io/project-codeflare/ray:2.20.0-py39-cu118
              resources:
                limits: { cpu: "2", memory: 3G }
                requests: { cpu: "2", memory: 3G }
        rayStartParams: {}
      workerGroupSpecs:
      - groupName: worker-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        template:
          spec:
            containers:
            - name: machine-learning
              image: quay.io/project-codeflare/ray:2.20.0-py39-cu118
              resources:
                limits: { cpu: "2", memory: 3G }
                requests: { cpu: "2", memory: 3G }
        rayStartParams: {}
    EOF
  2. Verify that the job is admitted and running.

    Check the Kueue workload status; ADMITTED should be True.

    oc get workload -n team-b

    Check that the pods are Running.

    oc get pods -n team-b -w

At this point, Team B’s job has claimed 4 of the 6 CPUs in the shared cohort, leaving only 2 CPUs available for other workloads.
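
If you want to see this usage reflected in Kueue itself, describe Team B’s ClusterQueue and look at the flavor usage and borrowing information in its Status section (the exact field layout may vary slightly between Kueue versions):

    oc describe clusterqueue team-b-cq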

4.3. 3. Deploy the High-Priority Workload (Team A)

Next, acting as Team A, submit a RayCluster that requires a GPU and 4 CPU cores. Since only 2 CPUs remain in the shared pool, Kueue must preempt Team B’s job to admit it.

  1. Create the Team A RayCluster by running the following command in the terminal.

    cat <<EOF | oc create -f -
    # Team A uses prod-priority and will preempt Team B's workload because the shared-cq CPU quota is insufficient
    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: local-queue
        kueue.x-k8s.io/priority-class: prod-priority # Higher priority
      name: raycluster-prod
      namespace: team-a
    spec:
      rayVersion: 2.7.0
      headGroupSpec:
        template:
          spec:
            containers:
            - name: ray-head
              image: quay.io/project-codeflare/ray:2.20.0-py39-cu118
              resources:
                limits: { cpu: "2", memory: 3G }
                requests: { cpu: "2", memory: 3G }
        rayStartParams: {}
      workerGroupSpecs:
      - groupName: worker-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        template:
          spec:
            containers:
            - name: machine-learning
              image: quay.io/project-codeflare/ray:2.20.0-py39-cu118
              resources:
                limits: { cpu: "2", memory: 3G, "nvidia.com/gpu": "1" }
                requests: { cpu: "2", memory: 3G, "nvidia.com/gpu": "1" }
            tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
        rayStartParams: {}
    EOF

4.4. 4. Observe and Verify Preemption

This is the key part of the lab. We will watch as Kueue automatically evicts Team B’s workload.

  1. Watch the status of the workloads in both namespaces. The change should happen within a minute.

    oc get workload -A -w

    You will see the raycluster-dev workload in team-b switch its ADMITTED status from True to False. Shortly after, the raycluster-prod workload in team-a will switch its ADMITTED status to True.

  2. Check the pods in both namespaces.

    Team B’s pods should now be in the Terminating state.

    oc get pods -n team-b -w

    Team A’s pods should be in the ContainerCreating or Running state.

    oc get pods -n team-a
  3. To see the explicit preemption message, describe Team B’s workload. Replace the workload name below with the one returned by oc get workload -n team-b.

    oc describe workload -n team-b raycluster-raycluster-dev

    Look for the Events section at the bottom. You will see a clear message stating that the workload was Evicted because it was preempted by the higher-priority workload.

    Example Event Output
    Events:
      Type     Reason         Age    From             Message
      ----     ------         ----   ----             -------
      Normal   Preempted      2m16s  kueue-admission  Preempted to accommodate a workload (UID: 8b76853e-b03f-4dee-a57e-0a9157b5c8a3, JobUID: 4a7827c1-20c9-461e-b369-5e5d029630ff) due to reclamation within the cohort while borrowing
      Warning  Pending        103s   kueue-admission  Workload no longer fits after processing another workload
      Warning  Pending        103s   kueue-admission  couldn't assign flavors to pod set worker-group: insufficient unused quota for cpu in flavor default-flavor, 2 more needed

4.5. Cleanup

To clean up all the resources created during this lab, delete the namespaces and then remove the cluster-scoped Kueue objects.

  1. Delete the namespaces, which will also remove the RayClusters and other namespaced objects.

    oc delete ns team-a team-b
  2. Delete the remaining Kueue objects by running the following cleanup script (a note on the priority classes follows it).

    #!/bin/sh
    
    echo "Deleting all rayclusters"
    oc delete raycluster --all --all-namespaces > /dev/null
    
    echo "Deleting all localqueues"
    oc delete localqueue --all --all-namespaces > /dev/null
    
    echo "Deleting all clusterqueues (cluster-scoped)"
    oc delete clusterqueue --all > /dev/null
    
    echo "Deleting all resourceflavors (cluster-scoped)"
    oc delete resourceflavor --all > /dev/null
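
    Note that the script above does not remove the two WorkloadPriorityClass objects created during setup, so delete them explicitly:

    oc delete workloadpriorityclass prod-priority dev-priority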

5. Conclusion

You have successfully demonstrated a sophisticated resource management scenario using Kueue. You configured a shared resource cohort for two teams with different priorities, and verified that Kueue’s preemption mechanism works as expected, allowing a high-priority workload to claim resources from a running, lower-priority workload.

This powerful capability is crucial for managing expensive resources like GPUs efficiently and fairly in a multi-tenant AI/ML platform.
