Lab Guide: Advanced GPU Quota Management and Preemption with Kueue

This lab guide will walk you through setting up a sophisticated resource management scenario on OpenShift using Kueue. You will configure quotas for different teams sharing a common pool of resources and demonstrate how a high-priority team can preempt a lower-priority team’s workload to guarantee access to critical GPUs.

This lab uses a realistic setup with two teams: team-a (high-priority, requires GPUs) and team-b (low-priority, CPU-only), which are part of the same resource-sharing cohort.

1. Problem Statement

In a multi-tenant cluster, managing shared resources presents two major challenges:

  1. Resource Contention: When the cluster is under heavy load, critical, high-priority jobs (e.g., production training for Team A) might get stuck waiting for resources consumed by lower-priority jobs (e.g., development experiments for Team B).

  2. Inefficient Resource Sharing: Different teams have varying needs. A mechanism is required to allow teams to borrow idle resources without disrupting another team’s ability to reclaim them when needed.

2. Solution Overview

This lab demonstrates a solution using Kueue to implement a robust system for quota management and preemption.

  1. Cohort-Based Sharing: Both team-a and team-b will be placed into a single cohort, allowing them to draw from a common pool of resources defined in a shared ClusterQueue (shared-cq).

  2. Dedicated and Borrowable Quotas: team-a-cq will have a guaranteed quota (nominalQuota) for nvidia.com/gpu resources, while team-b-cq will not. Both will borrow CPU and memory from the shared-cq pool.

  3. Priority and Preemption: team-a-cq will be configured with a preemption policy. If Team A submits a job and the cohort lacks sufficient CPU resources, Kueue will find and preempt a lower-priority workload from Team B to free up capacity.

By the end of this lab, you will have deployed a lower-priority RayCluster, watched it run, and then deployed a higher-priority RayCluster that successfully preempts it.
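
The preemption you will trigger comes down to simple arithmetic over the cohort’s shared CPU pool; the figures below come directly from the manifests applied later in this lab:

    shared-cq CPU pool (cohort team-ab):           6 cores
    Team B raycluster-dev  (head 2 + worker 2):    4 cores -> admitted, 2 cores left
    Team A raycluster-prod (head 2 + worker 2):    4 cores -> 2 cores short
                                                   -> Kueue preempts raycluster-dev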

3. Prerequisites

  1. OpenShift AI Operator: Ensure the OpenShift AI Operator is installed.

  2. GPU Worker Node: You need at least one GPU-enabled worker node in your cluster.

  3. GPU Node Taint: The GPU node must be tainted to reserve it for GPU workloads.

This was done during the bootstrap process. If you need to reapply the taint, use this command:

oc adm taint nodes <your-gpu-node-name> nvidia.com/gpu=Exists:NoSchedule --overwrite
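
To double-check the node setup, you can list the GPU nodes and their taints. This assumes your GPU nodes carry the nvidia.com/gpu.present=true label, which the gpu-flavor defined later in this lab also relies on:

    oc get nodes -l nvidia.com/gpu.present=true \
      -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints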

4. Lab Steps

4.1. 1. Configure the Multi-Team Environment

First, apply all the necessary configuration objects. This includes namespaces, resource flavors, and the Kueue queues with the correct quotas and preemption policies.

  1. Run the following command in the terminal to create the namespaces and resources needed for this lab.

    cat <<EOF | oc create -f -
    apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        kubernetes.io/metadata.name: team-a
        opendatahub.io/dashboard: "true"
        kueue.openshift.io/managed: "true"
      name: team-a
    ---
    apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        kubernetes.io/metadata.name: team-b
        opendatahub.io/dashboard: "true"
        kueue.openshift.io/managed: "true"
      name: team-b
    ---
    kind: RoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: edit
      namespace: team-a
    subjects:
      - kind: ServiceAccount
        name: default
        namespace: team-a
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: edit
    ---
    kind: RoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: edit
      namespace: team-b
    subjects:
      - kind: ServiceAccount
        name: default
        namespace: team-b
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: edit
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: default-flavor
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: gpu-flavor
    spec:
      nodeLabels:
        nvidia.com/gpu.present: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: "shared-cq"
    spec:
      preemption:
        reclaimWithinCohort: Any
        borrowWithinCohort:
          policy: LowerPriority
          maxPriorityThreshold: 100
        withinClusterQueue: Never
      namespaceSelector: {} # match all.
      cohort: "team-ab"
      resourceGroups:
      - coveredResources:
        - cpu
        - memory
        flavors:
        - name: "default-flavor"
          resources:
          - name: "cpu"
            nominalQuota: 6 # This is the shared pool for the cohort
          - name: "memory"
            nominalQuota: 16Gi
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: team-a-cq
    spec:
      preemption:
        reclaimWithinCohort: Any
        borrowWithinCohort:
          policy: LowerPriority # Preempt lower-priority workloads in the cohort
          maxPriorityThreshold: 100
        withinClusterQueue: LowerPriority # Preempt lower-priority workloads within this ClusterQueue
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: team-a
      queueingStrategy: BestEffortFIFO
      cohort: team-ab
      resourceGroups:
      - coveredResources:
        - cpu
        - memory
        flavors:
        - name: default-flavor
          resources:
          - name: cpu
            nominalQuota: 0 # Must borrow CPU from the cohort
          - name: memory
            nominalQuota: 0
      - coveredResources:
        - nvidia.com/gpu
        flavors:
        - name: gpu-flavor
          resources:
          - name: nvidia.com/gpu
            nominalQuota: "1"  # Guaranteed GPU quota for Team A
    
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: team-b-cq
    spec:
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: team-b
      queueingStrategy: BestEffortFIFO
      cohort: team-ab
      resourceGroups:
      - coveredResources:
        - nvidia.com/gpu
        flavors:
        - name: gpu-flavor
          resources:
          - name: nvidia.com/gpu
            nominalQuota: "0" # No GPU quota for Team B
            borrowingLimit: "0"
      - coveredResources:
        - cpu
        - memory
        flavors:
        - name: default-flavor
          resources:
          - name: cpu
            nominalQuota: 0 # Must borrow CPU from the cohort
          - name: memory
            nominalQuota: 0
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      name: local-queue
      namespace: team-a
    spec:
      clusterQueue: team-a-cq
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      name: local-queue
      namespace: team-b
    spec:
      clusterQueue: team-b-cq
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: WorkloadPriorityClass
    metadata:
      name: prod-priority
    value: 1000
    description: "Priority class for prod jobs"
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: WorkloadPriorityClass
    metadata:
      name: dev-priority
    value: 100
    description: "Priority class for development jobs"
    EOF
  2. Verify the setup by checking the ClusterQueue objects.

    oc get cq

    You should see team-a-cq, team-b-cq, and shared-cq listed with a status of Active.
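
    Optionally, inspect one of the queues in more detail to confirm the cohort, flavors, and quotas match the manifest, and check that both LocalQueues were created:

    oc describe clusterqueue team-a-cq
    oc get localqueue -A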

4.2. 2. Deploy the Low-Priority Workload (Team B)

Now, acting as Team B, submit a RayCluster. It requests 4 CPU cores in total (2 for the head node and 2 for the worker), which leaves only 2 of the cohort’s 6 shared CPUs free.

  1. Create the Team B RayCluster by running the following command in the terminal.

    cat <<EOF | oc create -f -
    # Team B is using dev-priority
    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: local-queue
        kueue.x-k8s.io/priority-class: dev-priority # Lower priority
      name: raycluster-dev
      namespace: team-b
    spec:
      rayVersion: 2.7.0
      headGroupSpec:
        template:
          spec:
            containers:
            - name: ray-head
              image: quay.io/project-codeflare/ray:2.20.0-py39-cu118
              resources:
                limits: { cpu: "2", memory: 3G }
                requests: { cpu: "2", memory: 3G }
        rayStartParams: {}
      workerGroupSpecs:
      - groupName: worker-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        template:
          spec:
            containers:
            - name: machine-learning
              image: quay.io/project-codeflare/ray:2.20.0-py39-cu118
              resources:
                limits: { cpu: "2", memory: 3G }
                requests: { cpu: "2", memory: 3G }
        rayStartParams: {}
    EOF
  2. Verify that the job is admitted and running.

    Check the Kueue workload status; ADMITTED should be True.

    oc get workload -n team-b

    Check that the pods are Running.

    oc get pods -n team-b -w

At this point, Team B’s job has claimed 4 of the 6 CPUs in the shared cohort, leaving only 2 CPUs available for other workloads.
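
If you want to see this usage reflected in Kueue itself, describe Team B’s ClusterQueue and look at the flavor usage and borrowing information in its Status section (the exact field layout may vary slightly between Kueue versions):

    oc describe clusterqueue team-b-cq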

4.3. 3. Deploy the High-Priority Workload (Team A)

Next, acting as Team A, submit a RayCluster that requires a GPU and 4 CPU cores. Since only 2 CPUs remain in the shared pool, Kueue must preempt Team B’s job to admit it.

  1. Create the Team A RayCluster by running the following command in the terminal.

    cat <<EOF | oc create -f -
    # Team A uses prod-priority and will preempt Team B's workload because the shared-cq CPU quota is insufficient
    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: local-queue
        kueue.x-k8s.io/priority-class: prod-priority # Higher priority
      name: raycluster-prod
      namespace: team-a
    spec:
      rayVersion: 2.7.0
      headGroupSpec:
        template:
          spec:
            containers:
            - name: ray-head
              image: quay.io/project-codeflare/ray:2.20.0-py39-cu118
              resources:
                limits: { cpu: "2", memory: 3G }
                requests: { cpu: "2", memory: 3G }
        rayStartParams: {}
      workerGroupSpecs:
      - groupName: worker-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        template:
          spec:
            containers:
            - name: machine-learning
              image: quay.io/project-codeflare/ray:2.20.0-py39-cu118
              resources:
                limits: { cpu: "2", memory: 3G, "nvidia.com/gpu": "1" }
                requests: { cpu: "2", memory: 3G, "nvidia.com/gpu": "1" }
            tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
        rayStartParams: {}
    EOF

4.4. 4. Observe and Verify Preemption

This is the key part of the lab. We will watch as Kueue automatically evicts Team B’s workload.

  1. Watch the status of the workloads in both namespaces. The change should happen within a minute.

    oc get workload -A -w

    You will see the raycluster-dev workload in team-b switch its ADMITTED status from True to False. Shortly after, the raycluster-prod workload in team-a will switch its ADMITTED status to True.

  2. Check the pods in both namespaces.

    Team B’s pods should now be in the Terminating state.

    oc get pods -n team-b -w

    Team A’s pods should be in the ContainerCreating or Running state.

    oc get pods -n team-a
  3. To see the explicit preemption message, describe Team B’s workload. Replace the workload name below with the one returned by oc get workload -n team-b.

    oc describe workload -n team-b raycluster-raycluster-dev

    Look for the Events section at the bottom. You will see a clear message stating that the workload was Evicted because it was preempted by the higher-priority workload.

    Example Event Output
    Events:
      Type     Reason         Age    From             Message
      ----     ------         ----   ----             -------
      Normal   Preempted      2m16s  kueue-admission  Preempted to accommodate a workload (UID: 8b76853e-b03f-4dee-a57e-0a9157b5c8a3, JobUID: 4a7827c1-20c9-461e-b369-5e5d029630ff) due to reclamation within the cohort while borrowing
      Warning  Pending        103s   kueue-admission  Workload no longer fits after processing another workload
      Warning  Pending        103s   kueue-admission  couldn't assign flavors to pod set worker-group: insufficient unused quota for cpu in flavor default-flavor, 2 more needed

4.5. Cleanup

To clean up all the resources created during this lab, delete the namespaces and then remove the cluster-scoped Kueue objects.

  1. Delete the namespaces, which will also remove the RayClusters and other namespaced objects.

    oc delete ns team-a team-b
  2. Delete the remaining Kueue objects by running the following cleanup script (a note on the priority classes follows it).

    #!/bin/sh
    
    echo "Deleting all rayclusters"
    oc delete raycluster --all --all-namespaces > /dev/null
    
    echo "Deleting all localqueues"
    oc delete localqueue --all --all-namespaces > /dev/null
    
    echo "Deleting all clusterqueues (cluster-scoped)"
    oc delete clusterqueue --all > /dev/null
    
    echo "Deleting all resourceflavors (cluster-scoped)"
    oc delete resourceflavor --all > /dev/null
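
    Note that the script above does not remove the two WorkloadPriorityClass objects created during setup, so delete them explicitly:

    oc delete workloadpriorityclass prod-priority dev-priority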

5. Conclusion

You have successfully demonstrated a sophisticated resource management scenario using Kueue. You configured a shared resource cohort for two teams with different priorities, and verified that Kueue’s preemption mechanism works as expected, allowing a high-priority workload to claim resources from a running, lower-priority workload.

This powerful capability is crucial for managing expensive resources like GPUs efficiently and fairly in a multi-tenant AI/ML platform.
