Lab Introduction: GPU as a Service with GPU slicing and Kueue
This lab explores GPU slicing and its application in workload prioritization. To fully grasp the significance of these topics, a solid understanding of workload sizing is essential. Therefore, we will begin by demonstrating the vRAM calculation required to serve an ibm-granite/granite-3.3-2b-instruct model. This example will underscore the importance of slicing GPUs into smaller, more efficient units.
1. Model Size
We will leverage vLLM (the engine used in the Red Hat Inference Server) and its use of PagedAttention, a technique that optimizes GPU memory management. To maximize the return on investment (ROI) of expensive GPU hardware, it is crucial to understand the precise memory consumption of a model. The following calculation for the ibm-granite/granite-3.3-2b-instruct model demonstrates that an entire GPU is often unnecessary, and without slicing techniques, valuable resources would be wasted.
- Model Weights: This is the memory required to load the model’s parameters into the GPU.
- KV Cache (Key-Value Cache): This is dynamic memory used to store attention keys and values for active sequences. While vLLM’s PagedAttention optimizes this, it still consumes significant memory, especially with high concurrency and long sequence lengths.
1.1. Model Weights
The granite-3.3-2b-instruct model has 2.53 billion parameters. For cleaner calculations, we will approximate this as 2.5B. The memory usage for model weights depends on the data type (precision) you load it in:
- FP16 (Half Precision) or BF16 (bfloat16): Each parameter uses 2 bytes. This is the most common and recommended precision for inference.
  \[2.5 \text{B parameters} \cdot 2 \text{ bytes/parameter} = 5 \text{ GB}\]
- INT8 (8-bit Quantization): Each parameter uses 1 byte.
  \[2.5 \text{B parameters} \cdot 1 \text{ byte/parameter} = 2.5 \text{ GB}\]
vLLM typically defaults to FP16 (or bfloat16 if supported) for an optimal balance of speed and memory. Therefore, the model weights will consume approximately 5 GB.
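To make the arithmetic explicit, here is a minimal Python sketch of the same calculation. The parameter count and data types come from the text above; everything else is plain arithmetic, not a vLLM API call.

# Estimate the memory needed to hold the model weights at different precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

params_in_billions = 2.5  # granite-3.3-2b-instruct, rounded as in the text

for dtype, nbytes in BYTES_PER_PARAM.items():
    gb = params_in_billions * 1e9 * nbytes / 1e9  # parameters * bytes -> GB
    print(f"{dtype:>9}: ~{gb:.1f} GB for model weights")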
Is bfloat16 the same as float16, or is it a new number format? bfloat16 (Brain Floating Point) is a different 16-bit format: it keeps the 8-bit exponent of float32, and therefore its dynamic range, but shortens the mantissa to 7 bits, while float16 uses a 5-bit exponent and a 10-bit mantissa. Both occupy 2 bytes per parameter, so the weight calculation above is identical for either format.
1.2. KV Cache (Key-Value Cache)
The KV Cache memory usage is dynamic and depends on several factors:
- max_model_len (or max_context_len): The maximum sequence length (input prompt + generated output) that the model will process. A longer max_model_len means a larger potential KV Cache.
- Number of Attention Heads and Hidden Dimensions: These are model-specific architectural parameters.
- Batch Size / Concurrent Requests: More concurrent requests mean more KV Cache entries.
- gpu-memory-utilization: vLLM’s parameter that controls the fraction of total GPU memory to be used for the model executor, including the KV Cache. By default, it is 0.9 (90%); the example after this list shows where these options are set when loading the model.
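Below is a minimal, hedged sketch of loading the model with vLLM's offline Python API so you can see where these knobs live in practice. The chosen max_model_len and gpu_memory_utilization values are illustrative assumptions for a small GPU slice, not recommendations; when serving, the equivalent command-line flags are --max-model-len and --gpu-memory-utilization.

# Minimal vLLM example: the memory-related parameters discussed above are set
# explicitly. Values are illustrative assumptions, not tuning advice.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-granite/granite-3.3-2b-instruct",
    dtype="bfloat16",             # 2 bytes per parameter, ~5 GB of weights
    max_model_len=8192,           # caps the per-sequence KV Cache
    gpu_memory_utilization=0.9,   # fraction of GPU memory vLLM may claim
)

outputs = llm.generate(
    ["Explain GPU slicing in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)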
1.3. General Estimation for KV Cache (for granite-3.3-2b-instruct)
While precise calculation requires knowing the exact attention head configuration and desired max_model_len, here’s a rough idea:
- For a 2.5B model, the KV Cache can add a significant amount, often a few gigabytes, and can even exceed the model weights if you’re handling many concurrent, long sequences.
- For example, a 2.5B model processing around 2048 tokens in FP16 might need an additional ~0.16 GB for the KV Cache per sequence (this is a rough estimate and depends heavily on batch size and max_model_len).
The most reliable source for a model’s architectural parameters is its configuration file, usually named config.json, found alongside the model weights on Hugging Face Hub.
For ibm-granite/granite-3.3-2b-instruct, you would look for its config.json file on its Hugging Face model page: granite-3.3-2b-instruct.
granite-3.3-2b-instruct config.json
{
"architectures": [
"GraniteForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"attention_multiplier": 0.015625,
"bos_token_id": 0,
"embedding_multiplier": 12.0,
"eos_token_id": 0,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"logits_scaling": 8.0,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "granite",
"num_attention_heads": 32,
"num_hidden_layers": 40,
"num_key_value_heads": 8,
"pad_token_id": 0,
"residual_multiplier": 0.22,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 10000000.0,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.49.0",
"use_cache": true,
"vocab_size": 49159
}
Math behind the exact estimation of the KV Cache size: for every token, the model stores one key and one value vector per layer and per KV head, so
\[\text{KV bytes per token} = 2 \cdot n_{\text{layers}} \cdot n_{\text{kv heads}} \cdot d_{\text{head}} \cdot \text{bytes per element}\]
From the config.json above: n_layers = num_hidden_layers = 40, n_kv_heads = num_key_value_heads = 8, d_head = hidden_size / num_attention_heads = 2048 / 32 = 64, and 2 bytes per element for bfloat16:
\[2 \cdot 40 \cdot 8 \cdot 64 \cdot 2 \text{ bytes} = 80 \text{ KiB per token}\]
A 2048-token sequence therefore needs about 2048 × 80 KiB ≈ 0.16 GB of KV Cache, and a sequence at the full 131072-token context window needs roughly 10 GiB. Added to the ~5 GB of model weights plus runtime overhead, this is where the ~16 GB figure used in the next section comes from.
A common way to calculate the required KV Cache size is to use a calculator such as gaunernst/kv-cache-calculator on Hugging Face.
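If you prefer to do the estimate locally, the short sketch below applies the same per-token formula using the architectural values from the config.json shown above. It is only an approximation and ignores vLLM's paging granularity and runtime overhead.

# Rough KV Cache estimate for ibm-granite/granite-3.3-2b-instruct, using the
# values from its config.json:
# per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per element.
num_hidden_layers = 40
num_attention_heads = 32
num_key_value_heads = 8
hidden_size = 2048
bytes_per_element = 2                              # bfloat16 / FP16
head_dim = hidden_size // num_attention_heads      # 2048 / 32 = 64

kv_bytes_per_token = 2 * num_hidden_layers * num_key_value_heads * head_dim * bytes_per_element

for seq_len in (2048, 131072):   # a short sequence vs. the full context window
    gib = kv_bytes_per_token * seq_len / 1024**3
    print(f"{seq_len:>7} tokens -> ~{gib:.2f} GiB KV Cache per sequence")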
2. GPU Optimization
Our calculation shows a vRAM requirement of at least 16 GB for a single user at maximum context. If we target 20 GB to support a few concurrent queries, an H100 GPU with 80 GB of vRAM can easily accommodate the model. However, this leaves a significant portion of the GPU unused. To boost GPU utilization, we can leverage the H100’s slicing capabilities. The rest of this course will demonstrate how to split the GPU into Multi-Instance GPU (MIG) instances, allowing us to serve up to four models of the same size and configuration concurrently.
See also the GPU partitioning guide developed by the rh-aiservices-bu.
NVIDIA GPU Slicing/Sharing Options
1. Time-Slicing (Software-based GPU Sharing)

Time-slicing is a software-based technique that allows a single GPU to be shared by multiple processes or containers by dividing its processing time into small intervals. Each process gets a turn to use the GPU in a round-robin fashion.

How it works: The GPU scheduler allocates time slices to each process. At the end of a time slice, the scheduler preempts the current execution, saves its context, and switches to the next process. This allows multiple workloads to appear to run concurrently on the same physical GPU.

Pros: Works on virtually any NVIDIA GPU, requires no special hardware support, and is simple to enable for many small or bursty workloads.

Cons: No memory or fault isolation between workloads, unpredictable performance under contention, and context switching adds overhead.
2. Multi-Instance GPU (MIG)

MIG is a hardware-based GPU partitioning feature (NVIDIA Ampere architecture and newer) that allows a single physical GPU to be partitioned into up to seven fully isolated GPU instances, each with its own dedicated compute cores, memory, and memory bandwidth.

How it works: The physical GPU is divided into independent "MIG slices" at the hardware level. Each MIG instance acts as a fully functional, smaller GPU.

Pros: Full hardware isolation of memory, cache, and compute, with predictable performance and quality of service per instance, which suits multi-tenant serving.

Cons: Only available on supported data center GPUs (Ampere and newer), limited to predefined profiles and at most seven instances per GPU, and reconfiguring profiles typically requires draining the workloads on that GPU.
3. Multi-Process Service (MPS)

NVIDIA MPS is a CUDA feature that allows multiple CUDA applications to run concurrently on a single GPU by consolidating multiple CUDA contexts into a single server process.

How it works: An MPS server process manages all client CUDA applications, handling the scheduling and execution of kernels from multiple clients on the GPU. This reduces context switching overhead.

Pros: Lower scheduling overhead than time-slicing and better utilization when individual workloads submit small kernels that cannot saturate the GPU on their own.

Cons: Weaker fault and memory isolation than MIG; a misbehaving client can affect the shared MPS server and, with it, the other clients.
4. No GPU Partitioning (Default Exclusive Access)

By default, Kubernetes workloads are given exclusive access to their allocated GPUs. If a pod requests one GPU, it gets the entire physical GPU.

Pros: Maximum performance and complete isolation for the workload, with no additional configuration.

Cons: Low utilization and high cost when the workload, like our 2.5B-parameter model, needs only a fraction of the GPU.
Summary Comparison:
- Time-Slicing: software-based sharing with no isolation; works on most GPUs; good for light or bursty workloads.
- MIG: hardware partitioning with full isolation and predictable performance; requires supported data center GPUs and fixed profiles.
- MPS: concurrent kernel execution with low overhead; limited isolation; best for cooperating processes with small kernels.
- No partitioning: exclusive access with maximum performance; wastes capacity for small workloads.

The choice of slicing option depends heavily on the specific workloads, the GPU hardware available, and the requirements for isolation, predictability, and cost efficiency.
Combining MIG and Time-Slicing
You can configure the NVIDIA GPU Operator to enable time-slicing within a MIG instance. This means that after you’ve created a MIG instance (which provides hardware isolation from other MIG instances), you can then allow multiple pods to time-slice that specific MIG instance.
2.1. Multi-Instance GPU
NVIDIA’s Multi-Instance GPU (MIG) is a technology that allows you to partition a single physical NVIDIA data center GPU (like the A100 or H100) into multiple smaller, completely isolated, and independent GPU instances.
It’s like carving up a very powerful cake into several smaller, individual slices. Each slice can then be consumed independently without affecting the others.
The GPU cannot be split arbitrarily; there are supported MIG Profiles which differ by GPU type. For the H100, for example, a valid configuration is 1x MIG 3g.40gb, 1x MIG 2g.20gb, and 1x MIG 1g.20gb (refer to the official H100 MIG Profiles documentation for all supported combinations). With a configuration like this, multiple models could be served in parallel, with smaller slices left over for experimentation.
At the moment, the following GPUs are supported: A30, A100, H100, H200, GH200, and B200.
To change the MIG profiles, the ClusterPolicy of the NVIDIA GPU Operator for OpenShift needs to be configured.
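Once a MIG profile has been applied, the resulting instances can be inspected from any pod or node with access to the GPU. The sketch below uses the nvidia-ml-py (pynvml) bindings and assumes MIG mode is already enabled; device indices, counts, and memory sizes will depend on your configuration.

# List the MIG devices exposed by the first physical GPU and report their memory.
# Assumes the nvidia-ml-py package is installed and MIG mode is enabled.
import pynvml

pynvml.nvmlInit()
try:
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
    current_mode, _pending_mode = pynvml.nvmlDeviceGetMigMode(gpu)
    if current_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
        raise SystemExit("MIG is not enabled on this GPU")

    for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
        try:
            mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
        except pynvml.NVMLError:
            continue  # no MIG device at this index
        mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
        print(f"MIG device {i}: {mem.total / 1024**3:.0f} GiB total, "
              f"{mem.free / 1024**3:.0f} GiB free")
finally:
    pynvml.nvmlShutdown()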
3. Fair resource sharing using Kueue
Building upon optimized serving runtimes and efficient MIG-sliced GPU utilization, Kueue addresses the remaining concerns regarding fair resource sharing and workload prioritization within the OpenShift cluster.
Here are some additional use cases leveraging Kueue’s capabilities:
Use Case 1: Enforcing Fair GPU Quotas Across Teams (Preventing Resource Hogging)
- Problem: Team A, with its optimized serving runtimes, could inadvertently consume all available MIG-sliced GPU resources, leaving no capacity for Team B’s critical workloads. This leads to unfair access and potential service degradation for Team B.
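As an illustration of how such a quota could be expressed, the hedged sketch below creates a Kueue ClusterQueue that caps Team A at two 1g.20gb MIG slices using the Kubernetes Python client. The queue name, flavor name, resource name, and quota values are assumptions for this lab, and the field layout follows the Kueue v1beta1 API; verify both against the Kueue version installed on your cluster.

# Hypothetical sketch: cap Team A's share of MIG-sliced GPUs with a Kueue ClusterQueue.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

cluster_queue = {
    "apiVersion": "kueue.x-k8s.io/v1beta1",
    "kind": "ClusterQueue",
    "metadata": {"name": "team-a-cq"},
    "spec": {
        "namespaceSelector": {},  # admit Workloads from any namespace with a matching LocalQueue
        "resourceGroups": [
            {
                "coveredResources": ["cpu", "memory", "nvidia.com/mig-1g.20gb"],
                "flavors": [
                    {
                        "name": "default-flavor",  # assumes a ResourceFlavor with this name exists
                        "resources": [
                            {"name": "cpu", "nominalQuota": "16"},
                            {"name": "memory", "nominalQuota": "64Gi"},
                            # Team A may use at most two 1g.20gb MIG slices at a time.
                            {"name": "nvidia.com/mig-1g.20gb", "nominalQuota": "2"},
                        ],
                    }
                ],
            }
        ],
    },
}

api.create_cluster_custom_object(
    group="kueue.x-k8s.io",
    version="v1beta1",
    plural="clusterqueues",
    body=cluster_queue,
)

Workloads in Team A’s namespace would then reference this ClusterQueue through a LocalQueue, and Kueue admits them only while the quota has headroom.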
Use Case 2: Prioritizing Critical Runtimes Over Experiments with Preemption
- Problem: When the cluster is under heavy load, new or scaling business-critical serving instances might get stuck waiting for resources that are currently consumed by lower-priority experimental workloads (e.g., training jobs, hyperparameter sweeps).
Use Case 3: Managing Burst Capacity for Sporadic High-Priority Workloads
- Problem: Some high-priority analytical jobs or urgent model retraining tasks might sporadically require a large burst of MIG-sliced GPU resources, temporarily exceeding a team’s typical quota. Without a mechanism to handle this, these jobs might face long delays.
Use Case 4: Supporting Different Pricing Models for GPUs
- Problem: As an infrastructure provider, you often have customers who want to pay less for on-demand workloads such as training jobs. A "spot instance" model can be implemented, offering discounted GPU resources in exchange for the possibility of preemption. Customers can use unused GPU capacity at a lower cost, but if a higher-priority workload needs the resources, the spot job is interrupted.