Lab Introduction: GPU as a Service with GPU slicing and Kueue

This lab explores GPU slicing and its application in workload prioritization. To fully grasp the significance of these topics, a solid understanding of workload sizing is essential. Therefore, we will begin by demonstrating the vRAM calculation required to serve an ibm-granite/granite-7b-base model. This example will then underscore the importance of slicing GPUs into smaller units.

1. Model Size

We previously introduced vLLM (Red Hat Inference Server) and its use of PagedAttention, a technique that optimizes GPU memory management, reducing the memory required to serve models. To further enhance GPU utilization and maximize the return on investment (ROI) of this expensive hardware, additional optimizations are crucial. Let’s demonstrate this by calculating the memory consumption when serving the ibm-granite/granite-7b-base model.

  • Model Weights: This is the memory required to load the model’s parameters into the GPU.

  • KV Cache (Key-Value Cache): This is dynamic memory used to store the attention keys and values for active sequences (prompts and generated tokens). vLLM’s PagedAttention mechanism is designed to optimize this, but it still consumes significant memory, especially with high concurrency and long sequence lengths.

1.1. Model Weights

The granite-7b-base model is a 7-billion parameter model. The memory usage for model weights depends on the data type (precision) you load it in:

  • FP16 (Half Precision) or BF16 (BFloat16): Each parameter uses 2 bytes. This is the most common and recommended precision for inference to save memory while maintaining good performance.

    • 7B parameters * 2 bytes/parameter = 14 GB

  • INT8 (8-bit Quantization): Each parameter uses 1 byte.

    • 7B parameters * 1 byte/parameter = 7 GB

vLLM typically defaults to FP16 (or bfloat16 if supported by the GPU) for an optimal balance of speed and memory, so expect around 14 GB for the model weights alone; the same applies if you explicitly pass --dtype float16 or --dtype bfloat16.
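
As a quick sanity check, the arithmetic above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch; the byte-per-parameter table and the 7e9 parameter count are the assumptions used in this section.

# Rough weight-memory estimate per precision (decimal GB, as used above).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """GPU memory needed just to hold the model weights."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

print(weight_memory_gb(7e9, "bf16"))  # 14.0 GB
print(weight_memory_gb(7e9, "int8"))  # 7.0 GB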

1.2. KV Cache (Key-Value Cache)

The KV cache memory usage is dynamic and depends on several factors:

  • max_model_len (or max_context_len): The maximum sequence length (input prompt + generated output) that the model will process. A longer max_model_len means a larger potential KV cache.

  • Number of Attention Heads and Hidden Dimensions: These are model-specific architectural parameters.

  • Batch Size / Concurrent Requests: More concurrent requests mean more KV cache entries.

  • gpu-memory-utilization: vLLM’s parameter that controls the fraction of total GPU memory to be used for the model executor, including the KV cache. By default, it’s 0.9 (90%).
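
To make these parameters concrete, here is a minimal, hypothetical sketch using vLLM's offline Python API. The values simply mirror the discussion above and are not tuned recommendations.

from vllm import LLM

# Minimal sketch (illustrative values, not tuned recommendations):
# - dtype selects the weight precision (about 14 GB for a 7B model in bf16)
# - max_model_len caps prompt + generation length, and with it the per-sequence KV cache
# - gpu_memory_utilization is the fraction of GPU memory vLLM may claim for weights + KV cache
llm = LLM(
    model="ibm-granite/granite-7b-base",
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.9,
)

print(llm.generate(["What is GPU slicing?"])[0].outputs[0].text)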

1.3. General Estimation for KV Cache (for granite-7b-base)

While precise calculation requires knowing the exact attention head configuration and desired max_model_len, here’s a rough idea:

  • For a 7B model, the KV cache can add a significant amount, often a few GBs, and can even exceed the model weights if you’re handling many concurrent, long sequences.

  • For example, a 7B model processing around 2,048 tokens in FP16 needs roughly 1 GB of additional KV cache per sequence (a rough figure that follows from the exact derivation below; it grows linearly with batch size and max_model_len).

The most reliable source for a model’s architectural parameters is its configuration file, usually named config.json, found alongside the model weights on Hugging Face Hub.

For ibm-granite/granite-7b-base, you would look for its config.json file on its Hugging Face model page: granite-7b-base.

granite-7b-base config.json
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.36.2",
  "use_cache": true,
  "vocab_size": 32000
}
OPTIONAL: Math behind the exact estimation of the KV Cache size

Let’s break down the estimated memory usage for ibm-granite/granite-7b-base.

Model Configuration from config.json

Based on the config.json for ibm-granite/granite-7b-base (as found on Hugging Face), here are the relevant parameters:

  • hidden_size: 4096

  • num_attention_heads: 32

  • num_hidden_layers: 32

  • max_position_embeddings: 4096 (This will be our max_model_len for calculation purposes)

a. Calculate head_dim

The dimension of each attention head is hidden_size divided by num_attention_heads:

head_dim = hidden_size / num_attention_heads = 4096 / 32 = 128

b. Calculate KV Cache Size per Token (per layer)

For each token, in each layer, we store a Key (K) and a Value (V) vector. Both K and V have head_dim dimensions for each attention head.

KV Cache Size per token (per layer) = 2 * num_attention_heads * head_dim * bytes_per_value
                                    = 2 * 32 * 128 * 2 bytes/value
                                    = 16384 bytes
                                    = 16 KB

c. Total KV Cache Size (for max_model_len, single sequence)

Now, we multiply by the max_model_len (from max_position_embeddings) and num_hidden_layers to get the maximum KV cache size for one full-length sequence:

Total KV Cache Size (single sequence) = KV Cache Size per token (per layer) * max_model_len * num_hidden_layers
                                      = 16 KB/token/layer * 4096 tokens * 32 layers
                                      = 16 KB/token/layer * 131072 token-layers
                                      = 2097152 KB
                                      = 2048 MB
                                      = 2 GB

This 2 GB is the maximum KV cache memory required for a single sequence that utilizes the full 4096 context window.
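
The same derivation can be expressed as a small Python helper. This is a sketch that assumes the standard KV-cache layout used above; the parameter values come straight from config.json.

def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_value: int = 2, num_seqs: int = 1) -> float:
    """Approximate KV-cache size: 2 (K and V) * kv_heads * head_dim * bytes, per token per layer."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value
    return per_token_per_layer * num_layers * seq_len * num_seqs / 1024**3

# granite-7b-base: 32 layers, 32 KV heads, head_dim = 4096 / 32 = 128
print(kv_cache_gb(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096))              # 2.0 GB
print(kv_cache_gb(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=2048, num_seqs=5))  # 5.0 GB

The second call anticipates the concurrency example below: five sequences, each using about half of the 4096-token context window, add roughly 5 GB of KV cache.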

Total Estimated GPU Memory for ibm-granite/granite-7b-base on vLLM (FP16)

Combining the model weights (FP16) and a typical KV cache for vLLM serving:

  • Model Weights (FP16): Approximately 14 GB

  • KV Cache (max single sequence): Approximately 2 GB

Total Minimum GPU Memory: 14 GB (model weights) + 2 GB (max single sequence KV cache) = 16 GB

However, this is just for one active sequence. vLLM is designed for high throughput, meaning it handles multiple concurrent requests.

If you have, for example, 5 concurrent sequences each using roughly half of the max_model_len on average, the KV cache could easily demand another 5-10 GB (or more, depending on gpu-memory-utilization).

Therefore, for comfortable serving of ibm-granite/granite-7b-base on vLLM:

  • A GPU with 16 GB of vRAM (like an RTX 4080) might just barely fit if you strictly limit concurrency and context length.

  • 24 GB of vRAM or more (like an RTX 3090, an RTX 4090, or an A100) offers much more headroom for the KV cache to scale with concurrent requests and longer sequence lengths, making it a much more suitable choice for production serving.

  • If you need to fit it on smaller GPUs (e.g., 12GB), you would absolutely need to use 8-bit or 4-bit quantization for the model weights, which would bring the base model memory down to 7 GB or 3.5 GB respectively, leaving more room for the KV cache.

A common way to estimate the required KV cache is to use a calculator such as gaunernst/kv-cache-calculator on Hugging Face.

2. GPU Optimization

Our previous example showed a vRAM requirement of 16 GB for a single user. Assuming we target 20 GB of vRAM for a few concurrent queries, an H100 GPU with 80 GB of vRAM can certainly accommodate the model. However, this leaves a significant portion of GPU capacity unused, leading to a low return on investment (ROI). To boost GPU utilization, we can leverage the H100's slicing capabilities. The rest of this course will demonstrate how to split the GPU into Multi-Instance GPU (MIG) instances, allowing us to serve up to four models of the same size and configuration concurrently. See also the GPU partitioning guide developed by the rh-aiservices-bu.

OPTIONAL: NVIDIA GPU Slicing/Sharing Options

1. Time-Slicing (Software-based GPU Sharing)

Time-slicing is a software-based technique that allows a single GPU to be shared by multiple processes or containers by dividing its processing time into small intervals. Each process gets a turn to use the GPU in a round-robin fashion.

How it works:

  • The GPU scheduler allocates time slices to each virtual GPU (vGPU) or process.

  • At the end of a time slice, the GPU scheduler preempts the current execution, saves its context, and switches to the next process’s context.

  • This allows multiple workloads to appear to run concurrently on the same physical GPU.

Pros:

  • Cost Efficiency: Maximizes the utilization of expensive GPUs, as multiple applications can share a single GPU instead of requiring dedicated ones. This is particularly beneficial for small-to-medium sized workloads that don’t fully utilize a GPU.

  • Concurrency: Enables multiple applications or users to access the GPU simultaneously.

  • Broad Compatibility: Works with almost all NVIDIA GPU architectures, including older generations that don’t support MIG.

  • Flexibility: Can accommodate a variety of workloads, from machine learning to graphics rendering.

  • Simple to Implement (in Kubernetes): Can be configured using the NVIDIA GPU operator and device plugin in Kubernetes by specifying the number of replicas for a GPU resource (see the sketch at the end of this subsection).

Cons:

  • No Memory or Fault Isolation: This is a significant drawback. If one task crashes or misbehaves, it can potentially affect other tasks sharing the same GPU. There’s no hardware-enforced isolation.

  • Potential Latency/Overhead: The context switching between tasks introduces some overhead, which can impact real-time or latency-sensitive applications. Performance can be less predictable compared to dedicated resources.

  • Resource Starvation (without proper management): Without careful configuration, some tasks might get more GPU time than others, leading to performance degradation or starvation for less prioritized workloads.

  • No Fixed Resource Guarantees: While time is shared, there’s no guarantee of a fixed amount of memory or compute resources for each "slice," which can lead to unpredictable performance.

Time-slicing policies (e.g., in NVIDIA vGPU):

  • Best Effort: Default, round-robin. Aims for maximum utilization but can suffer from "noisy neighbor" issues.

  • Equal Share: Distributes compute time evenly, dynamically adjusting as vGPUs are added or removed. Avoids starvation but can lead to underutilization if some vGPUs are idle.

  • Fixed Share: Provides a fixed compute allocation based on vGPU profile size, ensuring consistent performance but potentially resulting in underutilization if vGPUs are idle.
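
As a hedged sketch of how time-slicing is enabled with the GPU Operator (the ConfigMap name, namespace, ConfigMap key, and replica count are illustrative assumptions; the operator's ClusterPolicy must additionally reference the ConfigMap through its devicePlugin.config field):

from kubernetes import client, config

# Time-slicing configuration read by the NVIDIA device plugin: each physical GPU is
# advertised as 4 schedulable nvidia.com/gpu resources. All names here are illustrative.
TIME_SLICING = """\
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
"""

config.load_kube_config()
client.CoreV1Api().create_namespaced_config_map(
    namespace="nvidia-gpu-operator",
    body=client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name="time-slicing-config", namespace="nvidia-gpu-operator"),
        data={"any": TIME_SLICING},
    ),
)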

2. Multi-Instance GPU (MIG)

MIG is a hardware-based GPU partitioning feature introduced with NVIDIA's Ampere architecture and carried forward to newer data-center GPUs (A100, A30, H100, etc.). It allows a single physical GPU to be partitioned into up to seven fully isolated GPU instances, each with its own dedicated compute cores, memory, and memory bandwidth.

How it works:

  • The physical GPU is divided into independent "MIG slices" at the hardware level.

  • Each MIG instance acts as a fully functional, smaller GPU.

  • Workloads running on different MIG instances are completely isolated from each other.

Pros:

  • Hardware Isolation: Provides strong memory and fault isolation between instances. A crash or misbehavior in one MIG instance does not affect others.

  • Predictable Performance: Each instance has dedicated resources, offering consistent and predictable performance guarantees, which is crucial for critical workloads.

  • Optimized Resource Utilization (for diverse workloads): Efficiently shares GPU resources among multiple users and workloads that have varying requirements, maximizing utilization without compromising performance predictability.

  • Dynamic Partitioning: Administrators can dynamically adjust the number and size of MIG instances to adapt to changing workload demands.

  • Enhanced Security: Hardware isolation prevents potential security breaches or data leaks between instances.

Cons:

  • Hardware Requirement: Only supported on NVIDIA Ampere and Hopper architecture GPUs (A100, A30, H100, etc.). Not available on older GPUs.

  • Coarse-Grained Control: While it provides isolation, the partitioning is based on predefined MIG profiles, which might not always perfectly align with every workload’s exact resource needs.

  • Fixed Resource Allocation: Once an MIG instance is created with a specific profile, its resources are fixed, which might lead to some underutilization if a workload doesn’t fully consume its allocated MIG slice.

  • Complexity: Setting up and managing MIG can be more complex than simple time-slicing.

3. Multi-Process Service (MPS)

NVIDIA MPS is a CUDA feature that allows multiple CUDA applications to concurrently execute on a single GPU. It achieves this by consolidating multiple CUDA contexts into a single process, which then submits work to the GPU.

How it works:

  • An MPS server process manages all client CUDA applications.

  • The MPS server handles the scheduling and execution of kernels from multiple clients on the GPU.

  • It reduces context switching overhead by sharing a single set of GPU scheduling resources among its clients.

Pros:

  • Improved GPU Utilization: Allows kernels and memory copy operations from different processes to overlap on the GPU, leading to higher utilization.

  • Reduced Overhead: Minimizes context switching overhead compared to default time-slicing, as it maintains fewer GPU contexts.

  • Concurrent Execution: Enables multiple CUDA applications to run in parallel on the same GPU.

  • Fine-Grained Control (with some limitations): Can offer some level of control over resource allocation, though not as strict as MIG.

Cons:

  • No Memory Protection/Error Isolation: Similar to time-slicing, MPS generally doesn’t provide strong memory or fault isolation between clients. A misbehaving client can impact others.

  • Limited to CUDA Applications: Primarily designed for CUDA workloads.

  • Compatibility: While supported by most GPU architectures, combining MPS with MIG is currently not supported by the NVIDIA GPU operator.

  • Potential for Undefined States: Terminating an MPS client without proper synchronization can leave the MPS server and other clients in an undefined state.

4. No GPU Partitioning (Default Exclusive Access)

By default, Kubernetes workloads are given exclusive access to their allocated GPUs. If a pod requests one GPU, it gets the entire physical GPU.

Pros:

  • Simplicity: Easiest to configure and manage, as no special slicing mechanisms are needed.

  • Maximum Performance for Single Workload: A single workload has dedicated access to the entire GPU, ensuring maximum performance and predictability for that specific task.

  • Full Isolation (at the GPU level): Each workload runs on its own GPU, providing complete isolation from other workloads on different GPUs.

Cons:

  • Low GPU Utilization: If a workload doesn’t fully saturate the GPU, significant computational power can be wasted, leading to underutilization of expensive hardware.

  • Higher Costs: Requires more GPUs to run multiple smaller workloads concurrently, increasing infrastructure costs.

  • Inefficient for Small Workloads: Not suitable for many small-to-medium sized tasks that could easily share a GPU.

Summary Comparison:

| Feature/Option | Time-Slicing | Multi-Instance GPU (MIG) | Multi-Process Service (MPS) | Default (Exclusive Access) |
|---|---|---|---|---|
| Method | Software-based time sharing | Hardware-level partitioning | Software-based context consolidation | Full GPU allocation |
| Isolation | None (shared memory/fault domain) | Hardware-enforced (dedicated memory/compute) | Limited/None (shared memory/fault domain) | Full (dedicated GPU) |
| Predictable Perf. | Low (noisy neighbor potential) | High (dedicated resources) | Medium (better than time-slicing, but not isolated) | High (full access) |
| GPU Utilization | High (for multiple small workloads) | High (for diverse workloads with isolation needs) | High (for concurrent CUDA apps) | Low (for small workloads) |
| Hardware Req. | All NVIDIA GPUs | Ampere/Hopper architecture (A100, A30, H100, etc.) | Most NVIDIA GPUs (CUDA-focused) | All NVIDIA GPUs |
| Use Case | Multiple small, non-critical, or batch workloads | Mixed workloads requiring isolation and predictability | Concurrent CUDA applications, improved throughput | Large, performance-critical, single-task workloads |
| Complexity | Medium | High | Medium | Low |

The choice of slicing option depends heavily on the specific workloads, the GPU hardware available, and the requirements for isolation, predictability, and cost efficiency.

2.1. Multi-Instance GPU

NVIDIA’s Multi-Instance GPU (MIG) is a groundbreaking technology that allows you to partition a single physical NVIDIA data center GPU (like the A100 or H100) into multiple smaller, completely isolated, and independent GPU instances.

It’s like carving a single, very powerful cake into several smaller, perfectly formed slices. Each slice can then be consumed independently, without affecting the others.

The GPU cannot be split arbitrarily; each GPU type supports a fixed set of MIG profiles. On an H100 80 GB, for example, one valid layout is 3x MIG 2g.20gb plus 1x MIG 1g.10gb. With this configuration, three models could be served in parallel and a 1g.10gb slice would still be left over for experiments. At the moment, the following GPUs support MIG: A30, A100, H100, H200, and B200. To change the MIG profiles, the NVIDIA GPU Operator for OpenShift needs to be reconfigured, as sketched below.
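
As a hedged sketch (the node name and profile name are illustrative; mixed layouts such as the one above require a custom mig-parted configuration), the GPU Operator's MIG manager applies whatever profile is requested through the nvidia.com/mig.config node label:

from kubernetes import client, config

# Request a MIG layout for one GPU node; the MIG manager reconfigures the GPU accordingly.
# "all-1g.10gb" is one of the operator's default single-profile configs; the node name is illustrative.
config.load_kube_config()
client.CoreV1Api().patch_node(
    "gpu-node-1",
    {"metadata": {"labels": {"nvidia.com/mig.config": "all-1g.10gb"}}},
)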

3. Fair resource sharing using Kueue

Building upon the optimized serving runtimes and efficient MIG-sliced GPU utilization, Kueue directly addresses remaining concerns regarding fair resource sharing and workload prioritization within the OpenShift cluster.

Here are some additional use cases, leveraging Kueue’s capabilities:

Use Case 1: Enforcing Fair GPU Quotas Across Teams (Preventing Resource Hogging)

  • Problem: Team A, with its optimized serving runtimes and potentially many experimentation jobs, could inadvertently (or intentionally) consume all available MIG-sliced GPU resources, leaving no capacity for Team B’s critical serving runtimes or development efforts. This leads to unfair access, increased waiting times for Team B, and potential service degradation.
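
A hedged sketch of how Kueue can express such a guarantee (queue, cohort, flavor, and quota values are illustrative, and a ResourceFlavor named h100-mig is assumed to exist): each team gets a ClusterQueue with a nominal quota of MIG slices, and workloads are only admitted while that quota has free capacity.

from kubernetes import client, config

# Illustrative per-team ClusterQueue with a nominal quota of two MIG 2g.20gb slices.
config.load_kube_config()
team_a_queue = {
    "apiVersion": "kueue.x-k8s.io/v1beta1",
    "kind": "ClusterQueue",
    "metadata": {"name": "team-a"},
    "spec": {
        "namespaceSelector": {},      # admit workloads from any namespace
        "cohort": "gpu-cohort",       # queues in a cohort can lend and borrow unused quota
        "resourceGroups": [{
            "coveredResources": ["nvidia.com/mig-2g.20gb"],
            "flavors": [{
                "name": "h100-mig",   # assumes a ResourceFlavor with this name exists
                "resources": [{"name": "nvidia.com/mig-2g.20gb", "nominalQuota": 2}],
            }],
        }],
    },
}
client.CustomObjectsApi().create_cluster_custom_object(
    group="kueue.x-k8s.io", version="v1beta1", plural="clusterqueues", body=team_a_queue,
)

Namespaces then point at this ClusterQueue through a LocalQueue, and workloads opt in via the kueue.x-k8s.io/queue-name label.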

Use Case 2: Prioritizing Critical Serving Runtimes Over Experiments with Preemption

  • Problem: When the cluster is under heavy load, new or scaling serving runtime instances (which are business-critical) might get stuck waiting for resources behind lower-priority experimental workloads (e.g., ad-hoc training jobs, hyperparameter sweeps) that are already running and consuming MIG-sliced GPUs. This can impact service availability and user experience.
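
A hedged sketch of the Kueue side of this (names and the priority value are assumptions): a WorkloadPriorityClass marks serving runtimes as high priority, and a preemption policy on the ClusterQueue lets them evict lower-priority experimental workloads. Workloads select the class via the kueue.x-k8s.io/priority-class label.

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# High priority class for business-critical serving runtimes (value is illustrative).
api.create_cluster_custom_object(
    group="kueue.x-k8s.io", version="v1beta1", plural="workloadpriorityclasses",
    body={
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "WorkloadPriorityClass",
        "metadata": {"name": "serving-critical"},
        "value": 1000,  # higher value = higher priority
        "description": "Business-critical model serving runtimes",
    },
)

# Allow queued high-priority workloads to preempt running lower-priority ones.
api.patch_cluster_custom_object(
    group="kueue.x-k8s.io", version="v1beta1", plural="clusterqueues", name="team-a",
    body={"spec": {"preemption": {
        "withinClusterQueue": "LowerPriority",
        "reclaimWithinCohort": "Any",
    }}},
)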

Use Case 3: Managing Burst Capacity for Sporadic High-Priority Workloads

  • Problem: Some high-priority analytical jobs or urgent model retraining tasks might sporadically require a large burst of MIG-sliced GPU resources, temporarily exceeding a team’s typical quota. Without a mechanism to handle this, these jobs might face long delays.
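
A hedged sketch (numbers are illustrative): within a cohort, a ClusterQueue may temporarily borrow unused quota from its peers up to a borrowingLimit, which is how such bursts can be absorbed without permanently raising the team's quota.

from kubernetes import client, config

# Raise team-a's burst capacity: up to 2 extra MIG slices may be borrowed from idle peer queues.
config.load_kube_config()
client.CustomObjectsApi().patch_cluster_custom_object(
    group="kueue.x-k8s.io", version="v1beta1", plural="clusterqueues", name="team-a",
    body={"spec": {"resourceGroups": [{
        "coveredResources": ["nvidia.com/mig-2g.20gb"],
        "flavors": [{
            "name": "h100-mig",
            "resources": [{
                "name": "nvidia.com/mig-2g.20gb",
                "nominalQuota": 2,
                "borrowingLimit": 2,  # may temporarily borrow up to 2 extra slices
            }],
        }],
    }]}},
)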

Use Case 4: Different Pricing Models for GPUs

  • Problem: As an infrastructure provider, you want to offer customers a cheaper pricing tier for interruptible workloads such as training jobs. This model offers significantly discounted GPU resources in exchange for the possibility of preemption: customers bid for unused GPU capacity, and if a higher-priority or on-demand workload needs the resources, the spot workload is interrupted (preempted), typically with only a short warning.