GPU Sharing

GPU Sharing Overview

GPU sharing is a technique that allows multiple workloads to share a GPU. This is useful for small workloads that don’t require full GPU power simultaneously. This can be broken down into three maing components:

Time-slicing
Multi-Instance GPUs (MIG)
CUDA Multi-Process Service (MPS)

Figure 1. GPU Partitioning Strategies

Pros and Cons of GPU Sharing Methods

Strategy	Benefits	Cons
Time-slicing	Allows multiple workloads to share a GPU by interleaving execution.	No isolation between workloads sharing the same GPU, leading to potential crashes affecting all workloads.
	Useful for small workloads that don’t require full GPU power simultaneously.	Less control over resource allocation, making it less suitable for production environments.
		Can lead to contention and suboptimal performance if workloads compete for GPU resources.
		Not recommended for most GPU sharing scenarios.
Multi-Instance GPUs (MIG)	Provides strict isolation between workloads by partitioning the GPU into smaller instances.	Only supported on NVIDIA Ampere GPUs.
	Each instance behaves like a separate GPU with its own memory and compute resources.	Less flexible than CUDA Multi-Process Service (MPS).
	Ensures memory protection and error isolation.	Requires proper configuration and understanding of the underlying GPU architecture.
	Suitable for workloads with varying resource requirements and need for isolation.	Limited to specific GPU models (e.g., A100).
CUDA Multi-Process Service (MPS)	Enables fine-grained sharing of GPU resources among multiple pods by running CUDA kernels concurrently.	Does not provide full memory protection and error isolation between processes.
	Feels similar to CPU and memory resource allocation in Kubernetes.	Sharing with MPS is not supported on devices with MIG enabled.
	Supported on almost every CUDA-compatible GPU.	May not be suitable for workloads requiring strict isolation.
	Suitable for efficiently sharing GPU resources without strict isolation requirements.	Requires careful management to avoid resource contention and ensure optimal performance.

Strategy

Benefits

Cons

Time-slicing

Allows multiple workloads to share a GPU by interleaving execution.

No isolation between workloads sharing the same GPU, leading to potential crashes affecting all workloads.

Useful for small workloads that don’t require full GPU power simultaneously.

Less control over resource allocation, making it less suitable for production environments.

Can lead to contention and suboptimal performance if workloads compete for GPU resources.

Not recommended for most GPU sharing scenarios.

Multi-Instance GPUs (MIG)

Provides strict isolation between workloads by partitioning the GPU into smaller instances.

Only supported on NVIDIA Ampere GPUs.

Each instance behaves like a separate GPU with its own memory and compute resources.

Less flexible than CUDA Multi-Process Service (MPS).

Ensures memory protection and error isolation.

Requires proper configuration and understanding of the underlying GPU architecture.

Suitable for workloads with varying resource requirements and need for isolation.

Limited to specific GPU models (e.g., A100).

CUDA Multi-Process Service (MPS)

Enables fine-grained sharing of GPU resources among multiple pods by running CUDA kernels concurrently.

Does not provide full memory protection and error isolation between processes.

Feels similar to CPU and memory resource allocation in Kubernetes.

Sharing with MPS is not supported on devices with MIG enabled.

Supported on almost every CUDA-compatible GPU.

May not be suitable for workloads requiring strict isolation.

Suitable for efficiently sharing GPU resources without strict isolation requirements.

Requires careful management to avoid resource contention and ensure optimal performance.