GPU Timeslicing
Overview
GPU timeslicing is a software-based GPU sharing technique that enables multiple workloads to share a single GPU by time-dividing access to GPU compute resources. Unlike hardware-based partitioning methods like Multi-Instance GPU (MIG), timeslicing provides temporal sharing where workloads get sequential access to the full GPU for short time slices.
This approach is particularly valuable for:
- AI inference workloads that don’t fully utilize GPU resources
- Development and testing environments requiring flexible GPU access
- Multi-tenant scenarios where isolation requirements are less stringent
- Edge deployments with limited GPU resources
- Cost optimization by maximizing GPU utilization
Timeslicing enables organizations to dramatically improve GPU utilization while reducing costs, making AI workloads more accessible and economical.
How Timeslicing Works
Core Mechanism
GPU timeslicing operates by:
- Time Division: The GPU scheduler allocates time slices to different workloads in a round-robin fashion (a simulation sketch follows this list)
- Context Switching: When a workload’s time slice ends, the GPU context is saved and the next workload’s context is restored
- Memory Sharing: All workloads share the same GPU memory space without isolation
- Equal Time Allocation: By default, each workload receives an equal share of GPU compute time
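To make the round-robin behavior concrete, here is a toy simulation of timesliced scheduling. It is purely illustrative: the slice length, context-switch cost, and workload durations are assumed values, not measurements of real GPU behavior.

# Toy round-robin timeslicing simulation; all numbers are illustrative assumptions.
from collections import deque

SLICE_MS = 2.0         # assumed time-slice length (e.g. a "SHORT" policy)
SWITCH_COST_MS = 0.05  # assumed context-switch overhead per switch

def simulate(workloads):
    """workloads: mapping of name -> remaining GPU work in milliseconds."""
    queue = deque(workloads.items())
    clock = 0.0
    while queue:
        name, remaining = queue.popleft()
        run = min(SLICE_MS, remaining)
        clock += run + SWITCH_COST_MS        # run the slice, then switch context
        remaining -= run
        if remaining > 0:
            queue.append((name, remaining))  # rejoin the back of the queue
        else:
            print(f"{name} finished at t={clock:.2f} ms")

simulate({"inference-a": 6.0, "inference-b": 3.0, "batch-job": 9.0})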
NVIDIA GPU Architecture Support
Modern NVIDIA GPUs (Pascal architecture and newer) support compute preemption, enabling efficient timeslicing:
- Hardware preemption allows the GPU to interrupt running kernels (a quick capability check follows this list)
- Context switching overhead is minimized through hardware optimization
- Configurable time slices can be adjusted based on workload requirements
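Whether a particular GPU is Pascal (compute capability 6.x) or newer can be checked programmatically. A minimal sketch using the NVML Python bindings; it assumes the nvidia-ml-py package and an NVIDIA driver are installed:

# Check whether each GPU is Pascal (compute capability 6.0) or newer.
# Requires the nvidia-ml-py package: pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):
            name = name.decode()
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        preemptible = major >= 6  # Pascal and newer support compute preemption
        print(f"GPU {i}: {name}, compute capability {major}.{minor}, "
              f"timeslicing-friendly: {preemptible}")
finally:
    pynvml.nvmlShutdown()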
Timeslice Configuration
The time slice duration can be configured through nvidia-smi:
# Configure time slice duration
nvidia-smi compute-policy --set-timeslice=1 # SHORT (2ms)
nvidia-smi compute-policy --set-timeslice=2 # MEDIUM (10ms)
nvidia-smi compute-policy --set-timeslice=3 # LONG (30ms)
Scheduling Behavior
The GPU scheduler manages workload execution by:
- Equal sharing: Each active process receives equal GPU time regardless of resource requests (a two-process sketch follows this list)
- No proportional allocation: Requesting more GPU "replicas" doesn’t guarantee proportional compute time
- Memory contention: All workloads compete for the same memory pool
- No fault isolation: Errors in one workload can potentially affect others
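The equal-sharing point can be observed directly by running two identical GPU-bound processes on one GPU and comparing their throughput; each typically completes roughly half as many iterations as it would alone. This sketch assumes PyTorch with CUDA is available, and the matrix size and duration are arbitrary:

# Run two identical GPU-bound workers on the same GPU to observe time sharing.
# Assumes PyTorch with CUDA; sizes and durations are illustrative.
import time
import torch.multiprocessing as mp

def burn(name, seconds=10):
    import torch
    x = torch.randn(4096, 4096, device="cuda")
    iters, start = 0, time.time()
    while time.time() - start < seconds:
        x = x @ x                     # keep the GPU busy
        torch.cuda.synchronize()
        iters += 1
    print(f"{name}: {iters} matmuls in {seconds}s")

if __name__ == "__main__":
    mp.set_start_method("spawn")
    workers = [mp.Process(target=burn, args=(f"worker-{i}",)) for i in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()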
vLLM and Timeslicing
vLLM Architecture Overview
vLLM is a high-performance LLM inference engine that can benefit significantly from GPU timeslicing for serving multiple models or handling multiple concurrent requests.
Key vLLM components that interact with timeslicing:
- Continuous Batching: vLLM’s scheduler can work with timesliced GPUs to serve multiple requests
- PagedAttention: Memory management that works efficiently with shared GPU memory
- KV Cache Management: Optimized for memory-constrained environments common with timeslicing
vLLM Timeslicing Benefits
When using vLLM with timeslicing:
- Multiple Model Serving: Different vLLM instances can serve different models on the same GPU (an example follows this list)
- Improved Throughput: Better utilization through temporal multiplexing
- Cost Reduction: Serve more models with fewer GPUs
- Flexibility: Dynamic allocation of GPU resources based on demand
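As a concrete example of the multiple-model pattern, two vLLM OpenAI-compatible servers can be started on the same GPU, each limited to a fraction of GPU memory. This is a sketch only: the model names, ports, and memory fractions are assumptions, and the GPU must have enough memory to hold both models.

# Launch two vLLM API servers that share one timesliced GPU.
# Model names, ports, and memory fractions are illustrative assumptions.
import subprocess

def launch(model, port, mem_fraction):
    return subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(mem_fraction),
        "--max-num-seqs", "16",
    ])

servers = [
    launch("meta-llama/Llama-2-7b-chat-hf", 8000, 0.45),
    launch("mistralai/Mistral-7B-Instruct-v0.2", 8001, 0.45),
]
for server in servers:
    server.wait()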
vLLM Configuration for Timeslicing
Configure vLLM for timesliced environments:
# Reduced memory allocation for a shared GPU
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="llama-7b",
        gpu_memory_utilization=0.5,   # Use only 50% of GPU memory
        max_num_batched_tokens=1024,  # Smaller batch sizes
        max_num_seqs=16,              # Fewer concurrent sequences
    )
)
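A brief usage sketch for the engine configured above; the prompt, request id, and sampling values are arbitrary examples:

# Minimal async generation against the engine configured above.
import asyncio
from vllm import SamplingParams

async def main():
    params = SamplingParams(temperature=0.7, max_tokens=64)
    final = None
    async for output in engine.generate("Explain GPU timeslicing.",
                                        params, request_id="demo-1"):
        final = output                 # stream of partial RequestOutput objects
    print(final.outputs[0].text)

asyncio.run(main())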
vLLM Scheduling Considerations
Recent research shows that vLLM’s scheduling overhead can become significant:
- CPU Scheduling Overhead: Up to 50% of total inference time in some cases
- Tensor Processing: Major overhead from building input tensors and detokenization
- Optimization Opportunities: Reduce scheduling frequency, optimize tensor operations
For timesliced environments, consider:
- Using larger batch sizes when GPU time is available (a burst-submission sketch follows this list)
- Implementing asynchronous tensor processing
- Optimizing for burst workloads common in shared GPU scenarios
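One way to act on the first point is to submit request bursts concurrently, so vLLM's continuous batching can fill each time slice with a larger batch. A sketch reusing the engine configured earlier; the prompts and request ids are made up:

# Submit a burst of prompts concurrently so continuous batching can
# build larger batches while the GPU slice is available.
import asyncio
from vllm import SamplingParams

async def complete(prompt, request_id):
    params = SamplingParams(max_tokens=32)
    result = None
    async for output in engine.generate(prompt, params, request_id=request_id):
        result = output
    return result.outputs[0].text

async def burst(prompts):
    return await asyncio.gather(
        *(complete(p, f"burst-{i}") for i, p in enumerate(prompts))
    )

texts = asyncio.run(burst(["What is MIG?", "What is timeslicing?", "What is MPS?"]))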
Timeslicing Fundamentals
- Configure NVIDIA GPU timeslicing using nvidia-smi
- Deploy multiple workloads on a single GPU
- Monitor GPU utilization and context switching overhead (a monitoring sketch follows this list)
- Compare performance with and without timeslicing
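For the monitoring step, a small NVML-based poller can report utilization, memory use, and how many processes are sharing the GPU. A sketch, assuming nvidia-ml-py is installed:

# Poll GPU utilization and the processes currently sharing the GPU.
# Requires nvidia-ml-py: pip install nvidia-ml-py
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    for _ in range(10):               # ten samples, one per second
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        print(f"gpu={util.gpu}% mem={mem.used / 2**20:.0f} MiB "
              f"compute_procs={len(procs)}")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()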
Configuration and Setup
- Deploy NVIDIA GPU Operator with timeslicing configuration (a config sketch follows this list)
- Configure Kubernetes device plugin for GPU sharing
- Set up node labeling for heterogeneous GPU configurations
- Implement dynamic configuration updates
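For the GPU Operator step, the device plugin consumes a small time-slicing configuration document. The sketch below builds and prints that document from Python; the resource name and replica count are example values, and the exact schema should be verified against the GPU Operator documentation for your version.

# Build the device-plugin time-slicing config used with the NVIDIA GPU Operator.
# Replica count and resource name are example values; verify the schema against
# your GPU Operator version's documentation.
import yaml  # pip install pyyaml

timeslicing_config = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [
                # Advertise each physical GPU as 4 schedulable "replicas".
                {"name": "nvidia.com/gpu", "replicas": 4},
            ]
        }
    },
}

# Store this document under a key in the ConfigMap that the operator's
# device plugin configuration references.
print(yaml.safe_dump(timeslicing_config, sort_keys=False))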