Distributed Serving
Distributed Inference with vLLM
Serving large models often leads to memory bottlenecks, such as the dreaded CUDA out of memory error. To tackle this, there are two main solutions:
- Reduce precision: Utilizing FP8 and lower-bit quantization methods can reduce memory usage. However, this approach may impact accuracy and scalability, and is not sufficient by itself as models grow beyond hundreds of billions of parameters.
- Distributed inference: Spreading model computations across multiple GPUs or nodes enables scalability and efficiency. This is where distributed architectures like tensor parallelism and pipeline parallelism come into play.
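As a quick illustration of the two options, here is a minimal sketch using vLLM's offline Python API. The model name is a placeholder, FP8 support depends on your GPU and vLLM version, and the tensor-parallel variant assumes two GPUs are visible to the process:

```python
from vllm import LLM, SamplingParams

# Option 1: reduce precision -- load the model with on-the-fly FP8 quantization.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

# Option 2: distributed inference -- shard the model across 2 GPUs with tensor parallelism.
# (Create only one engine per process; this line is an alternative to the one above.)
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```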
vLLM Architecture and Large Language Model Inference Challenges
LLM inference poses unique challenges compared to training:
- Unlike training, which focuses purely on throughput with known static shapes, inference requires low latency and dynamic workload handling.
- Inference workloads must efficiently manage KV caches, speculative decoding, and prefill-to-decode transitions.
- Large models often exceed single-GPU capacity, requiring advanced parallelization strategies.
To address these issues, vLLM provides:
- Tensor parallelism to shard each model layer across multiple GPUs within a node.
- Pipeline parallelism to distribute contiguous sections of model layers across multiple nodes.
- Data parallelism to distribute data across multiple GPUs, with each GPU holding a copy of the model and processing a different portion of the data concurrently.
- Expert parallelism (for Mixture-of-Experts models) to assign specific experts to dedicated GPUs, ensuring efficient utilization and reducing redundancy. For the attention layers, batched sequences are distributed between GPUs without duplicating the KV cache, improving memory efficiency.
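The tensor- and pipeline-parallel settings are covered in the configuration examples later in this section. For data and expert parallelism, a minimal sketch is shown below; it assumes a recent vLLM release where data_parallel_size and enable_expert_parallel are exposed as engine arguments (these are newer than the tensor/pipeline ones, so verify against your installed version's documentation) and uses a placeholder Mixture-of-Experts model name:

```python
from vllm import LLM

# Sketch: combining parallelism strategies for a Mixture-of-Experts model.
# tensor_parallel_size and pipeline_parallel_size are long-standing arguments;
# data_parallel_size and enable_expert_parallel are assumed to be available in
# recent releases -- check your vLLM version before relying on them.
llm = LLM(
    model="my-org/my-moe-model",   # placeholder model name
    tensor_parallel_size=2,        # shard attention/MLP weights across 2 GPUs per replica
    data_parallel_size=2,          # run 2 model replicas and split incoming batches between them
    enable_expert_parallel=True,   # place experts on dedicated GPUs instead of sharding each expert
)
```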
Choosing the Right Distributed Inference Strategy
Before implementing distributed inference, it’s essential to determine when to use it and which strategies are most appropriate. The decision depends on your model size, hardware resources, and performance requirements.
Strategy Selection Guidelines
- Single GPU (No Distributed Inference): If your model fits comfortably in a single GPU's memory, distributed inference is unnecessary. This is the simplest deployment option with minimal overhead.
- Single-Node Multi-GPU (Tensor Parallelism): When your model exceeds single-GPU capacity but fits within a single node with multiple GPUs, use tensor parallelism. Set the tensor parallel size equal to the number of available GPUs. Example configuration: tensor_parallel_size = 4 for 4 GPUs in a single node (see the sketch after this list).
- Multi-Node Multi-GPU (Tensor + Pipeline Parallelism): For models that exceed single-node capacity, combine tensor parallelism with pipeline parallelism. The tensor parallel size represents GPUs per node, while the pipeline parallel size represents the number of nodes. Example configuration: tensor_parallel_size = 8 and pipeline_parallel_size = 2 for 16 GPUs across 2 nodes (8 GPUs per node).
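The two example configurations above translate roughly as follows with vLLM's Python API. The model names are placeholders, only one engine would be created per process in practice, and the multi-node case assumes the 16 GPUs are already joined into a single cluster (for example with Ray) before the engine starts:

```python
from vllm import LLM

# Single node, 4 GPUs: tensor parallelism only.
llm_single_node = LLM(
    model="my-org/my-mid-size-model",  # placeholder
    tensor_parallel_size=4,
)

# 2 nodes x 8 GPUs: tensor parallelism inside each node,
# pipeline parallelism across the 2 nodes.
llm_multi_node = LLM(
    model="my-org/my-very-large-model",  # placeholder
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
)
```

The same settings are available as --tensor-parallel-size and --pipeline-parallel-size when launching an OpenAI-compatible server with vllm serve.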
Useful links
- Distributed Inference with vLLM - GPU Parallelism Techniques: https://developers.redhat.com/articles/2025/02/06/distributed-inference-with-vllm#gpu_parallelism_techniques_in_vllm
- How We Optimized vLLM DeepSeek R1 - Open Infra Week Contributions: https://developers.redhat.com/articles/2025/03/19/how-we-optimized-vllm-deepseek-r1#open_infra_week_contributions
- GPU Partitioning Guide: https://github.com/rh-aiservices-bu/gpu-partitioning-guide