Distributed Serving

Distributed Inference with vLLM

Serving large models often leads to memory bottlenecks, such as the dreaded CUDA out of memory error. To tackle this, there are two main solutions:

  • Reduce precision: FP8 and lower-bit quantization methods can substantially reduce memory usage. However, quantization may affect accuracy, and it is not sufficient on its own as models grow beyond hundreds of billions of parameters.

  • Distributed inference: Spreading model computation across multiple GPUs or nodes removes the single-device memory ceiling and improves throughput. This is where parallelism strategies such as tensor parallelism and pipeline parallelism come into play (see the sketch after this list).
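
As a rough sketch of the first option, vLLM can load a model with FP8 quantization through a single constructor argument; the model name below is only a placeholder, and FP8 support depends on your hardware and vLLM version.

from vllm import LLM

# Option 1: reduce precision. FP8 weights roughly halve memory use compared to
# FP16/BF16, at a possible cost in accuracy. The model name is a placeholder.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

# Option 2, distributed inference, is configured through the parallelism
# arguments covered in the rest of this section.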

vLLM Architecture and Large Language Model Inference Challenges

LLM inference poses unique challenges compared to training:

  • Unlike training, which focuses purely on throughput with known static shapes, inference requires low latency and dynamic workload handling.

  • Inference workloads must efficiently manage KV caches, speculative decoding, and prefill-to-decode transitions.

  • Large models often exceed single-GPU capacity, requiring advanced parallelization strategies.

To address these issues, vLLM provides:

  • Tensor parallelism to shard each model layer across multiple GPUs within a node.

  • Pipeline parallelism to distribute contiguous sections of model layers across multiple nodes.

  • Data parallelism to distribute requests across multiple GPUs, with each GPU holding a full copy of the model and processing a different portion of the workload concurrently.

  • Expert parallelism to place individual experts of a mixture-of-experts model on dedicated GPUs, improving utilization and avoiding redundancy. For the attention layers, batched sequences are split across GPUs rather than replicated, which avoids duplicating the KV cache and improves memory efficiency. (A conceptual sketch of these partitioning schemes follows this list.)
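
To make the distinction concrete, the following is a conceptual sketch of how the first two strategies partition a model. It illustrates the idea only, not vLLM's internals; the layer count and sizes are made up.

import numpy as np

# Tensor parallelism: split the weight matrix of a single layer column-wise
# across 4 GPUs, so each GPU stores one shard and computes part of the output.
hidden = 4096
W = np.zeros((hidden, 4 * hidden), dtype=np.float16)  # one feed-forward weight
tp_shards = np.split(W, 4, axis=1)                    # 4 shards of shape (4096, 4096)

# Pipeline parallelism: keep layers whole, but assign contiguous groups of
# layers to different nodes; activations flow from one stage to the next.
layers = [f"layer_{i}" for i in range(32)]
pp_stages = [layers[:16], layers[16:]]                # 2 stages, one per node

# Data parallelism instead replicates the whole model and routes different
# requests to different replicas; expert parallelism places different
# mixture-of-experts experts on different GPUs.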

Choosing the Right Distributed Inference Strategy

Before implementing distributed inference, it’s essential to determine when to use it and which strategies are most appropriate. The decision depends on your model size, hardware resources, and performance requirements.

Strategy Selection Guidelines

Single GPU (No Distributed Inference)

If your model fits comfortably in a single GPU’s memory, distributed inference is unnecessary. This is the simplest deployment option with minimal overhead.
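
For reference, single-GPU serving with vLLM's offline API needs no parallelism arguments at all. The model name, prompt, and sampling settings below are placeholders.

from vllm import LLM, SamplingParams

# Single GPU: the model fits in one device, so no parallelism arguments are set.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], sampling_params)
print(outputs[0].outputs[0].text)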

Single-Node Multi-GPU (Tensor Parallelism)

When your model exceeds single GPU capacity but fits within a single node with multiple GPUs, use tensor parallelism. Set the tensor parallel size equal to the number of available GPUs.

Example Configuration
# 4 GPUs in a single node
tensor_parallel_size = 4
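
A runnable version of this configuration, assuming a node with 4 GPUs and a placeholder model name, could look like the snippet below; the equivalent flag for the vllm serve command line is --tensor-parallel-size 4.

from vllm import LLM

# Shard every layer across the node's 4 GPUs via tensor parallelism.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=4,
)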

Multi-Node Multi-GPU (Tensor + Pipeline Parallelism)

For models that exceed single-node capacity, combine tensor parallelism with pipeline parallelism. A common starting point is to set the tensor parallel size to the number of GPUs per node and the pipeline parallel size to the number of nodes.

Example Configuration
# 16 GPUs across 2 nodes (8 GPUs per node)
tensor_parallel_size = 8
pipeline_parallel_size = 2
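
A runnable sketch of this setup, assuming a placeholder model name and a Ray cluster already spanning both nodes, is shown below; the distributed_executor_backend argument and the corresponding vllm serve flags (--tensor-parallel-size 8 --pipeline-parallel-size 2) should be verified against your vLLM version.

from vllm import LLM

# 16 GPUs across 2 nodes: shard each layer over the 8 GPUs within a node and
# split the layer stack into 2 pipeline stages, one per node.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # placeholder model
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
    distributed_executor_backend="ray",  # multi-node execution typically uses Ray
)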