Pipeline Parallelism

Overview

When a model grows so large that it exceeds the combined memory capacity of the GPUs within a single node, pipeline parallelism provides a solution for distributed inference across multiple nodes. This technique partitions the model into sequential stages, with each stage comprising a contiguous block of layers processed by a different GPU or node.

Pipeline parallelism differs from tensor parallelism in that it splits the model vertically (across layers) rather than horizontally (across tensor dimensions). Each GPU or node in the pipeline processes a complete subset of the model’s layers, making it particularly effective for extremely large models like DeepSeek R1 or Llama 3.1 405B that cannot fit on a single node.

How Pipeline Parallelism Works

The core principle involves sequential processing through model stages:

  • Layer partitioning: The model is divided into contiguous blocks of layers, with each stage assigned to a different GPU or node

  • Sequential processing: Input flows through stages sequentially, with each stage processing its assigned layers

  • Activation passing: Intermediate activations are transmitted between stages as computation progresses

  • Minimal communication: Data transfer occurs only between adjacent pipeline stages, reducing overall communication overhead

For example, in a 4-stage pipeline over a 32-layer model, Stage 1 processes layers 1-8, Stage 2 processes layers 9-16, and so on. Each stage completes its computation before passing its activations to the next stage in the pipeline.
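As a minimal sketch (not vLLM's internal code), contiguous layer partitioning can be expressed as follows; the 32-layer, 4-stage split mirrors the example above:

# Split num_layers into num_stages contiguous blocks (illustrative only).
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    base, rem = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < rem else 0)  # spread any remainder over the first stages
        stages.append(range(start, start + size))
        start += size
    return stages

print(partition_layers(32, 4))
# [range(0, 8), range(8, 16), range(16, 24), range(24, 32)]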

Pipeline Parallelism in vLLM

vLLM implements pipeline parallelism with advanced scheduling optimizations to address the inherent challenges of sequential processing. Unlike tensor parallelism, which can reduce latency through parallel computation, pipeline parallelism focuses on enabling deployment of models that would otherwise be impossible to serve.

Figure 1. Pipeline parallel approach with sequential stage processing

Implementation Techniques

Pipeline parallelism in vLLM uses sophisticated scheduling strategies to maximize throughput:

Micro-batch Processing

Instead of processing one request at a time, vLLM processes multiple micro-batches simultaneously across different pipeline stages. This keeps all GPUs active and improves overall throughput.
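The effect can be visualized with a toy GPipe-style schedule (unit time per stage, communication ignored); this is an illustration of the idea, not vLLM's actual scheduler:

# At time step t, stage s works on micro-batch t - s (if it exists yet).
num_stages, num_microbatches = 4, 8
for t in range(num_stages + num_microbatches - 1):
    slots = []
    for s in range(num_stages):
        m = t - s
        slots.append(f"mb{m}" if 0 <= m < num_microbatches else "idle")
    print(f"t={t:2d}  " + "  ".join(slots))
# Only the ramp-up and ramp-down steps contain "idle" slots (pipeline bubbles);
# with more micro-batches in flight, every stage stays busy most of the time.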

Asynchronous Execution

Pipeline stages operate asynchronously, allowing multiple requests to be processed concurrently at different stages. This reduces pipeline bubbles and idle time.

Dynamic Load Balancing

vLLM automatically adjusts the number of layers per stage based on computational complexity and memory requirements to optimize performance across the pipeline.
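A simplified illustration of the idea (this greedy split is not vLLM's actual algorithm, and the per-layer costs are made up):

# Assign contiguous layers so each stage's estimated cost is roughly equal.
def balance_stages(layer_costs: list[float], num_stages: int) -> list[list[int]]:
    target = sum(layer_costs) / num_stages
    stages, current, acc = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        acc += cost
        if acc >= target and len(stages) < num_stages - 1:
            stages.append(current)      # close this stage once it reaches its share
            current, acc = [], 0.0
    stages.append(current)
    return stages

# Layers 0-3 are twice as expensive as layers 4-11 in this made-up profile.
print(balance_stages([2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1], 2))
# [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9, 10, 11]]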

Figure 2. Pipeline scheduling with micro-batch processing

Performance Characteristics

Pipeline parallelism provides different performance trade-offs compared to tensor parallelism:

Memory Efficiency
  • Reduced per-GPU memory: Each GPU only stores a subset of model layers

  • Scalability: Enables serving models that exceed single-node memory capacity

  • Linear scaling: Aggregate memory capacity grows linearly with the number of pipeline stages, so the weight memory each stage must hold shrinks roughly in proportion (see the sketch below)
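A rough back-of-the-envelope example (weights only, FP16, ignoring KV cache and activation memory; the parameter count is Llama 3.1 405B's):

# Approximate per-stage weight memory when a 405B-parameter model is split into 8 stages.
params = 405e9
bytes_per_param = 2                      # FP16 / BF16
pipeline_stages = 8
total_gb = params * bytes_per_param / 1e9
per_stage_gb = total_gb / pipeline_stages
print(f"total ≈ {total_gb:.0f} GB, per stage ≈ {per_stage_gb:.0f} GB")
# total ≈ 810 GB, per stage ≈ 101 GB (KV cache and activations add to this)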

Throughput Optimization
  • Batch processing: Higher throughput achieved through micro-batch scheduling

  • Pipeline utilization: Advanced scheduling keeps all stages busy

  • Communication efficiency: Lower communication overhead compared to tensor parallelism

Latency Considerations
  • Sequential processing: Inherently higher latency than tensor parallelism due to sequential stage execution

  • Pipeline depth: Deeper pipelines increase latency but enable larger models

  • Micro-batch trade-offs: Smaller micro-batches reduce latency but may decrease throughput (the sketch below quantifies idle time as a function of micro-batch count)
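In an idealized GPipe-style schedule with uniform stage times, the idle ("bubble") fraction is (p − 1) / (m + p − 1) for p stages and m micro-batches; a quick sketch:

# Idle fraction of an idealized pipeline with p stages and m micro-batches.
def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (m + p - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:2d} micro-batches: {bubble_fraction(4, m):.0%} idle")
# 75%, 43%, 16%, and 4% idle, respectively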

Hardware Requirements

Pipeline parallelism has different networking requirements compared to tensor parallelism:

Inter-node Communication
  • Lower bandwidth requirements: Only adjacent stages communicate, reducing network pressure

  • Ethernet compatibility: Can work effectively with standard Ethernet connections

  • Reduced communication frequency: Activations cross each stage boundary only once per micro-batch, rather than requiring collectives inside every layer as tensor parallelism does (a rough estimate follows this list)
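To see why the pressure on the network is lower, consider the data that crosses a single stage boundary: roughly one hidden-state tensor per token, sent point-to-point, rather than the per-layer collectives tensor parallelism performs. A rough estimate with illustrative shapes (hidden size chosen to resemble a 405B-class model):

# Activation volume crossing one pipeline-stage boundary per forward pass.
batch, seq_len, hidden = 8, 4096, 16384  # illustrative shapes
bytes_per_value = 2                      # FP16 / BF16 activations
transfer_mb = batch * seq_len * hidden * bytes_per_value / 1e6
print(f"≈ {transfer_mb:.0f} MB per stage boundary per forward pass")
# ≈ 1074 MB, sent once per boundary, versus repeated all-reduces inside every layer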

Node Configuration
  • Homogeneous nodes: Each pipeline stage should have similar computational capacity

  • Memory distribution: Memory requirements are distributed across nodes

  • Storage considerations: Model weights are distributed, reducing per-node storage needs

When to Use Pipeline Parallelism

Pipeline parallelism is most effective when:

  • Model size exceeds node capacity: Required for models that cannot fit on a single node even with tensor parallelism

  • Multi-node deployment: Available infrastructure spans multiple nodes

  • Throughput priority: Batch processing throughput is more important than individual request latency

  • Limited high-bandwidth interconnects: When NVLink or InfiniBand are not available between nodes

Combining Pipeline and Tensor Parallelism

For optimal performance with extremely large models, vLLM supports combining both parallelism strategies:

  • Tensor parallelism within nodes: Split each layer’s weights across the GPUs within a node

  • Pipeline parallelism across nodes: Distribute layer groups across multiple nodes

  • Hybrid scaling: Achieve maximum model capacity and performance through combined approaches

Example Configuration
# 8 nodes, 8 GPUs per node, 64 total GPUs
tensor_parallel_size = 8    # GPUs per node
pipeline_parallel_size = 8  # Number of nodes
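These settings map directly onto vLLM’s offline and serving entry points; the model name below is a placeholder:

from vllm import LLM

# 8-way tensor parallelism inside each node, 8 pipeline stages across nodes.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # placeholder model
    tensor_parallel_size=8,
    pipeline_parallel_size=8,
)

# The equivalent online-serving invocation would be:
#   vllm serve meta-llama/Llama-3.1-405B-Instruct \
#       --tensor-parallel-size 8 --pipeline-parallel-size 8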

Distributed Runtime Management

vLLM supports distributed tensor-parallel and pipeline-parallel inference through flexible runtime management systems. The framework implements Megatron-LM’s tensor parallel algorithm and provides multiple execution backends to handle distributed coordination.

Execution Backends

vLLM manages distributed runtime using two primary execution backends:

Multiprocessing Backend

Python’s native multiprocessing is used for single-node deployments when sufficient GPUs are available on the same node. This backend provides:

  • Lower overhead for single-node scenarios

  • No external dependencies beyond Python standard library

  • Automatic selection when conditions are met

  • Optimal for tensor parallelism within a single node
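A single-node configuration that falls under these conditions would typically look like the following sketch; the backend is spelled out here only for illustration, since it would be chosen automatically:

from vllm import LLM

# Single node, 4 GPUs: tensor parallelism only, multiprocessing executor.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=4,
    distributed_executor_backend="mp",  # normally selected automatically
)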

Ray Backend

Ray is used for multi-node inference and complex distributed scenarios. This backend enables:

  • Multi-node multi-GPU deployments

  • Advanced scheduling and resource management

  • Fault tolerance and dynamic scaling

  • Required for pipeline parallelism across nodes
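Assuming a Ray cluster has already been started across the nodes (for example with ray start on the head and worker machines), a two-node deployment might look like this sketch:

from vllm import LLM

# Two nodes × 8 GPUs: tensor parallelism within each node, pipeline stages across nodes.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # placeholder model
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
    distributed_executor_backend="ray",  # required for multi-node execution
)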

Backend Selection Logic

vLLM automatically selects the appropriate backend based on deployment configuration:

Default Behavior
  • Multiprocessing: Used when not running in a Ray placement group and sufficient GPUs are available on the same node for the configured tensor_parallel_size

  • Ray: Used for multi-node deployments or when multiprocessing conditions are not met

Manual Override

The default backend selection can be overridden using:

  • LLM class: distributed_executor_backend argument

  • API server: --distributed-executor-backend argument

  • Values: mp for multiprocessing, ray for Ray
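A brief sketch of both override paths (model name is a placeholder):

from vllm import LLM

# Offline inference: force a specific executor via the constructor argument.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4,
          distributed_executor_backend="ray")

# OpenAI-compatible server: the corresponding command-line flag would be
#   vllm serve facebook/opt-13b --tensor-parallel-size 4 \
#       --distributed-executor-backend ray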