# llm-d Overview

## What It Does
llm-d transforms traditional single-server LLM deployments into distributed, Kubernetes-native inference systems. The core innovation is disaggregating the inference process into specialized workloads that can be optimized independently:
- Prefill pods: Handle prompt processing (compute-intensive)
- Decode pods: Generate tokens one by one (memory-intensive)
This separation allows each phase to use hardware resources optimally while enabling sophisticated request routing based on cache hits, load patterns, and session affinity.
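To make the split concrete, here is a minimal sketch of prefill and decode running as two separate Kubernetes Deployments. The names, labels, image, and resource shapes are illustrative assumptions; the actual resources are generated by llm-d's Helm charts and will differ.

```yaml
# Illustrative only: two hypothetical Deployments, one per inference phase.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-prefill                      # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels: {app: llama, role: prefill}
  template:
    metadata:
      labels: {app: llama, role: prefill}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1            # prefill: compute-heavy
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-decode                       # hypothetical name
spec:
  replicas: 4                              # decode tier sized for KV-cache capacity
  selector:
    matchLabels: {app: llama, role: decode}
  template:
    metadata:
      labels: {app: llama, role: decode}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1            # decode: memory/KV-cache-heavy
```

Because the two roles scale independently, the decode tier can be sized for KV-cache capacity while the prefill tier is sized for raw compute.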
## Architecture
llm-d builds on proven open-source technologies while adding advanced distributed inference capabilities. The system integrates seamlessly with existing Kubernetes infrastructure and extends vLLM’s high-performance inference engine with cluster-scale orchestration:
- Kubernetes-native: Deploys via Helm charts (a hypothetical values sketch follows this list)
- vLLM backend: Uses the proven vLLM inference engine
- Smart routing: Routes requests based on KV cache hits, load, and prefixes
- Distributed caching: Shares computed prefixes across instances
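As a rough idea of what the Helm-driven configuration could look like, below is a hedged sketch of a values file. Every key shown is a hypothetical placeholder for illustration, not the actual llm-d chart schema:

```yaml
# Hypothetical values.yaml sketch: key names are illustrative only.
model:
  id: meta-llama/Llama-3.1-8B-Instruct   # any HuggingFace model id
prefill:
  replicas: 2            # compute-heavy prompt processing
decode:
  replicas: 4            # memory-heavy token generation
routing:
  kvCacheAware: true     # prefer workers that already hold the prompt's prefix
cache:
  sharePrefixesAcrossInstances: true
```

A file like this would then be passed to `helm install -f values.yaml` against the project's charts.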
## Technical Requirements

### Infrastructure
llm-d requires a robust Kubernetes environment with specialized GPU and networking capabilities. The distributed nature of the system demands high-performance interconnects and sufficient GPU memory for model sharding:
- Kubernetes cluster with NVIDIA GPU nodes
- Scale-optimized deployment (designed for 10k-20k daily prompts; no published benchmarks available)
- High-speed networking (InfiniBand/RDMA preferred)
- Persistent storage for model weights (a minimal volume claim is sketched after this list)
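For the storage requirement, a shared read-many volume keeps one copy of the model weights available to every prefill and decode pod. A minimal sketch of such a claim follows; the name, size, and storage class are assumptions to adjust for your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights              # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                # shared across prefill and decode pods
  resources:
    requests:
      storage: 200Gi               # depends on the model being served
  storageClassName: fast-shared    # placeholder; use an RWX-capable class
```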
### Prerequisites
Before deployment, several tools and configurations must be in place. The system integrates with HuggingFace for model access and requires various CLI utilities for installation automation:
- Kubernetes admin access
- GPU operators (NVIDIA/AMD)
- CLI tools: kubectl, helm, yq, jq
- HuggingFace tokens for model access (typically mounted as a Kubernetes Secret; see the sketch after this list)
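The HuggingFace token is typically provided to the cluster as a Kubernetes Secret. A minimal sketch, assuming you choose the secret and key names yourself (the names the charts actually expect may differ):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token                         # hypothetical name
type: Opaque
stringData:
  HF_TOKEN: <your-huggingface-token>     # never commit a real token
```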
## Key Features
llm-d implements several advanced optimizations that differentiate it from traditional single-server inference deployments. These features work together to maximize throughput while minimizing resource waste:
- vLLM-Optimized Inference Scheduler: KV-cache-aware routing and load balancing via the Endpoint Picker Protocol (EPP)
- Disaggregated prefix caching: Reuses computed prefixes across requests and across nodes
- Session routing: Routes follow-up requests to the same decode worker
- Monitoring: Built-in Prometheus/Grafana dashboards (a generic scrape-configuration sketch follows this list)
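If the cluster runs the Prometheus Operator, metrics scraping is commonly wired up with a ServiceMonitor. The sketch below is generic: the label selector and metrics port name are assumptions, not documented llm-d values:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-d-metrics                    # hypothetical name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: llm-d   # assumed label; adjust to your install
  endpoints:
    - port: metrics                      # assumed metrics port name
      interval: 30s
```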