# llm-d Overview

## What It Does
llm-d transforms traditional single-server LLM deployments into distributed, Kubernetes-native inference systems. The core innovation is disaggregating the inference process into specialized workloads that can be optimized independently:
- Prefill pods: Handle prompt processing (compute-intensive)
- Decode pods: Generate tokens one by one (memory-intensive)
This separation allows each phase to use hardware resources optimally while enabling sophisticated request routing based on cache hits, load patterns, and session affinity.
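To make the split concrete, here is a minimal sketch of prefill and decode running as two separate Kubernetes Deployments. The names, labels, image, and resource shapes are illustrative assumptions; the actual resources are generated by llm-d's Helm charts and will differ.

```yaml
# Illustrative only: two hypothetical Deployments, one per inference phase.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-prefill                      # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels: {app: llama, role: prefill}
  template:
    metadata:
      labels: {app: llama, role: prefill}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1            # prefill: compute-heavy
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-decode                       # hypothetical name
spec:
  replicas: 4                              # decode tier sized for KV-cache capacity
  selector:
    matchLabels: {app: llama, role: decode}
  template:
    metadata:
      labels: {app: llama, role: decode}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1            # decode: memory/KV-cache-heavy
```

Because the two roles scale independently, the decode tier can be sized for KV-cache capacity while the prefill tier is sized for raw compute.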
## Architecture
llm-d builds on proven open-source technologies while adding advanced distributed inference capabilities. The system integrates seamlessly with existing Kubernetes infrastructure and extends vLLM’s high-performance inference engine with cluster-scale orchestration:
- Kubernetes-native: Deploys via Helm charts (a hypothetical values sketch follows this list)
- vLLM backend: Uses the proven vLLM inference engine
- Smart routing: Routes requests based on KV cache hits, load, and prefixes
- Distributed caching: Shares computed prefixes across instances
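As a rough idea of what the Helm-driven configuration could look like, below is a hedged sketch of a values file. Every key shown is a hypothetical placeholder for illustration, not the actual llm-d chart schema:

```yaml
# Hypothetical values.yaml sketch: key names are illustrative only.
model:
  id: meta-llama/Llama-3.1-8B-Instruct   # any HuggingFace model id
prefill:
  replicas: 2            # compute-heavy prompt processing
decode:
  replicas: 4            # memory-heavy token generation
routing:
  kvCacheAware: true     # prefer workers that already hold the prompt's prefix
cache:
  sharePrefixesAcrossInstances: true
```

A file like this would then be passed to `helm install -f values.yaml` against the project's charts.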
## Technical Requirements

### Infrastructure
llm-d requires a robust Kubernetes environment with specialized GPU and networking capabilities. The distributed nature of the system demands high-performance interconnects and sufficient GPU memory for model sharding:
- Kubernetes cluster with NVIDIA GPU nodes
- Scale-optimized deployment (designed for 10k-20k daily prompts; no published benchmarks available)
- High-speed networking (InfiniBand/RDMA preferred)
- Persistent storage for model weights (a minimal volume claim is sketched after this list)
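For the storage requirement, a shared read-many volume keeps one copy of the model weights available to every prefill and decode pod. A minimal sketch of such a claim follows; the name, size, and storage class are assumptions to adjust for your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights              # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                # shared across prefill and decode pods
  resources:
    requests:
      storage: 200Gi               # depends on the model being served
  storageClassName: fast-shared    # placeholder; use an RWX-capable class
```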
### Prerequisites
Before deployment, several tools and configurations must be in place. The system integrates with HuggingFace for model access and requires various CLI utilities for installation automation:
- Kubernetes admin access
- GPU operators (NVIDIA/AMD)
- CLI tools: kubectl, helm, yq, jq
- HuggingFace tokens for model access (typically mounted as a Kubernetes Secret; see the sketch after this list)
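The HuggingFace token is typically provided to the cluster as a Kubernetes Secret. A minimal sketch, assuming you choose the secret and key names yourself (the names the charts actually expect may differ):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token                         # hypothetical name
type: Opaque
stringData:
  HF_TOKEN: <your-huggingface-token>     # never commit a real token
```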
## Key Features
llm-d implements several advanced optimizations that differentiate it from traditional single-server inference deployments. These features work together to maximize throughput while minimizing resource waste:
- vLLM-Optimized Inference Scheduler: KV-cache-aware routing and load balancing via the Endpoint Picker Protocol (EPP)
- Disaggregated prefix caching: Reuses computed prefixes across requests and across nodes
- Session routing: Routes follow-up requests to the same decode worker
- Monitoring: Built-in Prometheus/Grafana dashboards (a generic scrape-configuration sketch follows this list)
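If the cluster runs the Prometheus Operator, metrics scraping is commonly wired up with a ServiceMonitor. The sketch below is generic: the label selector and metrics port name are assumptions, not documented llm-d values:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-d-metrics                    # hypothetical name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: llm-d   # assumed label; adjust to your install
  endpoints:
    - port: metrics                      # assumed metrics port name
      interval: 30s
```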