The KV Cache Dilemma of Multi-Replica LLMs — Spreading KV Cache Cluster-Wide with LMCache + llm-d
Have you ever enabled vLLM's Automatic Prefix Caching only to find that responses actually got slower after scaling out? I initially thought it was a configuration issue, but it turned out to be a structural problem. When a load balancer distributes requests randomly, each Pod manages its KV cache in isolation, and cache locality completely breaks down. The paradox is that the more Pods you add, the more your cache hit rate converges toward 1/N.
This article walks step by step through a distributed Prefix Caching architecture that externalizes KV cache to the cluster level by combining LMCache and llm-d, and dramatically reduces TTFT (Time To First Token) through cache-aware routing. The explanation flows from concepts to hands-on configuration, so you should be able to apply it directly to everything from RAG pipelines to multi-tenant SaaS.
One prerequisite: this article is aimed at readers who are already operating Kubernetes and vLLM, or who have foundational knowledge of them. If LLM serving infrastructure is new to you, it's worth reviewing the official vLLM documentation first.
Core Concepts
Why KV Cache Gets Trapped in a Single Instance
When a Transformer model runs inference, the attention layer computes Key and Value matrices for each token. Recomputing these every time is enormously wasteful. That's why KV caching was introduced — when the same prefix (the leading portion of a prompt) repeats, the stored matrices are reused.
vLLM manages this KV cache with PagedAttention. It slices GPU memory into page-sized units like virtual memory, computes fixed prefixes (system prompts, RAG documents, etc.) just once, and retrieves them page by page for subsequent requests. This works extremely well on a single Pod.
The problem starts the moment you spin up multiple Pods.
```
[Request A] ──▶ [Pod 1] (cache hit ✓)
[Request B] ──▶ [Pod 2] (cache miss ✗, recompute)
[Request C] ──▶ [Pod 1] (cache hit ✓)
[Request D] ──▶ [Pod 3] (cache miss ✗, recompute)
```

When a request carrying the same system prompt is routed to a different Pod, that Pod has no cache and computes from scratch. The more Pods you add, the closer the cache hit rate converges toward 1/N.
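The 1/N figure is easy to verify with a toy simulation (a self-contained sketch, not tied to vLLM): each Pod keeps a private cache, requests are routed uniformly at random, and every distinct prefix is requested twice.

```python
# Toy model of why random routing erodes prefix-cache hit rate.
# Each of R distinct prefixes is requested twice; every Pod keeps its own
# private cache. The second request hits only if it happens to land on
# the Pod that served the first one — probability 1/N.
import random

def simulated_hit_rate(num_pods: int, num_prefixes: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    caches = [set() for _ in range(num_pods)]
    hits = total = 0
    for prefix in range(num_prefixes):
        for _ in range(2):  # each prefix is requested twice
            pod = rng.randrange(num_pods)  # random load balancing
            if prefix in caches[pod]:
                hits += 1
            caches[pod].add(prefix)
        total += 1  # only the second request can possibly hit
    return hits / total

print(simulated_hit_rate(1, 10_000))  # single Pod: every repeat hits (1.0)
print(simulated_hit_rate(4, 10_000))  # four Pods: converges toward 1/N (~0.25)
```

With one Pod, every repeated prefix hits; with four, only the repeats that land on the Pod that served the first request do.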
The Three Pillars of Distributed Prefix Caching
Cluster-level KV caching operates through three interlocking mechanisms.
| Pillar | Role | Component |
|---|---|---|
| KV Cache Externalization | Offloads KV blocks outside GPU HBM (to CPU memory, SSD, remote storage) | LMCache |
| Cache-Aware Routing | Inspects prefix hashes and forwards requests to the Pod holding the corresponding KV blocks | llm-d EPP |
| Prefill/Decode Disaggregation | Runs prompt processing and token generation on separate GPUs, transferring KV over the network | LMCache + llm-d |
TTFT (Time To First Token): The time from sending a request until the first token arrives. It is most directly affected by cache hit rate.
LMCache — A KV Cache Layer Outside the GPU
LMCache is a layer that moves the KV cache of vLLM and SGLang engines outside the GPU. It was incorporated into the official PyTorch ecosystem in 2025 and currently supports eight storage backends.
```
GPU HBM (L1)
  │ on cache miss
  ▼
CPU DRAM (L2, managed by LMCache)
  │ on cache miss
  ▼
NVMe / Remote Storage (L3)
(NFS, S3, InfiniStore, Mooncake Store, Valkey, etc.)
```

```python
# vLLM + LMCache basic integration
# pip install lmcache vllm
from lmcache.config import LMCacheEngineConfig
from lmcache.integration.vllm import init_lmcache_engine

# Basic configuration using CPU memory as the L2 cache
config = LMCacheEngineConfig.from_dict({
    "chunk_size": 256,         # 1 chunk = 256 tokens = 1 KV block
    "local_device": "cpu",     # CPU DRAM as local cache
    "remote_url": "redis://cache-server:6379",  # Valkey/Redis shared cache
    "remote_serde": "cachegen",  # compressed serialization format
})
init_lmcache_engine(config)

# The vLLM engine is then used exactly as before;
# LMCache transparently intercepts KV cache management.
```

llm-d — Cache-Aware Routing on Kubernetes
llm-d is a project donated to CNCF Sandbox by IBM Research, Red Hat, and Google Cloud at KubeCon Europe in March 2026. It is jointly supported by AMD, NVIDIA, Hugging Face, and others, and is becoming the standard infrastructure for Kubernetes-based LLM serving.
The core is a cache-aware router called the EPP (Endpoint Picker).
```
[Client Request]
  │
  ▼
[llm-d Gateway (Kubernetes Gateway API)]
  │
  ▼
[EPP — Endpoint Picker]
  ├─ Computes prefix hash of the request
  ├─ Looks up KV block index (llm-d-kv-cache)
  └─ Routes to the Pod holding the matching block
  │
  ▼
[vLLM Pod (cache hit!)]
```

The EPP is not a simple load balancer. It tracks each Pod's KV block inventory in real time and preferentially assigns requests to the Pod whose prefix hash matches. It falls back to load-balancing criteria only when no matching cache exists.
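Conceptually, the prefix hash the EPP matches on is a chain of chunk hashes: each KV block's key folds in the previous block's key, so two prompts match exactly up to their longest shared chunk boundary. A minimal sketch (illustrative only; llm-d's actual key derivation may differ):

```python
# Hypothetical sketch of chained prefix-chunk hashing, the general scheme
# behind prefix caches: each chunk's key incorporates the previous chunk's
# hash, so a lookup matches only true prefixes, never mid-prompt overlaps.
import hashlib

CHUNK_SIZE = 256  # tokens per KV block

def chunk_keys(token_ids: list[int], chunk_size: int = CHUNK_SIZE) -> list[str]:
    keys, prev = [], ""
    # Only full chunks are hashed; a trailing partial chunk is not cached.
    for i in range(0, len(token_ids) - len(token_ids) % chunk_size, chunk_size):
        chunk = token_ids[i:i + chunk_size]
        h = hashlib.sha256((prev + ",".join(map(str, chunk))).encode()).hexdigest()
        keys.append(h)
        prev = h  # chain: next key depends on this one
    return keys

a = chunk_keys(list(range(1000)))             # 3 full chunks (768 tokens)
b = chunk_keys(list(range(512)) + [99] * 488)  # shares the first 512 tokens
# The two prompts share the first two chunks → first two keys match,
# the third diverges.
```

A router holding such keys per Pod can find the Pod with the longest matching key sequence in one index lookup per chunk.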
Hands-On Configuration
Now that the concepts are clear, let's move to actual configuration. The three examples below are connected to each other, so following them in order will reveal the full picture.
Example 1: Sharing Document KV Cache in a RAG Pipeline
Imagine a RAG service that places hundreds of pages of enterprise documents in the system prompt. Thousands of requests referencing the same document come in every day. The LMCache + shared storage combination is highly effective here.
```yaml
# lmcache-config.yaml
# All vLLM Pods point to the same remote cache
chunk_size: 512  # 1 chunk = 512 tokens = 1 KV block
local_device: "cpu"
max_local_cache_size: 20  # GB, CPU memory allocation

# NFS mount path used as shared KV cache
remote_url: "file:///mnt/nfs/lmcache"
remote_serde: "safetensor"
```

```python
# RAG server — document KV pre-warming script
# pip install lmcache vllm
import asyncio

from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams  # vLLM 0.6+

async def prewarm_document_kv(doc_text: str, engine: AsyncLLMEngine):
    """
    Processes a document once to populate the KV cache in external storage.
    Subsequent requests with the same prefix hit the cache without recomputation.
    """
    system_prompt = f"Please answer based on the following document:\n\n{doc_text}\n\nQuestion: "
    # A dummy request generates the prefix KV — LMCache saves the KV to
    # external storage as a side effect. max_tokens=1 minimizes the cost
    # of actual response generation.
    params = SamplingParams(max_tokens=1)
    async for _ in engine.generate(system_prompt, params, request_id="prewarm"):
        pass
    print(f"[KV Prewarm] {len(doc_text)} chars → cache saved successfully")

# For subsequent real requests, the same prefix is loaded instantly from LMCache.
```

| Component | Role |
|---|---|
| `chunk_size: 512` | Stores KV blocks in 512-token units; the longer the document, the wider the hit range |
| `remote_url: "file:///mnt/nfs/..."` | Shared storage mounted via NFS; all Pods access the same cache |
| `prewarm_document_kv()` | Pre-caches key documents at service startup to eliminate cold starts |
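Sizing `max_local_cache_size` is simple arithmetic: per token, the KV cache stores a Key and a Value vector for every layer and KV head. Using Llama-3.1-8B's published geometry (32 layers, 8 KV heads, head dimension 128, fp16):

```python
# Back-of-envelope KV cache footprint: 2 (K and V) × layers × kv_heads
# × head_dim × bytes_per_element, per token. Defaults below are
# Llama-3.1-8B's config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()  # 131072 bytes = 128 KiB per token
doc_tokens = 100_000              # roughly "hundreds of pages"
gib = per_token * doc_tokens / 2**30
print(f"{per_token} B/token → {gib:.1f} GiB for a {doc_tokens}-token document")
```

A hundred-thousand-token document therefore needs roughly 12 GiB of KV storage, which is why the example config reserves 20 GB of CPU DRAM.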
I was skeptical at first, but applying this to a service that was repeatedly recomputing hundreds of pages of terms-and-conditions documents made the difference unmistakable: I observed a 3–10× TTFT reduction for repeated queries against the same document.
Example 2: Configuring Cache-Aware Routing in a Kubernetes Cluster with llm-d
With the LMCache configuration above providing a shared cache, let's add the llm-d EPP for cache-aware routing.
```yaml
# llm-d-based InferencePool + EPP deployment example
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      app: vllm-worker
  extensionRef:
    name: llm-d-epp  # EPP performs cache-aware routing
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-d-epp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-d-epp
  template:
    metadata:
      labels:
        app: llm-d-epp
    spec:
      containers:
        - name: epp
          # In production, use a pinned version tag instead of latest
          image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.5.0
          env:
            - name: KV_CACHE_AWARE_ROUTING
              value: "true"
            - name: PREFIX_HASH_ALGO
              value: "sha256"
            - name: KVBLOCK_INDEX_ENDPOINT
              value: "http://kv-index-service:9090"
```

```yaml
# vLLM Worker Pod — with LMCache integration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-worker
spec:
  replicas: 4
  selector:
    matchLabels:
      app: vllm-worker
  template:
    metadata:
      labels:
        app: vllm-worker
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.0
          args:
            - "--model=meta-llama/Llama-3.1-8B-Instruct"
            - "--enable-prefix-caching"
            - "--kv-transfer-config"
            # kv_transfer_config is JSON-serialized and passed as a CLI argument
            - '{"kv_connector":"LMCacheConnector","kv_role":"kv_both"}'
          volumeMounts:
            - name: lmcache-config
              mountPath: /etc/lmcache
      volumes:
        - name: lmcache-config
          configMap:
            name: lmcache-cfg
```

In this configuration, when a request arrives, the EPP computes the prefix hash and selects the Worker Pod that holds KV blocks matching that hash. If no cache exists, it falls back to standard load balancing; after processing, LMCache saves the KV to shared storage so subsequent Pods can reuse it.
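The EPP's selection logic can be approximated as "longest prefix match first, load second". A simplified sketch (not llm-d's actual implementation; the Pod and hash names are invented):

```python
# Simplified sketch of cache-aware endpoint picking (not llm-d's code):
# prefer the Pod holding the longest contiguous run of the request's KV
# block hashes; fall back to the least-loaded Pod when nothing matches.
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    blocks: set[str] = field(default_factory=set)  # KV block hashes held
    active_requests: int = 0

def pick_endpoint(prefix_hashes: list[str], pods: list[Pod]) -> Pod:
    def matched(pod: Pod) -> int:
        n = 0
        for h in prefix_hashes:  # count contiguous leading matches only
            if h not in pod.blocks:
                break
            n += 1
        return n
    best = max(pods, key=matched)
    if matched(best) > 0:
        return best  # cache-aware choice
    return min(pods, key=lambda p: p.active_requests)  # load-based fallback

pods = [Pod("pod-1", {"h1", "h2"}, 2), Pod("pod-2", {"h1"}, 3),
        Pod("pod-3", set(), 1)]
print(pick_endpoint(["h1", "h2", "h3"], pods).name)  # pod-1 (2 matching blocks)
print(pick_endpoint(["hX"], pods).name)              # pod-3 (least loaded)
```

The real EPP additionally weighs queue depth and staleness of the block index, but the ordering of concerns is the same: cache locality first, load balancing as the fallback.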
Example 3: Optimizing GPU Utilization with Prefill/Decode Disaggregation
With cache-aware routing in place, long-context services can consider Prefill/Decode disaggregation.
Prefill (prompt processing) is a compute-bound operation that performs matrix arithmetic on thousands of tokens at once, while Decode (token generation) is a memory-bound operation that sequentially reads weights from HBM one token at a time. Because their bottleneck types are completely different, binding them to the same GPU causes Prefill to consume HBM bandwidth and delay Decode, while long Decode runs cause Prefill batches to pile up.
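The asymmetry is easy to quantify with arithmetic intensity (FLOPs per byte of weights read): a forward pass reads all weights once regardless of how many tokens it processes, so batching thousands of prefill tokens raises intensity by that factor. A rough model, assuming ~2 FLOPs per parameter per token and an 8B-parameter model in fp16:

```python
# Why prefill is compute-bound and decode memory-bound, in rough numbers.
# A forward pass reads every weight once no matter how many tokens it
# processes, so FLOPs-per-byte scales linearly with tokens per pass.
def arithmetic_intensity(tokens_per_pass, params=8e9, dtype_bytes=2):
    flops = 2 * params * tokens_per_pass  # ~2 FLOPs per parameter per token
    bytes_read = params * dtype_bytes     # weights streamed once per pass
    return flops / bytes_read

print(arithmetic_intensity(4096))  # prefill: ~4096 FLOPs/byte → compute-bound
print(arithmetic_intensity(1))     # decode: ~1 FLOP/byte → memory-bound
```

Modern GPUs need hundreds of FLOPs per byte to saturate their compute units, so single-token decode sits far below that threshold while batched prefill sits far above it; this is the gap that disaggregation exploits.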
```
[Client]
  │
  ▼
[llm-d EPP]
  │
  ├──▶ [Prefill Pod (A100 × 4)] ← Specialized for long prompt processing (compute-bound)
  │        │
  │        KV block batch transfer (LMCache)
  │        │
  └──▶ [Decode Pod (A100 × 2)] ← Specialized for token generation (memory-bound)
```

```python
# Example of Prefill/Decode role configuration.
# In practice, this JSON is passed as the --kv-transfer-config CLI
# argument when starting vLLM.
import os

os.environ["LMCACHE_USE_EXPERIMENTAL_FEATURES"] = "True"

# Prefill Pod configuration — generates KV and sends it to the Decode Pod
kv_transfer_config_prefill = {
    "kv_connector": "LMCacheConnector",
    "kv_role": "kv_producer",  # KV producer
    "kv_connector_port": 8100,
    "kv_connector_host": "0.0.0.0",
}

# Decode Pod configuration — receives KV and begins token generation
kv_transfer_config_decode = {
    "kv_connector": "LMCacheConnector",
    "kv_role": "kv_consumer",  # KV consumer
    "kv_connector_port": 8100,
    "kv_connector_host": "prefill-service",  # Prefill Pod service address
}

# vLLM launch example (Prefill Pod):
# vllm serve meta-llama/Llama-3.1-8B-Instruct \
#   --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_producer",...}'
```

Combining all three configurations produces this request flow:
```
[Client Request]
  │
  ▼
[llm-d EPP] ← Identifies the cache-holding Pod via prefix hash (Example 2)
  │
  ▼
[Prefill Pod] ← Processes the prompt and generates KV (Example 3)
  │
  │ LMCache → Saves KV to NFS/Valkey shared storage (Example 1)
  │
  ▼
[Decode Pod] ← Receives KV and begins token generation
  │
  ▼
[Client Response]
```

Benchmarks of llm-d's 16×16 B200 topology (16 Prefill nodes, 16 Decode nodes) show TTFT reduced by a single-digit multiple compared to round-robin routing. Not many organizations have clusters at this scale, but the principle applies equally at smaller sizes.
Pros and Cons Analysis
Advantages
| Item | Detail |
|---|---|
| Cost Reduction | Up to 1/10 the token processing cost on a cache hit (e.g., $3.00 → $0.30 / 1M tokens) |
| Reduced TTFT | Sub-400ms for cache-hit requests per official llm-d benchmarks, up to 85% reduction |
| Horizontal Scalability | New Pods immediately reuse the external cache upon joining; no warm-up needed |
| Fault Resilience | KV cache in external storage survives Pod restarts |
| Throughput Improvement | Up to 2× throughput on the same hardware per llm-d benchmarks |
Disadvantages and Caveats
| Item | Detail | Mitigation |
|---|---|---|
| Network Overhead | KV transfer costs can exceed cache savings | Evaluate high-speed storage such as InfiniStore/Mooncake (RDMA) before adopting |
| LRU Limitations | Simple LRU cannot predict near-future reuse for workloads like agentic workflows | Analyze workload patterns and tune cache policy accordingly |
| Memory Tier Complexity | Managing the GPU HBM → CPU DRAM → NVMe → remote storage hierarchy | Incremental adoption — enable CPU DRAM only first, then expand |
| Security Vulnerability | TTFT measurement can be used to infer other tenants' prompts (Cache Side-Channel) | Mandatory per-tenant cache isolation in multi-tenant environments |
| Operational Complexity | Additional components to manage: KVEvents streaming, block index synchronization, etc. | Set up monitoring dashboards and pre-configure alert thresholds |
| Cache Invalidation | Full cache invalidation required on model updates or LoRA switching | Add a cache flush hook to the deployment pipeline |
Cache Side-Channel Attack: In a multi-tenant environment, an attack that infers another user's prompt content by measuring response time differences caused by cache hits and misses. A detailed analysis is available in the CacheSolidarity paper.
RDMA (Remote Direct Memory Access): A technology that reads and writes memory directly between servers without involving the CPU. It reduces KV block transfer latency to the microsecond range.
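For the cache-invalidation row above, a deploy-pipeline flush hook can be as small as the sketch below. The NFS path matches Example 1; the `MODEL_TAG` marker file is a hypothetical convention of this sketch, not an LMCache feature.

```python
# Minimal sketch of a deploy-time KV cache flush hook for the NFS-backed
# setup from Example 1. Call it whenever the model weights or active LoRA
# change, since stale KV blocks would otherwise be silently reused.
import shutil
from pathlib import Path

CACHE_ROOT = Path("/mnt/nfs/lmcache")  # matches remote_url in Example 1

def flush_kv_cache(model_tag: str, cache_root: Path = CACHE_ROOT) -> None:
    """Remove all cached KV blocks and record which model the cache is for."""
    if cache_root.exists():
        shutil.rmtree(cache_root)          # drop every stale KV block
    cache_root.mkdir(parents=True)
    # Hypothetical marker file so operators can see what the cache matches.
    (cache_root / "MODEL_TAG").write_text(model_tag)

# Example: flush_kv_cache("llama-3.1-8b-instruct@v2") from the CD pipeline.
```

Wiring this into the deployment pipeline (e.g., a pre-rollout Job) guarantees the cache can never outlive the weights that produced it.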
The Most Common Mistakes in Practice
Some of these I've experienced firsthand; others are patterns I frequently observe in teams.
- Not measuring shared storage latency in advance — If NFS or S3 latency exceeds tens of milliseconds, a cache hit can actually be a net loss. It's worth measuring actual latency characteristics with tools like `fio` before adopting. The choice of storage can completely change the outcome.
- Applying the same `chunk_size` to all workloads — Small chunks (1 chunk = 128–256 tokens) are efficient for short conversational queries; large chunks (1 chunk = 512–1024 tokens) suit long RAG documents. Optimal values differ by workload, so measure and adjust as you go.
- Deploying in multi-tenant environments without cache isolation — If the tenant ID is not included in the prefix hash, caches are shared across tenants and the Cache Side-Channel vulnerability described above is introduced. It is strongly recommended to enable tenant isolation options in the EPP configuration.
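The third mistake has a correspondingly small fix: salt the prefix hash with the tenant ID, so identical prompts from different tenants occupy disjoint cache namespaces. A hypothetical sketch (llm-d's actual isolation options may work differently):

```python
# Hypothetical sketch of tenant-salted prefix hashing: folding the tenant
# ID into the hash input means two tenants sending byte-identical prompts
# can never share (or time) each other's cache entries.
import hashlib

def tenant_prefix_hash(tenant_id: str, prompt: str) -> str:
    # NUL separator prevents ("ab", "c") colliding with ("a", "bc")
    return hashlib.sha256(f"{tenant_id}\x00{prompt}".encode()).hexdigest()

h1 = tenant_prefix_hash("tenant-a", "same system prompt")
h2 = tenant_prefix_hash("tenant-b", "same system prompt")
assert h1 != h2  # identical prompts, isolated cache namespaces
```

The cost is losing cross-tenant sharing of genuinely public prefixes, so some setups salt only tenant-private segments of the prompt; that trade-off should be an explicit design decision rather than a default.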
Closing Thoughts
The paradox where cache hit rate drops as you add more Pods is not a configuration problem — it stems from a structure that traps the KV cache inside a single instance. By moving the KV cache outside the cluster with LMCache and attaching cache-aware routing with the llm-d EPP, the cost and latency profile of multi-replica LLM serving changes fundamentally.
If you're already running vLLM, the barrier to entry is lower than you might think. Before touching the cluster layer, you can already feel the effect by verifying CPU offload alone on a single node.
Three steps you can take right now:
- Experiment with LMCache standalone (verifiable in 30 minutes) — Install it with `pip install lmcache` in your existing vLLM environment, add only the `local_device: "cpu"` configuration, and measure the CPU offload hit rate on a single Pod first. The LMCache Quickstart is well-documented.
- Integrate shared storage (requires an NFS or Valkey environment) — If hit rates are meaningful, specify NFS or Valkey as the `remote_url` and expand to KV cache sharing across multiple Pods. You can verify the cache-sharing effect at this stage even without llm-d.
- Introduce the llm-d EPP (requires a Kubernetes cluster as a prerequisite) — Once cache-sharing benefits are confirmed, deploy the EPP using the Helm chart from the llm-d GitHub repo and add cache-aware routing to push hit rates even higher.
Next article: Inside llm-d's KVEvents streaming protocol — a code-level look in Go at how the block index stays synchronized in real time.
References
- KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d | llm-d Official Blog
- llm-d Official Architecture Docs | llm-d.ai
- LMCache GitHub | LMCache/LMCache
- llm-d-kv-cache GitHub | llm-d/llm-d-kv-cache
- LMCache Joins PyTorch Ecosystem | pytorch.org
- LMCache Paper | arXiv:2510.09665
- Welcome llm-d to the CNCF | CNCF Official Blog
- Donating llm-d to the CNCF | IBM Research
- llm-d officially a CNCF Sandbox project | Google Cloud Blog
- Master KV cache aware routing with llm-d | Red Hat Developer
- Introduction to distributed inference with llm-d | Red Hat Developer
- KV Caching with vLLM, LMCache, and Ceph | Ceph.io
- CacheSolidarity: Preventing Prefix Caching Side Channels | arXiv
- Disaggregated Prefill with LMCache | vLLM Official Docs
- Cluster-scale KV caching for 9.9× faster LLM inference | Crusoe MemoryAlloy