The KV Cache Dilemma of Multi-Replica LLMs — Spreading KV Cache Cluster-Wide with LMCache + llm-d
Have you ever enabled vLLM's Automatic Prefix Caching only to find that responses actually got slower after scaling out? I initially thought it was a configuration issue, but it turned out to be a structural problem. When a load balancer distributes requests randomly, each Pod manages its KV cache in isolation, and cache locality completely breaks down. The paradox is that the more Pods you add, the more your cache hit rate converges toward 1/N.
This article walks step by step through a distributed Prefix Caching architecture that externalizes KV cache to the cluster level by combining LMCache and llm-d, and dramatically reduces TTFT (Time To First Token) through cache-aware routing. The explanation flows from concepts to hands-on configuration, so you should be able to apply it directly to everything from RAG pipelines to multi-tenant SaaS.
One prerequisite: this article is aimed at readers who are already operating Kubernetes and vLLM, or who have foundational knowledge of them. If LLM serving infrastructure is new to you, it's worth reviewing the official vLLM documentation first.
Core Concepts
Why KV Cache Gets Trapped in a Single Instance
When a Transformer model runs inference, the attention layer computes Key and Value matrices for each token. Recomputing these every time is enormously wasteful. That's why KV caching was introduced — when the same prefix (the leading portion of a prompt) repeats, the stored matrices are reused.
vLLM manages this KV cache with PagedAttention. It slices GPU memory into page-sized units like virtual memory, computes fixed prefixes (system prompts, RAG documents, etc.) just once, and retrieves them page by page for subsequent requests. This works extremely well on a single Pod.
The problem starts the moment you spin up multiple Pods.
```
[Request A] ──▶ [Pod 1] (cache hit ✓)
[Request B] ──▶ [Pod 2] (cache miss ✗, recompute)
[Request C] ──▶ [Pod 1] (cache hit ✓)
[Request D] ──▶ [Pod 3] (cache miss ✗, recompute)
```

When a request carrying the same system prompt is routed to a different Pod, that Pod has no cache and computes from scratch. The more Pods you add, the closer the cache hit rate converges toward 1/N.
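The 1/N figure is easy to verify with a toy simulation (a self-contained sketch, not tied to vLLM): each Pod keeps a private cache, requests are routed uniformly at random, and every distinct prefix is requested twice.

```python
# Toy model of why random routing erodes prefix-cache hit rate.
# Each of R distinct prefixes is requested twice; every Pod keeps its own
# private cache. The second request hits only if it happens to land on
# the Pod that served the first one — probability 1/N.
import random

def simulated_hit_rate(num_pods: int, num_prefixes: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    caches = [set() for _ in range(num_pods)]
    hits = total = 0
    for prefix in range(num_prefixes):
        for _ in range(2):  # each prefix is requested twice
            pod = rng.randrange(num_pods)  # random load balancing
            if prefix in caches[pod]:
                hits += 1
            caches[pod].add(prefix)
        total += 1  # only the second request can possibly hit
    return hits / total

print(simulated_hit_rate(1, 10_000))  # single Pod: every repeat hits (1.0)
print(simulated_hit_rate(4, 10_000))  # four Pods: converges toward 1/N (~0.25)
```

With one Pod, every repeated prefix hits; with four, only the repeats that land on the Pod that served the first request do.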
The Three Pillars of Distributed Prefix Caching
Cluster-level KV caching operates through three interlocking mechanisms.
| Pillar | Role | Component |
|---|---|---|
| KV Cache Externalization | Offloads KV blocks outside GPU HBM (to CPU memory, SSD, remote storage) | LMCache |
| Cache-Aware Routing | Inspects prefix hashes and forwards requests to the Pod holding the corresponding KV blocks | llm-d EPP |
| Prefill/Decode Disaggregation | Runs prompt processing and token generation on separate GPUs, transferring KV over the network | LMCache + llm-d |
TTFT (Time To First Token): The time from sending a request until the first token arrives. It is most directly affected by cache hit rate.
LMCache — A KV Cache Layer Outside the GPU
LMCache is a layer that moves the KV cache of vLLM and SGLang engines outside the GPU. It was incorporated into the official PyTorch ecosystem in 2025 and currently supports eight storage backends.
```
GPU HBM (L1)
  │ on cache miss
  ▼
CPU DRAM (L2, managed by LMCache)
  │ on cache miss
  ▼
NVMe / Remote Storage (L3)
(NFS, S3, InfiniStore, Mooncake Store, Valkey, etc.)
```

```python
# vLLM + LMCache basic integration
# pip install lmcache vllm
from lmcache.config import LMCacheEngineConfig
from lmcache.integration.vllm import init_lmcache_engine

# Basic configuration using CPU memory as the L2 cache
config = LMCacheEngineConfig.from_dict({
    "chunk_size": 256,         # 1 chunk = 256 tokens = 1 KV block
    "local_device": "cpu",     # CPU DRAM as local cache
    "remote_url": "redis://cache-server:6379",  # Valkey/Redis shared cache
    "remote_serde": "cachegen",  # compressed serialization format
})
init_lmcache_engine(config)

# The vLLM engine is then used exactly as before;
# LMCache transparently intercepts KV cache management.
```

llm-d — Cache-Aware Routing on Kubernetes
llm-d is a project donated to CNCF Sandbox by IBM Research, Red Hat, and Google Cloud at KubeCon Europe in March 2026. It is jointly supported by AMD, NVIDIA, Hugging Face, and others, and is becoming the standard infrastructure for Kubernetes-based LLM serving.
The core is a cache-aware router called the EPP (Endpoint Picker).
```
[Client Request]
  │
  ▼
[llm-d Gateway (Kubernetes Gateway API)]
  │
  ▼
[EPP — Endpoint Picker]
  ├─ Computes prefix hash of the request
  ├─ Looks up KV block index (llm-d-kv-cache)
  └─ Routes to the Pod holding the matching block
  │
  ▼
[vLLM Pod (cache hit!)]
```

The EPP is not a simple load balancer. It tracks each Pod's KV block inventory in real time and preferentially assigns requests to the Pod whose prefix hash matches. It falls back to load-balancing criteria only when no matching cache exists.
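Conceptually, the prefix hash the EPP matches on is a chain of chunk hashes: each KV block's key folds in the previous block's key, so two prompts match exactly up to their longest shared chunk boundary. A minimal sketch (illustrative only; llm-d's actual key derivation may differ):

```python
# Hypothetical sketch of chained prefix-chunk hashing, the general scheme
# behind prefix caches: each chunk's key incorporates the previous chunk's
# hash, so a lookup matches only true prefixes, never mid-prompt overlaps.
import hashlib

CHUNK_SIZE = 256  # tokens per KV block

def chunk_keys(token_ids: list[int], chunk_size: int = CHUNK_SIZE) -> list[str]:
    keys, prev = [], ""
    # Only full chunks are hashed; a trailing partial chunk is not cached.
    for i in range(0, len(token_ids) - len(token_ids) % chunk_size, chunk_size):
        chunk = token_ids[i:i + chunk_size]
        h = hashlib.sha256((prev + ",".join(map(str, chunk))).encode()).hexdigest()
        keys.append(h)
        prev = h  # chain: next key depends on this one
    return keys

a = chunk_keys(list(range(1000)))             # 3 full chunks (768 tokens)
b = chunk_keys(list(range(512)) + [99] * 488)  # shares the first 512 tokens
# The two prompts share the first two chunks → first two keys match,
# the third diverges.
```

A router holding such keys per Pod can find the Pod with the longest matching key sequence in one index lookup per chunk.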
Hands-On Configuration
Now that the concepts are clear, let's move to actual configuration. The three examples below are connected to each other, so following them in order will reveal the full picture.
Example 1: Sharing Document KV Cache in a RAG Pipeline
Imagine a RAG service that places hundreds of pages of enterprise documents in the system prompt. Thousands of requests referencing the same document come in every day. The LMCache + shared storage combination is highly effective here.
```yaml
# lmcache-config.yaml
# All vLLM Pods point to the same remote cache
chunk_size: 512  # 1 chunk = 512 tokens = 1 KV block
local_device: "cpu"
max_local_cache_size: 20  # GB, CPU memory allocation

# NFS mount path used as shared KV cache
remote_url: "file:///mnt/nfs/lmcache"
remote_serde: "safetensor"
```

```python
# RAG server — document KV pre-warming script
# pip install lmcache vllm
import asyncio

from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams  # vLLM 0.6+

async def prewarm_document_kv(doc_text: str, engine: AsyncLLMEngine):
    """
    Processes a document once to populate the KV cache in external storage.
    Subsequent requests with the same prefix hit the cache without recomputation.
    """
    system_prompt = f"Please answer based on the following document:\n\n{doc_text}\n\nQuestion: "
    # A dummy request generates the prefix KV — LMCache saves the KV to
    # external storage as a side effect. max_tokens=1 minimizes the cost
    # of actual response generation.
    params = SamplingParams(max_tokens=1)
    async for _ in engine.generate(system_prompt, params, request_id="prewarm"):
        pass
    print(f"[KV Prewarm] {len(doc_text)} chars → cache saved successfully")

# For subsequent real requests, the same prefix is loaded instantly from LMCache.
```

| Component | Role |
|---|---|
| `chunk_size: 512` | Stores KV blocks in 512-token units; the longer the document, the wider the hit range |
| `remote_url: "file:///mnt/nfs/..."` | Shared storage mounted via NFS; all Pods access the same cache |
| `prewarm_document_kv()` | Pre-caches key documents at service startup to eliminate cold starts |
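Sizing `max_local_cache_size` is simple arithmetic: per token, the KV cache stores a Key and a Value vector for every layer and KV head. Using Llama-3.1-8B's published geometry (32 layers, 8 KV heads, head dimension 128, fp16):

```python
# Back-of-envelope KV cache footprint: 2 (K and V) × layers × kv_heads
# × head_dim × bytes_per_element, per token. Defaults below are
# Llama-3.1-8B's config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()  # 131072 bytes = 128 KiB per token
doc_tokens = 100_000              # roughly "hundreds of pages"
gib = per_token * doc_tokens / 2**30
print(f"{per_token} B/token → {gib:.1f} GiB for a {doc_tokens}-token document")
```

A hundred-thousand-token document therefore needs roughly 12 GiB of KV storage, which is why the example config reserves 20 GB of CPU DRAM.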
I was skeptical at first, but applying this to a service that was repeatedly recomputing hundreds of pages of terms-and-conditions documents made the difference unmistakable: I observed a 3–10× TTFT reduction for repeated queries against the same document.
Example 2: Configuring Cache-Aware Routing in a Kubernetes Cluster with llm-d
With the LMCache configuration above providing a shared cache, let's add the llm-d EPP for cache-aware routing.
```yaml
# llm-d-based InferencePool + EPP deployment example
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      app: vllm-worker
  extensionRef:
    name: llm-d-epp  # EPP performs cache-aware routing
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-d-epp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-d-epp
  template:
    metadata:
      labels:
        app: llm-d-epp
    spec:
      containers:
        - name: epp
          # In production, use a pinned version tag instead of latest
          image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.5.0
          env:
            - name: KV_CACHE_AWARE_ROUTING
              value: "true"
            - name: PREFIX_HASH_ALGO
              value: "sha256"
            - name: KVBLOCK_INDEX_ENDPOINT
              value: "http://kv-index-service:9090"
```

```yaml
# vLLM Worker Pod — with LMCache integration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-worker
spec:
  replicas: 4
  selector:
    matchLabels:
      app: vllm-worker
  template:
    metadata:
      labels:
        app: vllm-worker
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.0
          args:
            - "--model=meta-llama/Llama-3.1-8B-Instruct"
            - "--enable-prefix-caching"
            - "--kv-transfer-config"
            # kv_transfer_config is JSON-serialized and passed as a CLI argument
            - '{"kv_connector":"LMCacheConnector","kv_role":"kv_both"}'
          volumeMounts:
            - name: lmcache-config
              mountPath: /etc/lmcache
      volumes:
        - name: lmcache-config
          configMap:
            name: lmcache-cfg
```

In this configuration, when a request arrives, the EPP computes the prefix hash and selects the Worker Pod that holds KV blocks matching that hash. If no cache exists, it falls back to standard load balancing; after processing, LMCache saves the KV to shared storage so subsequent Pods can reuse it.
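The EPP's selection logic can be approximated as "longest prefix match first, load second". A simplified sketch (not llm-d's actual implementation; the Pod and hash names are invented):

```python
# Simplified sketch of cache-aware endpoint picking (not llm-d's code):
# prefer the Pod holding the longest contiguous run of the request's KV
# block hashes; fall back to the least-loaded Pod when nothing matches.
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    blocks: set[str] = field(default_factory=set)  # KV block hashes held
    active_requests: int = 0

def pick_endpoint(prefix_hashes: list[str], pods: list[Pod]) -> Pod:
    def matched(pod: Pod) -> int:
        n = 0
        for h in prefix_hashes:  # count contiguous leading matches only
            if h not in pod.blocks:
                break
            n += 1
        return n
    best = max(pods, key=matched)
    if matched(best) > 0:
        return best  # cache-aware choice
    return min(pods, key=lambda p: p.active_requests)  # load-based fallback

pods = [Pod("pod-1", {"h1", "h2"}, 2), Pod("pod-2", {"h1"}, 3),
        Pod("pod-3", set(), 1)]
print(pick_endpoint(["h1", "h2", "h3"], pods).name)  # pod-1 (2 matching blocks)
print(pick_endpoint(["hX"], pods).name)              # pod-3 (least loaded)
```

The real EPP additionally weighs queue depth and staleness of the block index, but the ordering of concerns is the same: cache locality first, load balancing as the fallback.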
Example 3: Optimizing GPU Utilization with Prefill/Decode Disaggregation
With cache-aware routing in place, long-context services can consider Prefill/Decode disaggregation.
Prefill (prompt processing) is a compute-bound operation that performs matrix arithmetic on thousands of tokens at once, while Decode (token generation) is a memory-bound operation that sequentially reads weights from HBM one token at a time. Because their bottleneck types are completely different, binding them to the same GPU causes Prefill to consume HBM bandwidth and delay Decode, while long Decode runs cause Prefill batches to pile up.
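The asymmetry is easy to quantify with arithmetic intensity (FLOPs per byte of weights read): a forward pass reads all weights once regardless of how many tokens it processes, so batching thousands of prefill tokens raises intensity by that factor. A rough model, assuming ~2 FLOPs per parameter per token and an 8B-parameter model in fp16:

```python
# Why prefill is compute-bound and decode memory-bound, in rough numbers.
# A forward pass reads every weight once no matter how many tokens it
# processes, so FLOPs-per-byte scales linearly with tokens per pass.
def arithmetic_intensity(tokens_per_pass, params=8e9, dtype_bytes=2):
    flops = 2 * params * tokens_per_pass  # ~2 FLOPs per parameter per token
    bytes_read = params * dtype_bytes     # weights streamed once per pass
    return flops / bytes_read

print(arithmetic_intensity(4096))  # prefill: ~4096 FLOPs/byte → compute-bound
print(arithmetic_intensity(1))     # decode: ~1 FLOP/byte → memory-bound
```

Modern GPUs need hundreds of FLOPs per byte to saturate their compute units, so single-token decode sits far below that threshold while batched prefill sits far above it; this is the gap that disaggregation exploits.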
```
[Client]
  │
  ▼
[llm-d EPP]
  │
  ├──▶ [Prefill Pod (A100 × 4)] ← Specialized for long prompt processing (compute-bound)
  │        │
  │        KV block batch transfer (LMCache)
  │        │
  └──▶ [Decode Pod (A100 × 2)] ← Specialized for token generation (memory-bound)
```

```python
# Example of Prefill/Decode role configuration.
# In practice, this JSON is passed as the --kv-transfer-config CLI
# argument when starting vLLM.
import os

os.environ["LMCACHE_USE_EXPERIMENTAL_FEATURES"] = "True"

# Prefill Pod configuration — generates KV and sends it to the Decode Pod
kv_transfer_config_prefill = {
    "kv_connector": "LMCacheConnector",
    "kv_role": "kv_producer",  # KV producer
    "kv_connector_port": 8100,
    "kv_connector_host": "0.0.0.0",
}

# Decode Pod configuration — receives KV and begins token generation
kv_transfer_config_decode = {
    "kv_connector": "LMCacheConnector",
    "kv_role": "kv_consumer",  # KV consumer
    "kv_connector_port": 8100,
    "kv_connector_host": "prefill-service",  # Prefill Pod service address
}

# vLLM launch example (Prefill Pod):
# vllm serve meta-llama/Llama-3.1-8B-Instruct \
#   --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_producer",...}'
```

Combining all three configurations produces this request flow:
```
[Client Request]
  │
  ▼
[llm-d EPP] ← Identifies the cache-holding Pod via prefix hash (Example 2)
  │
  ▼
[Prefill Pod] ← Processes the prompt and generates KV (Example 3)
  │
  │ LMCache → Saves KV to NFS/Valkey shared storage (Example 1)
  │
  ▼
[Decode Pod] ← Receives KV and begins token generation
  │
  ▼
[Client Response]
```

Benchmarks of llm-d's 16×16 B200 topology (16 Prefill nodes, 16 Decode nodes) show TTFT reduced by a single-digit multiple compared to round-robin routing. Not many organizations have clusters at this scale, but the principle applies equally at smaller sizes.
Pros and Cons Analysis
Advantages
| Item | Detail |
|---|---|
| Cost Reduction | Up to 1/10 the token processing cost on a cache hit (e.g., $3.00 → $0.30 / 1M tokens) |
| Reduced TTFT | Sub-400ms for cache-hit requests per official llm-d benchmarks, up to 85% reduction |
| Horizontal Scalability | New Pods immediately reuse the external cache upon joining; no warm-up needed |
| Fault Resilience | KV cache in external storage survives Pod restarts |
| Throughput Improvement | Up to 2× throughput on the same hardware per llm-d benchmarks |
Disadvantages and Caveats
| Item | Detail | Mitigation |
|---|---|---|
| Network Overhead | KV transfer costs can exceed cache savings | Evaluate high-speed storage such as InfiniStore/Mooncake (RDMA) before adopting |
| LRU Limitations | Simple LRU cannot predict near-future reuse for workloads like agentic workflows | Analyze workload patterns and tune cache policy accordingly |
| Memory Tier Complexity | Managing the GPU HBM → CPU DRAM → NVMe → remote storage hierarchy | Incremental adoption — enable CPU DRAM only first, then expand |
| Security Vulnerability | TTFT measurement can be used to infer other tenants' prompts (Cache Side-Channel) | Mandatory per-tenant cache isolation in multi-tenant environments |
| Operational Complexity | Additional components to manage: KVEvents streaming, block index synchronization, etc. | Set up monitoring dashboards and pre-configure alert thresholds |
| Cache Invalidation | Full cache invalidation required on model updates or LoRA switching | Add a cache flush hook to the deployment pipeline |
Cache Side-Channel Attack: In a multi-tenant environment, an attack that infers another user's prompt content by measuring response time differences caused by cache hits and misses. A detailed analysis is available in the CacheSolidarity paper.
RDMA (Remote Direct Memory Access): A technology that reads and writes memory directly between servers without involving the CPU. It reduces KV block transfer latency to the microsecond range.
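For the cache-invalidation row above, a deploy-pipeline flush hook can be as small as the sketch below. The NFS path matches Example 1; the `MODEL_TAG` marker file is a hypothetical convention of this sketch, not an LMCache feature.

```python
# Minimal sketch of a deploy-time KV cache flush hook for the NFS-backed
# setup from Example 1. Call it whenever the model weights or active LoRA
# change, since stale KV blocks would otherwise be silently reused.
import shutil
from pathlib import Path

CACHE_ROOT = Path("/mnt/nfs/lmcache")  # matches remote_url in Example 1

def flush_kv_cache(model_tag: str, cache_root: Path = CACHE_ROOT) -> None:
    """Remove all cached KV blocks and record which model the cache is for."""
    if cache_root.exists():
        shutil.rmtree(cache_root)          # drop every stale KV block
    cache_root.mkdir(parents=True)
    # Hypothetical marker file so operators can see what the cache matches.
    (cache_root / "MODEL_TAG").write_text(model_tag)

# Example: flush_kv_cache("llama-3.1-8b-instruct@v2") from the CD pipeline.
```

Wiring this into the deployment pipeline (e.g., a pre-rollout Job) guarantees the cache can never outlive the weights that produced it.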
The Most Common Mistakes in Practice
Some of these I've experienced firsthand; others are patterns I frequently observe in teams.
- Not measuring shared storage latency in advance — If NFS or S3 latency exceeds tens of milliseconds, a cache hit can actually be a net loss. It's worth measuring actual latency characteristics with tools like `fio` before adopting. The choice of storage can completely change the outcome.
- Applying the same `chunk_size` to all workloads — Small chunks (1 chunk = 128–256 tokens) are efficient for short conversational queries; large chunks (1 chunk = 512–1024 tokens) suit long RAG documents. Optimal values differ by workload, so measure and adjust as you go.
- Deploying in multi-tenant environments without cache isolation — If the tenant ID is not included in the prefix hash, caches are shared across tenants and the Cache Side-Channel vulnerability described above is introduced. It is strongly recommended to enable tenant isolation options in the EPP configuration.
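The third mistake has a correspondingly small fix: salt the prefix hash with the tenant ID, so identical prompts from different tenants occupy disjoint cache namespaces. A hypothetical sketch (llm-d's actual isolation options may work differently):

```python
# Hypothetical sketch of tenant-salted prefix hashing: folding the tenant
# ID into the hash input means two tenants sending byte-identical prompts
# can never share (or time) each other's cache entries.
import hashlib

def tenant_prefix_hash(tenant_id: str, prompt: str) -> str:
    # NUL separator prevents ("ab", "c") colliding with ("a", "bc")
    return hashlib.sha256(f"{tenant_id}\x00{prompt}".encode()).hexdigest()

h1 = tenant_prefix_hash("tenant-a", "same system prompt")
h2 = tenant_prefix_hash("tenant-b", "same system prompt")
assert h1 != h2  # identical prompts, isolated cache namespaces
```

The cost is losing cross-tenant sharing of genuinely public prefixes, so some setups salt only tenant-private segments of the prompt; that trade-off should be an explicit design decision rather than a default.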
Closing Thoughts
The paradox where cache hit rate drops as you add more Pods is not a configuration problem — it stems from a structure that traps the KV cache inside a single instance. By moving the KV cache outside the cluster with LMCache and attaching cache-aware routing with the llm-d EPP, the cost and latency profile of multi-replica LLM serving changes fundamentally.
If you're already running vLLM, the barrier to entry is lower than you might think. Before touching the cluster layer, you can already feel the effect by verifying CPU offload alone on a single node.
Three steps you can take right now:
- Experiment with LMCache standalone (verifiable in 30 minutes) — Install it with `pip install lmcache` in your existing vLLM environment, add only the `local_device: "cpu"` configuration, and measure the CPU offload hit rate on a single Pod first. The LMCache Quickstart is well-documented.
- Integrate shared storage (requires an NFS or Valkey environment) — If hit rates are meaningful, specify NFS or Valkey as the `remote_url` and expand to KV cache sharing across multiple Pods. You can verify the cache-sharing effect at this stage even without llm-d.
- Introduce the llm-d EPP (requires a Kubernetes cluster as a prerequisite) — Once cache-sharing benefits are confirmed, deploy the EPP using the Helm chart from the llm-d GitHub repo and add cache-aware routing to push hit rates even higher.
Next article: Inside llm-d's KVEvents streaming protocol — a code-level look in Go at how the block index stays synchronized in real time.
References
- KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d | llm-d Official Blog
- llm-d Official Architecture Docs | llm-d.ai
- LMCache GitHub | LMCache/LMCache
- llm-d-kv-cache GitHub | llm-d/llm-d-kv-cache
- LMCache Joins PyTorch Ecosystem | pytorch.org
- LMCache Paper | arXiv:2510.09665
- Welcome llm-d to the CNCF | CNCF Official Blog
- Donating llm-d to the CNCF | IBM Research
- llm-d officially a CNCF Sandbox project | Google Cloud Blog
- Master KV cache aware routing with llm-d | Red Hat Developer
- Introduction to distributed inference with llm-d | Red Hat Developer
- KV Caching with vLLM, LMCache, and Ceph | Ceph.io
- CacheSolidarity: Preventing Prefix Caching Side Channels | arXiv
- Disaggregated Prefill with LMCache | vLLM Official Docs
- Cluster-scale KV caching for 9.9× faster LLM inference | Crusoe MemoryAlloy