Part 6a – KV Cache: GPU-Level Inference Caching
Sources: NetApp Engineering Blog (Mar 2026) · NVIDIA Developer Blog (Sep 2025) · HPCWire (May 2026)
The Intuition
Imagine you are answering questions about a 500-page book. Without a cache, you re-read the entire book from page 1 every time someone asks a new question. That is what an LLM does without a KV cache: it recomputes attention over the entire context for every new token it generates.
The Technical Reality
In a transformer's attention mechanism, every token attends to all previous tokens:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \cdot V$$
For each new token, the model needs K and V tensors for all previous tokens. Without caching, these are recomputed every time.
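Spelled out for a single decode step $t$, with $q_t = x_t W_Q$ and $x_t$ the new token's hidden state (this is the standard KV-cache formulation, not notation from the cited posts), only the new row of K and V is computed; everything earlier is read from the cache:

$$k_t = x_t W_K, \qquad v_t = x_t W_V$$

$$K_{1:t} = \left[K_{1:t-1};\ k_t\right], \qquad V_{1:t} = \left[V_{1:t-1};\ v_t\right]$$

$$\text{out}_t = \text{softmax}\left(\frac{q_t \cdot K_{1:t}^T}{\sqrt{d_k}}\right) \cdot V_{1:t}$$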
Without KV Cache
[token 1][token 2]...[token N]
- Compute K,V for ALL tokens every step
- O(n²) per token generated
- GPU overloaded
- High latency · High cost
With KV Cache
KV Cache (GPU HBM) stores K₁,V₁ K₂,V₂ … Kₙ,Vₙ
- New token → compute only K_new, V_new → append
- O(n) per token generated
- Cache hit → retrieve fast
- GPU freed · Low latency
What's cached: the static prefix (system prompt + tools) is computed once and cached; the dynamic suffix (user messages, tool results) is appended each turn. The sketch below shows the append step.
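To make the append concrete, here is a minimal single-head, single-sequence sketch in PyTorch. The dimensions and random projection weights are toy values for illustration, not from any of the cited posts:

```python
import torch

d_model = d_k = 64
W_Q = torch.randn(d_model, d_k)  # toy projection weights
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

K_cache = torch.empty(0, d_k)  # grows by one row per generated token
V_cache = torch.empty(0, d_k)

def decode_step(x_t: torch.Tensor) -> torch.Tensor:
    """One decode step: project only the new token, append, attend."""
    global K_cache, V_cache
    q_t = x_t @ W_Q                            # (1, d_k)
    K_cache = torch.cat([K_cache, x_t @ W_K])  # one new K row, not N
    V_cache = torch.cat([V_cache, x_t @ W_V])  # one new V row, not N
    scores = (q_t @ K_cache.T) / d_k ** 0.5    # attend over cached keys
    return torch.softmax(scores, dim=-1) @ V_cache

for _ in range(5):                             # generate 5 toy tokens
    out = decode_step(torch.randn(1, d_model))
print(out.shape)                               # torch.Size([1, 64])
```

Each step does O(n) work against the cache instead of recomputing all n sets of K,V projections from scratch.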
KV Cache at Scale: The Memory Budget Problem
Real numbers (from the NVIDIA blog): Llama 3 70B with a 128k-token context window consumes ~40 GB of KV cache for a single user. At a batch of 10 concurrent users that is ~400 GB, more than any single GPU offers. The arithmetic below reproduces the figure.
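A back-of-the-envelope sketch: the layer count, KV-head count (grouped-query attention), and head dimension below are Llama 3 70B's published architecture values, assumed here rather than taken from the NVIDIA post, which only gives the final number:

```python
# KV bytes per token = 2 tensors (K and V) x layers x KV heads x head dim x dtype size
n_layers   = 80        # Llama 3 70B (published architecture, assumed here)
n_kv_heads = 8         # grouped-query attention shrinks this from 64 query heads
head_dim   = 128
fp16_bytes = 2

per_token = 2 * n_layers * n_kv_heads * head_dim * fp16_bytes
ctx       = 128 * 1024                               # 128k-token context window

print(f"{per_token / 1024:.0f} KB/token")            # 320 KB/token
print(f"{per_token * ctx / 1e9:.1f} GB/user")        # ~42.9 GB -> the "~40 GB" figure
print(f"{per_token * ctx * 10 / 1e9:.0f} GB total")  # ~429 GB for 10 users
```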
KV CACHE FIXED MEMORY POOL
+-------------------------------------------------+
| [user1 K/V blocks] [user2 K/V blocks] [...] |
| ################################### <- FULL |
+-------------------------------------------------+
          ↓ when full
K/V EVICTIONS (LRU policy)
Evicted user's next request -> CACHE MISS
Cache miss -> full recompute -> latency spike

vLLM's Paged KV Cache: borrowed from OS virtual memory. KV tensors are stored in fixed-size pages instead of contiguous blocks, so memory is allocated and reclaimed at page granularity.
Key win: Shared Prefix Pages. 1,000 users sharing the same RAG context store it once as shared prefix pages, a massive memory saving; the sketch below shows the vLLM flag that enables this.
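In vLLM the paged cache is the default allocator, and prefix sharing is exposed as automatic prefix caching. A minimal sketch, assuming a recent vLLM release (`enable_prefix_caching` is vLLM's flag; the model ID and prompts are placeholders):

```python
from vllm import LLM, SamplingParams

# Paged KV cache is built in; prefix caching lets identical prompt
# prefixes (e.g., a shared RAG context) reuse the same cache pages.
llm = LLM(model="meta-llama/Llama-3.1-70B", enable_prefix_caching=True)

shared_ctx = "<the RAG context every user shares>"   # placeholder
params = SamplingParams(max_tokens=128)

# Both prompts hit the cached pages for shared_ctx; only each user's
# suffix is prefilled from scratch.
outputs = llm.generate(
    [shared_ctx + "\nUser A's question", shared_ctx + "\nUser B's question"],
    params,
)
```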
The Multi-Node Problem: Where Scale Gets Really Hard
Source: NetApp, "Engineering Inference: KV Cache, Shared Storage, and the Economics of AI" (2026)
Single vLLM node: clean, automatic prefix reuse, boringly simple.
Add a second node behind a load balancer: the rules change completely.
LMCache solves this by treating the KV cache as shared infrastructure rather than private per-node memory.
Critical config: use kv_role: "kv_both" (not just prefill or decode). Decode-only caching creates subtle mismatches between what was cached during prefill and what is needed during generation; the sketch below shows the wiring.
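A minimal wiring sketch based on LMCache's published vLLM integration. KVTransferConfig and the LMCacheConnector name come from those examples, but exact class and connector spellings have shifted across versions, so verify against the release you run; CPU-RAM and S3 tier sizes live in LMCache's own config file rather than in this snippet:

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# kv_role="kv_both": this node both writes KV during prefill and reads
# it during decode, avoiding prefill/decode cache mismatches.
ktc = KVTransferConfig(
    kv_connector="LMCacheConnector",   # name from LMCache's examples
    kv_role="kv_both",
)

llm = LLM(
    model="meta-llama/Llama-3.1-70B",  # placeholder model ID
    kv_transfer_config=ktc,
)
# Tiering (CPU RAM overflow -> S3) is configured on the LMCache side,
# typically via a YAML file pointed to by the LMCACHE_CONFIG_FILE env var.
```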
NetApp finding: adding the S3 tier shows virtually no downside because the S3 and CPU tiers complement each other: S3 catches overflow from CPU RAM without hurting latency for hot entries.
NVIDIA Unified Memory: Hardware-Level Solution (GH200 / Grace Blackwell)
Source: NVIDIA Developer Blog (Sep 2025)
The OOM problem made concrete:
- Llama 3 70B in FP16 needs ~140 GB of GPU memory (70B parameters × 2 bytes each). A GH200 has 96 GB of HBM → OOM error.
- Solution: NVLink-C2C, a 900 GB/s interconnect between the CPU (480 GB LPDDR) and the GPU (96 GB HBM) that creates a single unified address space, at 7× the bandwidth of PCIe Gen 5.
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator
from transformers import pipeline

# Enable unified (managed) memory: the GPU can now transparently
# spill allocations to CPU RAM over NVLink-C2C.
rmm.reinitialize(managed_memory=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

# Loads without OOM; the hardware migrates pages automatically.
pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-70B")

The Economics Argument
"Training made the headlines. Inference pays the power bill." โ NetApp/HPCWire 2026
KV cache reuse + quantization together change the economics:
- Reused KV block = GPU compute you did not pay for twice
- CPU/S3 offload = memory pressure not pushed onto expensive accelerators
- Unified memory = serve larger models without OOM, without buying more GPUs
The companies that win will not throw the most GPUs at the problem; they will engineer smarter inference paths.
Recall Hook
Single node: vLLM Paged KV. Multi-node: LMCache + shared tiers (GPU→CPU→S3). Hardware limit: NVIDIA unified memory. Economics: inference > training costs now.
Sources
- NetApp, "Engineering Inference: KV Cache, Shared Storage, and the Economics of AI" (Mar 2026)
- NVIDIA, "Accelerate LLM Inference and KV Cache Offload with CPU-GPU Memory Sharing" (Sep 2025)
- HPCWire, "Why the Race to Expand KV Cache Is Critical for AI Inference Success" (May 2026)
Running vLLM or LMCache in production? Add your config; specific tuning data is rare and useful.