Part 6a - KV Cache: GPU-Level Inference Caching

Sources: NetApp Engineering Blog (Mar 2026) · NVIDIA Developer Blog (Sep 2025) · HPCWire (May 2026)


The Intuition

Imagine you are answering questions about a 500-page book. Without a cache, you re-read the entire book from page 1 every time someone asks a new question. That is what an LLM does without a KV cache: it recomputes attention over the entire context for every single new token it generates.


The Technical Reality

In a transformer's attention mechanism, every token attends to all previous tokens:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \cdot V$$

For each new token, the model needs K and V tensors for all previous tokens. Without caching, these are recomputed every time.

Without KV Cache

  • [token 1][token 2]...[token N]
  • Compute K,V for ALL tokens every step
  • O(n²) per token generated
  • GPU overloaded
  • High latency · High cost

With KV Cache

KV Cache (GPU HBM) stores K₁,V₁ K₂,V₂ … Kₙ,Vₙ

  • New token → compute only K_new, V_new → append
  • O(n) per token generated
  • Cache hit → retrieve fast
  • GPU freed · Low latency

What's cached: the static prefix (system prompt + tools) is computed once and cached; the dynamic suffix (user messages, tool results) is appended each turn.
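
To see the asymmetry in code, here is a minimal single-head decode step against a cache, in plain PyTorch. This is an illustrative sketch of the mechanism only; decode_step and the tensor shapes are invented for the example, not vLLM internals:

```python
import torch

d_k = 64                           # head dimension
k_cache = torch.zeros(0, d_k)      # grows by one row per generated token
v_cache = torch.zeros(0, d_k)

def decode_step(q_new, k_new, v_new):
    """One generation step: attend the newest query to all cached K/V."""
    global k_cache, v_cache
    # Only the new token's K/V are computed and appended: O(n) per step
    k_cache = torch.cat([k_cache, k_new])
    v_cache = torch.cat([v_cache, v_new])
    scores = (q_new @ k_cache.T) / d_k ** 0.5    # shape (1, n)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v_cache                     # shape (1, d_k)

# Each step feeds in just the newest token's projections
out = decode_step(torch.randn(1, d_k), torch.randn(1, d_k), torch.randn(1, d_k))
```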


KV Cache at Scale: The Memory Budget Problem

Real numbers (from the NVIDIA blog): Llama 3 70B with a 128k-token context window consumes ~40 GB of KV cache for a single user. At a batch of 10 such users → 400 GB. That is more HBM than any single GPU has.
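
The ~40 GB figure is easy to sanity-check. The shape numbers below are Llama 3 70B's published architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128), not values from the cited blog:

```python
# KV cache bytes per token = 2 tensors (K and V) x layers x kv_heads
#                            x head_dim x bytes per FP16 value
layers, kv_heads, head_dim, fp16 = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16
print(per_token)                      # 327,680 bytes ~ 320 KB per token
print(per_token * 128_000 / 1e9)      # ~41.9 GB for one 128k-token user
```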

KV CACHE FIXED MEMORY POOL
+-------------------------------------------------+
|  [user1 K/V blocks] [user2 K/V blocks] [...]    |
|  ###################################### <- FULL |
+-------------------------------------------------+
              | when full
              v
   K/V EVICTIONS (LRU policy)
   Evicted user's next request -> CACHE MISS
   Cache miss -> full recompute -> latency spike
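
A toy model of that eviction dynamic, as a sketch only: KVPool, max_blocks, and recompute are invented names, not any serving engine's API:

```python
from collections import OrderedDict

class KVPool:
    """Fixed-capacity pool of per-request K/V blocks with LRU eviction."""
    def __init__(self, max_blocks):
        self.blocks = OrderedDict()              # request_id -> K/V blocks
        self.max_blocks = max_blocks

    def get(self, request_id, recompute):
        if request_id in self.blocks:            # cache hit: cheap lookup
            self.blocks.move_to_end(request_id)
            return self.blocks[request_id]
        kv = recompute()                         # cache miss: full prefill
        self.blocks[request_id] = kv
        if len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)      # evict least-recently-used
        return kv

pool = KVPool(max_blocks=2)
pool.get("user1", lambda: "kv1")   # miss -> prefill
pool.get("user2", lambda: "kv2")   # miss -> prefill
pool.get("user3", lambda: "kv3")   # miss -> evicts user1
pool.get("user1", lambda: "kv1")   # miss again: the latency spike
```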

vLLM's Paged KV Cache (PagedAttention): an idea borrowed from OS virtual memory. KV tensors are stored in fixed-size pages instead of one contiguous allocation per request, which eliminates fragmentation and lets pages be shared across requests.

Key win: shared prefix pages. 1000 users hitting the same RAG context → the prefix is stored once as shared pages → massive memory savings.
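
In vLLM this reuse is a one-flag change. A minimal sketch; enable_prefix_caching is a real vLLM engine argument, but the model name, file, and prompts here are placeholders:

```python
from vllm import LLM, SamplingParams

# Automatic prefix caching: identical prompt prefixes are hashed into
# shared paged KV blocks instead of being recomputed per request.
llm = LLM(model="meta-llama/Llama-3.1-70B", enable_prefix_caching=True)

shared_context = open("rag_context.txt").read()   # same prefix for every user
prompts = [shared_context + "\n\nQ: " + q for q in ("Why?", "How?", "When?")]

# Each request computes K/V only for its unique suffix; the shared
# prefix pages are stored once and reused by all three.
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```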


The Multi-Node Problem: Where Scale Gets Really Hard

Source: NetApp, "Engineering Inference: KV Cache, Shared Storage, and the Economics of AI" (2026)

Single vLLM node: clean, automatic prefix reuse, boringly simple.

Add a second node behind a load balancer and the rules change completely: each node's KV cache is private, so the balancer can route a user's next turn to a node that never saw their prefix, and the work the first node already did is wasted.

LMCache solves this by treating the KV cache as shared infrastructure rather than private per-node memory.

Critical config: Use kv_role: "kv_both" (not just prefill OR decode). Decode-only caching creates subtle mismatches between what was cached during prefill and what is needed during generation.
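
A minimal sketch of wiring that role into vLLM's LMCache connector. The kv_role values and the LMCacheConnectorV1 name follow the LMCache integration docs, but the exact constructor shape varies across vLLM versions, so treat the details as assumptions to verify against your release:

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# kv_both: this node both produces KV entries (prefill) and consumes them
# (decode), avoiding the prefill/decode mismatch described above.
ktc = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",
    kv_role="kv_both",
)

llm = LLM(
    model="meta-llama/Llama-3.1-70B",   # placeholder model
    kv_transfer_config=ktc,
)
# Tier endpoints (CPU RAM, S3) live in LMCache's own config file, not shown here.
```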

NetApp finding: adding the S3 tier shows virtually no downside, because the S3 and CPU tiers complement each other: S3 catches overflow from CPU RAM without hurting latency for hot entries.


NVIDIA Unified Memory: Hardware-Level Solution (GH200 / Grace Blackwell)

Source: NVIDIA Developer Blog (Sep 2025)

The OOM problem made concrete:

  • Llama 3 70B in FP16 → ~140 GB of GPU memory for the weights alone. GH200 has 96 GB of HBM → OOM error.
  • Solution: NVLink-C2C, a 900 GB/s interconnect between CPU (480 GB LPDDR) and GPU (96 GB HBM) that creates a single unified address space: 7× the bandwidth of PCIe Gen 5.
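
Two bytes per FP16 parameter make the first bullet easy to verify (my arithmetic, not a figure from the blog):

```python
# 70B parameters x 2 bytes (FP16), weights only: no KV cache, no activations
params = 70e9
print(f"{params * 2 / 1e9:.0f} GB")   # 140 GB, against 96 GB of GH200 HBM
```

NVIDIA's recipe for getting past that wall is to hand PyTorch a managed (unified-memory) allocator via RMM: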
```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator
from transformers import pipeline  # missing from the original snippet

# Enable unified memory: the GPU can now transparently spill to CPU RAM
rmm.reinitialize(managed_memory=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

# Now loads without OOM; the hardware handles data movement automatically.
# torch_dtype/device are assumed here to match the FP16 sizing above.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-70B",
    torch_dtype=torch.float16,
    device="cuda",
)
```

The Economics Argument

"Training made the headlines. Inference pays the power bill." โ€” NetApp/HPCWire 2026

KV cache reuse + quantization together change the economics:

  • Reused KV block = GPU compute you did not pay for twice
  • CPU/S3 offload = memory pressure not pushed onto expensive accelerators
  • Unified memory = serve larger models without OOM, without buying more GPUs

The companies that win will not throw the most GPUs at the problem; they will engineer smarter inference paths.


Recall Hook

Single node: vLLM Paged KV. Multi-node: LMCache + shared tiers (GPU → CPU → S3). Hardware limit: NVIDIA unified memory. Economics: inference > training costs now.


Sources

  • NetApp Engineering Blog: "Engineering Inference: KV Cache, Shared Storage, and the Economics of AI" (Mar 2026)
  • NVIDIA Developer Blog (Sep 2025)
  • HPCWire (May 2026)

Running vLLM or LMCache in production? Add your config; specific tuning data is rare and useful.

Built from real deployments. Not theory.