Part 6b - Prompt Caching: Provider-Level Caching
Sources: Claude Code Blog - Prompt Caching is Everything (Apr 30, 2026); Google Gemini API Caching Docs
The Intuition
KV cache lives inside the inference server and is transparent to you. Prompt caching is exposed to you as a developer via the API: you structure your prompt so the stable parts form a shared prefix, and every request that matches it pays roughly 1/10th of the input token price.
"It is often said in engineering that 'cache rules everything around me', and the same rule holds for agents." β Claude Code team
At Claude Code, prompt cache hit rate is monitored like uptime; they declare SEVs (incidents) when it drops too low. Even a few percentage points of extra cache misses dramatically affect cost and latency.
The System Prompt Layout: The Core Rule
The core rule: Prefix match, byte-for-byte, from the start. Any change anywhere in the prefix invalidates everything after it. Order is everything.
The 5 Claude Code Lessons (All Hard-Won in Production)
Lesson 1: Lay Out Your Prompt for Caching
Static first, dynamic last (see the sketch after this list). They have broken this in production by:
- Putting a timestamp in the static system prompt
- Shuffling tool order non-deterministically
- Updating tool parameter schemas
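A minimal sketch of the static-first layout using the Anthropic Python SDK. The `cache_control` breakpoint is the real prompt caching mechanism; the model name, tool schema, and instruction text are illustrative placeholders, not Claude Code's actual prompt.

```python
import anthropic

client = anthropic.Anthropic()

# Static prefix: byte-identical on every request in the session, so it stays cached.
SYSTEM = [
    {
        "type": "text",
        "text": "You are a coding agent. <long, stable instructions live here>",
        "cache_control": {"type": "ephemeral"},  # breakpoint: everything up to here is cached
    }
]
TOOLS = [  # fixed contents and fixed order; never shuffled or re-schema'd mid-session
    {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]},
    }
]

# Dynamic content (the task, timestamps, file state) goes last, inside the messages.
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=SYSTEM,
    tools=TOOLS,
    messages=[{"role": "user", "content": "Fix the failing test in tests/test_cache.py."}],
)
# Note: prefixes below the model's minimum cacheable length are not cached.
print(response.usage)  # cache_creation_input_tokens / cache_read_input_tokens show what was cached
```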
Lesson 2: Use Messages for Updates, Not System Prompt Changes
If information becomes outdated (current time, file state, mode change), inserting it into the next user message via a <system-reminder> tag preserves the cache. Editing the system prompt = cache miss on every subsequent request.
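A sketch of that pattern under the same assumptions as the layout above: the stale fact rides along inside the next user message as a <system-reminder> block, and the cached prefix never changes bytes.

```python
from datetime import datetime, timezone

conversation = [  # turns so far (illustrative)
    {"role": "user", "content": "Fix the failing test."},
    {"role": "assistant", "content": "I ran the suite and found the failure."},
]

# WRONG: mutating the cached system prompt invalidates the prefix for every later request.
#   system_prompt += f"\nCurrent time: {now}"

# RIGHT: carry the update inside the next user message; the cached prefix stays byte-identical.
now = datetime.now(timezone.utc).isoformat()
conversation.append({
    "role": "user",
    "content": (
        f"<system-reminder>Current time: {now}. Plan mode is active.</system-reminder>\n"
        "Continue with the task."
    ),
})
```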
Lesson 3: Never Change Tools or Models Mid-Session
- Tools are part of the cached prefix. Add or remove a tool mid-conversation → cache invalidated for the entire conversation.
- Claude Code's solution for Plan Mode (sketched after this list): instead of swapping tool sets, they keep ALL tools in every request and expose EnterPlanMode/ExitPlanMode as tools the model calls itself. Tool definitions never change.
- Deferred loading: for MCP tools, send lightweight stubs with defer_loading: true. The full schema loads only when the tool is selected. Stable stubs = stable prefix = cache preserved.
- Don't switch models mid-session: prompt caches are unique to each model. Switching from Opus to Haiku at turn 50 means rebuilding the entire cache for Haiku from scratch, which is more expensive than Opus answering a simple question on a cache hit.
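A sketch of the plan-mode pattern: the same tool list is sent on every request, and entering or leaving plan mode is just a tool call. The names and schemas here are illustrative, not Claude Code's actual definitions.

```python
# Sent on every request of the session, in the same order, for the whole conversation.
TOOLS = [
    {"name": "read_file", "description": "Read a file from the workspace.",
     "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]}},
    {"name": "edit_file", "description": "Apply an edit to a file.",
     "input_schema": {"type": "object",
                      "properties": {"path": {"type": "string"}, "patch": {"type": "string"}},
                      "required": ["path", "patch"]}},
    # Mode changes are tools the model calls, not a different tool set:
    {"name": "EnterPlanMode", "description": "Switch to read-only planning before making edits.",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "ExitPlanMode", "description": "Leave planning and resume making edits.",
     "input_schema": {"type": "object", "properties": {}}},
]
```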
Lesson 4: Fork Operations Must Share the Parent's Prefix
When running a side computation (compaction, summarization), use identical system prompt + tools as the parent.
WRONG (new system prompt = no cache hit = expensive):

```
POST /messages
system: "Summarize the following conversation"   <- different prefix, no hit
messages: [full conversation history]            <- pay full price for everything
```

CORRECT (same prefix = cache hit = cheap):

```
POST /messages
system: [exact same system prompt as parent]     <- cache hit! 1/10 price
tools: [exact same tool list as parent]          <- cache hit!
messages: [full conversation] + ["Summarize"]    <- only pay for the new tokens
```

Lesson 5: Monitor Your Cache Hit Rate Like You Monitor Uptime
Set up alerts. Treat drops as incidents. Measure it constantly.
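One way to compute the metric per request, assuming the Anthropic usage fields from the earlier sketches (cache_read_input_tokens, cache_creation_input_tokens, input_tokens); how you aggregate and alert on it is up to your own monitoring stack.

```python
def cache_hit_rate(usage) -> float:
    """Fraction of prompt tokens for one request that were served from cache."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = usage.input_tokens or 0
    total = read + written + uncached
    return read / total if total else 0.0

# Aggregate per minute/hour and alert when the rate drops below your baseline.
```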
Compaction: How Claude Code Handles Long Conversations
Main conversation (growing toward the context limit):
- System + Tools
- user message 1
- assistant + tools
- user message 2
- ... many turns ...
- [compaction buffer]

Forked summarization call (same prefix as the parent):
- System + Tools (same prefix)
- Full conversation → cache hit = 1/10 price
- + "Summarize this" → summary (~20k tokens)

Fresh conversation after compaction:
- System + Tools
- compact_boundary
- Conversation summary
- Re-attached files
- room for new turns
The forked call uses the exact parent prefix → cache hit → you pay 1/10th the price for the largest part. Only the new compaction prompt is billed at the full rate.
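A sketch of that forked summarization call, reusing the illustrative `client`, `SYSTEM`, and `TOOLS` from the Lesson 1 sketch; the only tokens billed at the full rate are the appended summarization instruction.

```python
# history = the parent conversation's messages so far (unchanged, byte-identical)
summary = client.messages.create(
    model="claude-sonnet-4-5",   # same model as the parent; prompt caches are per-model
    max_tokens=4096,
    system=SYSTEM,               # exact same system prompt as the parent -> cache hit
    tools=TOOLS,                 # exact same tool list as the parent -> cache hit
    messages=history + [
        {"role": "user", "content": "Summarize this conversation so it can replace the full history."}
    ],
)
print(summary.usage.cache_read_input_tokens)  # should cover nearly the entire prefix
```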
Pricing Model
| Action | Price |
|---|---|
| Cache write (first time) | 1.25Γ normal |
| Cache read (every hit) | 0.10Γ normal |
| Break-even point | ~2 requests |
| After 10 requests | ~90% cost reduction on cached tokens |
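A back-of-envelope check of the break-even and savings claims, assuming the same prefix is resent n times and normalizing the uncached cost of that prefix to 1.0 per request:

```python
# Relative cost of sending the same prefix n times (uncached price of the prefix = 1.0 per send).
def cost_uncached(n: int) -> float:
    return n * 1.00

def cost_cached(n: int) -> float:
    return 1.25 + (n - 1) * 0.10   # one cache write, then cache reads

for n in (1, 2, 10):
    print(n, cost_uncached(n), round(cost_cached(n), 2))
# n=1  -> 1.0 vs 1.25  (you paid the write premium for nothing)
# n=2  -> 2.0 vs 1.35  (already ahead: break-even at ~2 requests)
# n=10 -> 10.0 vs 2.15 (each read is 90% cheaper; ~80% saved overall on the cached prefix)
```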
Google Gemini Caching: Implicit vs. Explicit
Implicit Caching (Gemini 2.5+, auto-enabled)
Zero configuration. Google automatically passes on cost savings if your request hits existing caches. Check hits via response.usage_metadata.cached_token_count.
Tip: Put large, common content at the beginning of your prompt. Send requests with similar prefixes in a short time window to increase hit chances.
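A minimal sketch of leaning on implicit caching (the file name is a placeholder; model name taken from the explicit example below): keep the large document as the leading content, vary only the trailing question, and send the requests close together in time.

```python
from google import genai

client = genai.Client()
long_doc = open("transcript.txt").read()    # large, stable content goes first

for question in ["Who are the speakers?", "List the action items."]:
    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=[long_doc, question],      # same big prefix, different short suffix
    )
    print(response.usage_metadata)          # cache hits show up here from the second request on
```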
Explicit Caching (manual, guaranteed cost savings)
```python
from google import genai
from google.genai import types

client = genai.Client()

# Step 1: Create a cache (5-minute TTL)
cache = client.caches.create(
    model="gemini-3-flash-preview",
    config=types.CreateCachedContentConfig(
        system_instruction="You are an expert analyzing transcripts.",
        contents=[document],  # the large stable document
        ttl="300s",           # cache for 5 minutes
        display_name="my-doc-cache",
    ),
)

# Step 2: Use it in every request
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Please summarize this transcript",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.usage_metadata)  # shows cached_token_count
```

Best for: chatbots with extensive system instructions, repetitive video/PDF analysis, recurring queries against large document sets.
Gemini Inference Service Tiers
| Tier | Price | Latency | Reliability | Best for |
|---|---|---|---|---|
| Standard | Full price | Seconds to minutes | High | General day-to-day apps |
| Priority | 75–100% premium | Seconds | Highest (non-sheddable) | Customer chatbots, real-time fraud detection |
| Flex | 50% of standard | 1–15 min target | Best-effort (sheddable) | Multi-step agent workflows, background CRM, offline evals |
| Batch | 50% of standard | Up to 24 hours | High throughput | Pre-processing datasets, regression suites, bulk embeddings |
| Context Cache | 90% discount + storage | Faster time-to-first-token | N/A | Recurring queries over same content |
Flex is the sleeper for agent workflows: 50% discount, synchronous interface (no async code changes needed), and most multi-step agent chains are not latency-sensitive enough to need Priority pricing.
Recall Hook
Stable top, dynamic bottom. Never change tools mid-session. Monitor cache hit rate like uptime. Fork operations must share the parent's prefix exactly. Gemini Flex = 50% off for agent workflows.
Sources
- Claude Code - Prompt Caching is Everything (Apr 30, 2026)
- Google Gemini API Caching Docs
- Google Gemini API Optimization & Service Tiers
Have prompt cache hit rate data from production? Share it; real numbers are rare.