
Part 6c - Semantic Caching: Meaning-Level Cache

Sources: Microsoft Azure Cosmos DB Semantic Cache · LiteLLM Proxy Caching


The Intuition

Prompt caching says: "If the exact bytes match, reuse the result." Semantic caching says: "If the meaning matches, reuse the result."

User 1: "What is the return policy?"
User 2: "How do I return a product?"       <- different words, same intent
User 3: "Can I get a refund on my order?"  <- different words, same intent

Prompt caching:   3 cache misses (all different text)
Semantic caching: 1 cache miss (first one), then 2 hits (same intent)

How It Works

New request
     |
  Embed query -> vector embedding
     |
  Vector similarity search against cached (query, response) pairs
  similarity score = 0 (no match) to 1 (exact match)
     |
  Score > threshold? -> return cached response (no LLM call)
  Score <= threshold? -> call LLM -> embed and store (query, response) pair
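
A minimal sketch of that loop, assuming a hypothetical embed() function and a plain Python list standing in for the vector store (a real deployment would use Redis, Qdrant, or another vector database):

python
import numpy as np

SIMILARITY_THRESHOLD = 0.95   # tuned per domain; see the tuning section below

cache = []   # list of (embedding, query, response) tuples; stand-in for a vector DB

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query, embed, call_llm):
    """embed() and call_llm() are assumed helpers, not a specific library's API."""
    q_vec = embed(query)

    # Vector similarity search over cached (query, response) pairs
    best_score, best_response = 0.0, None
    for vec, _, cached_response in cache:
        score = cosine_similarity(q_vec, vec)
        if score > best_score:
            best_score, best_response = score, cached_response

    if best_score > SIMILARITY_THRESHOLD:
        return best_response              # hit: same meaning, no LLM call

    response = call_llm(query)            # miss: pay for inference once
    cache.append((q_vec, query, response))
    return response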

The Context Window Problem - Critical Production Gotcha

Source: Microsoft Azure Cosmos DB Docs

The problem: keying the cache on only the latest user message is dangerously wrong.

Classic failure scenario:

User asks: "What is the largest lake in North America?"
-> LLM: "Lake Superior." -> cached

User (same session, next turn) asks: "What is the second largest?"
-> With context: "Lake Huron" -> cached

--- Later, different user, different session ---

New user asks: "What is the largest stadium in North America?"
-> LLM: "Michigan Stadium." -> cached

New user then asks: "What is the second largest?"
-> Semantic cache finds "What is the second largest?" from before
-> Returns "Lake Huron"  <-- WRONG. Context was about lakes, not stadiums.

The fix: Cache keys must include context window history, not just the latest message. Vectorize the sliding window of recent prompts + the new message as the lookup key. This ensures what is returned from cache is contextually correct.
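
A sketch of what a context-aware lookup key can look like; the helper below is hypothetical, and the only point is that recent turns get embedded together with the new message:

python
# Hypothetical helper: build the text that gets embedded as the cache key.
WINDOW = 3   # number of recent turns to include; tune per application

def cache_key_text(history, new_message, window=WINDOW):
    """history: prior user/assistant messages in this session, as strings."""
    recent = history[-window:]
    # "What is the second largest?" now embeds differently in a lakes
    # conversation than in a stadiums conversation.
    return "\n".join(recent + [new_message])

# key_vector = embed(cache_key_text(history, new_message))  # then search as before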


Similarity Score Tuning

This requires trial and error in production; a sketch of an offline threshold sweep follows the table:

Threshold | Effect
Too high (e.g., 0.99) | Few hits. Cache fills with near-duplicate entries. High LLM spend.
Too low (e.g., 0.70) | Too many hits. Returns responses for similar but actually different questions. Wrong answers.
Sweet spot | Typically 0.92–0.97 for general use. Domain-specific embeddings improve this significantly.

Cache Maintenance

Semantic caches grow large if not pruned:

  • TTL: Set expiry on cached items (stale answers become dangerous over time)
  • Hit count: Track how often each item is hit; evict low-hit items and extend the TTL of high-hit items (see the sketch after this list)
  • Recency filter: Serve only the most recently cached version of similar questions
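
A toy pruning pass combining the first two policies (TTL plus hit counts); the entry fields are illustrative, not any particular cache's schema:

python
import time

def prune(entries, min_hits=2, ttl_extension=3600):
    """entries: dicts with 'expires_at' and 'hits' keys (illustrative schema)."""
    now = time.time()
    kept = []
    for e in entries:
        if now > e["expires_at"] and e["hits"] < min_hits:
            continue                               # stale and rarely used: evict
        if e["hits"] >= min_hits:
            e["expires_at"] = now + ttl_extension  # keep hot items warm
        kept.append(e)
    return kept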

LiteLLM - Production Caching Stack

LiteLLM provides a gateway-level caching layer across all model providers (OpenAI, Anthropic, Gemini, Azure, etc.): one config to cache everything.

Supported cache backends:

  • In-memory (dev only)
  • Disk
  • Redis (exact match, production standard)
  • Qdrant Semantic Cache (meaning-level)
  • Redis Semantic Cache
  • S3 / GCS (long-term storage)
yaml
# config.yaml - exact match + semantic cache setup
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o

litellm_settings:
  cache: true
  cache_params:
    type: redis           # exact prompt cache via Redis
    host: localhost
    port: 6379
    ttl: 600              # 10 minute default TTL
    namespace: "prod.cache"

# For semantic caching, use Qdrant:
# type: qdrant-semantic
# similarity_threshold: 0.95

Per-request cache controls (dynamic overrides):

python
# Assumes the standard OpenAI SDK pointed at a LiteLLM proxy
# (base_url and api_key below are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-...")

# Force fresh response, bypass cache
client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_body={"cache": {"no-cache": True}},  # skip cache check
)

# Custom TTL for this specific request
extra_body={"cache": {"ttl": 300}}             # cache for 5 minutes only

# Only accept cached responses < 10 minutes old
extra_body={"cache": {"s-maxage": 600}}

# Don't store this response (e.g., PII-sensitive)
extra_body={"cache": {"no-store": True}}

Debug your cache:

  • GET /cache/ping returns health status, Redis version, and connection pool info
  • Check the x-litellm-cache-key response header to see which cache key a response was served under
  • Shared auth cache across workers: enable_redis_auth_cache: true prevents each worker pod from making independent DB lookups on key verification
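
A quick sketch of the first two checks against a locally running proxy; the URL and key are placeholders:

python
import requests

PROXY = "http://localhost:4000"                  # placeholder: your LiteLLM proxy URL
HEADERS = {"Authorization": "Bearer sk-..."}     # placeholder proxy key

# Cache health
print(requests.get(f"{PROXY}/cache/ping", headers=HEADERS).json())

# Which cache key was this completion served under?
resp = requests.post(
    f"{PROXY}/chat/completions",
    headers=HEADERS,
    json={"model": "gpt-4o",
          "messages": [{"role": "user", "content": "What is the return policy?"}]},
)
print(resp.headers.get("x-litellm-cache-key"))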


Prompt Caching vs. Semantic Caching Tradeoff

Dimension | Prompt Caching | Semantic Caching
What matches | Exact byte prefix | Embedding similarity
Best for | System prompts, tools, documents | FAQ queries, repeated user intents
Staleness risk | Very low (exact = always same context) | Medium-high (similar ≠ same context)
Implementation complexity | Low (API parameter) | High (vector DB, embedding model, threshold tuning)
Context window issue | Not applicable | Critical: must include history in key
Cold start | Every new deployment | Warm-up time to build cache

Full Four-Layer Production Stack Recap

User Request
     |
Layer 1: Semantic Cache  -> miss ->
Layer 2: Exact Prompt Cache -> miss ->
Layer 3: LLM Inference (inside: Layer 4 KV Cache)
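
As a sketch, the request path through the first three layers looks like this (the helper functions are hypothetical; the KV cache lives inside the inference engine and is invisible at this level):

python
def handle(request):
    # Layer 1: meaning-level lookup (vector similarity, context-aware key)
    if (hit := semantic_cache_lookup(request)) is not None:
        return hit
    # Layer 2: exact-match lookup (e.g., Redis keyed on the exact prompt)
    if (hit := exact_cache_lookup(request)) is not None:
        return hit
    # Layer 3: pay for inference; Layer 4 (KV cache) happens inside the model server
    response = call_llm(request)
    exact_cache_store(request, response)
    semantic_cache_store(request, response)
    return response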

Recall Hook

Semantic cache = meaning-level, needs context window in key, threshold tuning required. LiteLLM = one config to rule all providers. Four-layer stack: Semantic → Exact → LLM → KV.


Sources

Tuned a semantic cache in production? Share your threshold and embedding model; domain-specific data is hard to find.

Built from real deployments. Not theory.