Google takes a different approach to caching than either Anthropic or OpenAI. Gemini 3.1 Pro and Flash-Lite support two distinct caching modes: implicit caching that activates automatically like GPT-5.4, and explicit caching where you create named cache objects with configurable TTLs like Claude. You can use either, or combine both depending on your workload.
There is one important difference to account for that the other providers do not share: Gemini charges for cache storage. This changes the cost calculation, and it means you need to think carefully about TTL settings and cache utilisation before projecting savings. Get this right and Gemini’s one-hour default TTL becomes a significant advantage for applications with moderate traffic. Get it wrong and you pay storage fees without enough cache reads to offset them.
This part covers both caching modes in depth, the full cost model including storage fees, and a complete Python implementation targeting both the Gemini API directly and Vertex AI for enterprise deployments.
Implicit vs Explicit Caching: When to Use Each
Implicit caching is enabled by default on Gemini 3.1 Pro and Flash-Lite. It works the same way as GPT-5.4: the API automatically detects repeated prefixes and serves them from cache with no configuration required. The minimum cacheable size is 2,048 tokens for Gemini 3.1 Pro and 1,024 tokens for Flash-Lite. You see the savings reflected in a cached_content_token_count field in the usage metadata.
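As a quick sanity check on those minimums, a small helper like this (the function name and dictionary are mine, using the thresholds quoted above) can flag prompts that are too short to ever hit the implicit cache:

```python
# Minimum cacheable prompt sizes quoted above; re-check the current
# Gemini docs before relying on these numbers in production.
IMPLICIT_CACHE_MIN_TOKENS = {
    "gemini-3.1-pro": 2048,
    "gemini-3.1-flash-lite": 1024,
}


def is_implicitly_cacheable(model_name: str, prompt_tokens: int) -> bool:
    """Return True if the prompt is long enough for implicit caching."""
    minimum = IMPLICIT_CACHE_MIN_TOKENS.get(model_name)
    if minimum is None:
        raise ValueError(f"Unknown model: {model_name}")
    return prompt_tokens >= minimum
```

Anything below the threshold is billed at the standard input rate every time, so there is no point restructuring very short prompts for cache hits.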
Explicit caching gives you direct control. You create a cache object containing your static content, give it a name, set a TTL, and then reference it by name in subsequent requests. The cache object lives independently of your requests and persists until its TTL expires or you delete it. This is particularly useful for large shared documents that many users or processes query against, because you create the cache once and reference it across thousands of requests.
| Feature | Implicit Caching | Explicit Caching |
|---|---|---|
| Configuration | Automatic, no setup | Create cache object manually |
| TTL control | Managed by Google | Configurable, default 1 hour |
| Storage charge | No | Yes, per token per hour |
| Best for | High-volume, repetitive prompts | Large shared documents, multi-user RAG |
| Min tokens (Pro) | 2,048 | 2,048 |
| Min tokens (Flash-Lite) | 1,024 | 1,024 |
flowchart TD
A[Choose Caching Mode] --> B{Workload Type?}
B -->|High volume, repetitive prompts\nNo shared documents| C[Implicit Caching]
B -->|Large shared documents\nMulti-user RAG pipelines| D[Explicit Caching]
B -->|Both patterns present| E[Combine Both Modes]
C --> F[Automatic prefix detection\nNo storage fee\nGoogle-managed TTL]
D --> G[Create cache object once\nConfigurable TTL\nStorage fee applies]
E --> H[Explicit cache for documents\nImplicit for conversation history]
style C fill:#166534,color:#fff
style D fill:#1e3a5f,color:#fff
style E fill:#713f12,color:#fff
The Full Cost Model: Accounting for Storage Fees
Gemini is the only major provider that charges for cache storage, so the cost calculation requires one extra step. Here is the complete breakdown for Gemini 3.1 Pro:
- Standard input tokens: $2.00 per million tokens
- Cached input tokens (read): $0.50 per million tokens (75% discount)
- Cache storage: $1.00 per million tokens per hour
- Output tokens: $18.00 per million tokens
For Flash-Lite the rates are significantly lower at $0.25 per million input, $0.0625 per million cached reads, and $0.50 per million storage per hour, making it the most cost-efficient option for high-volume use cases that do not require frontier reasoning quality.
The break-even calculation for explicit caching works like this. Suppose you cache 100,000 tokens for 1 hour. Storage cost is $0.10 (100k tokens at $1.00/M/hour). Each read saves you $0.15 versus standard input: 100k tokens at ($2.00 - $0.50) per million. You break even after a single read, and every subsequent read within the hour is net positive. For any document queried more than once per TTL window, explicit caching makes financial sense.
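That arithmetic generalises into a small helper. This is a sketch (the function name and defaults are mine; the defaults are the Gemini 3.1 Pro rates listed above):

```python
def explicit_cache_net_savings(
    cached_tokens: int,
    reads: int,
    hours_stored: float,
    input_price_per_m: float = 2.00,         # standard input, USD per M tokens
    cached_price_per_m: float = 0.50,        # cached read, USD per M tokens
    storage_price_per_m_hour: float = 1.00,  # storage, USD per M tokens per hour
) -> float:
    """Net USD saved by explicit caching versus sending the tokens uncached."""
    per_million = cached_tokens / 1_000_000
    savings_per_read = per_million * (input_price_per_m - cached_price_per_m)
    storage_cost = per_million * storage_price_per_m_hour * hours_stored
    return reads * savings_per_read - storage_cost
```

A negative result means the cache cost more in storage than it saved in reads, which is the failure mode to watch for with long TTLs and light traffic.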
Setup: Installing the Gemini SDK
pip install google-generativeai google-cloud-aiplatform
Python Implementation: Implicit Caching
Implicit caching requires no code changes beyond reading the usage metadata to confirm hits are occurring. Here is a production client that structures prompts correctly and tracks cache performance:
# gemini_implicit_cache_client.py
import os
from dataclasses import dataclass, field
from google import generativeai as genai
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
# Gemini 3.1 Pro pricing per million tokens (USD)
PRICING = {
"input": 2.00,
"cached_read": 0.50,
"output": 18.00,
}
@dataclass
class UsageStats:
input_tokens: int = 0
cached_tokens: int = 0
output_tokens: int = 0
cache_hit: bool = False
actual_cost_usd: float = 0.0
savings_usd: float = 0.0
@dataclass
class CacheMetrics:
total_requests: int = 0
cache_hits: int = 0
total_input_tokens: int = 0
total_cached_tokens: int = 0
total_savings_usd: float = 0.0
@property
def hit_rate_percent(self) -> float:
if self.total_requests == 0:
return 0.0
return round(self.cache_hits / self.total_requests * 100, 1)
@property
def token_efficiency_percent(self) -> float:
if self.total_input_tokens == 0:
return 0.0
return round(self.total_cached_tokens / self.total_input_tokens * 100, 1)
def calculate_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> tuple[float, float]:
non_cached_input = input_tokens - cached_tokens
actual_cost = (
(non_cached_input / 1_000_000) * PRICING["input"]
+ (cached_tokens / 1_000_000) * PRICING["cached_read"]
+ (output_tokens / 1_000_000) * PRICING["output"]
)
full_cost = (
(input_tokens / 1_000_000) * PRICING["input"]
+ (output_tokens / 1_000_000) * PRICING["output"]
)
return actual_cost, full_cost - actual_cost
class GeminiImplicitCacheClient:
"""
Client for Gemini 3.1 Pro with implicit caching.
Structures prompts for maximum cache hit rates:
static system instruction first, dynamic content last.
"""
def __init__(self, model_name: str = "gemini-3.1-pro"):
self.model = genai.GenerativeModel(model_name)
self.metrics = CacheMetrics()
def chat(
self,
system_instruction: str,
conversation_history: list[dict],
user_message: str,
) -> tuple[str, UsageStats]:
"""
Send a message with cache-optimised prompt structure.
system_instruction must be fully static for cache hits.
"""
# Build history in correct order - never reorder existing turns
history = []
for turn in conversation_history:
history.append({
"role": turn["role"],
"parts": [turn["content"]],
})
        # Bind the static system instruction to the model itself.
        # Gemini places system_instruction before all messages,
        # making it the most cacheable part of every request.
        # (send_message does not accept a system_instruction argument.)
        model = genai.GenerativeModel(
            self.model.model_name,
            system_instruction=system_instruction,
        )
        chat = model.start_chat(history=history)
        response = chat.send_message(
            user_message,
            generation_config=genai.GenerationConfig(max_output_tokens=2048),
        )
usage = response.usage_metadata
input_tokens = usage.prompt_token_count
cached_tokens = getattr(usage, "cached_content_token_count", 0) or 0
output_tokens = usage.candidates_token_count
cache_hit = cached_tokens > 0
actual_cost, savings = calculate_cost(input_tokens, cached_tokens, output_tokens)
# Update cumulative metrics
self.metrics.total_requests += 1
self.metrics.total_input_tokens += input_tokens
self.metrics.total_cached_tokens += cached_tokens
self.metrics.total_savings_usd += savings
if cache_hit:
self.metrics.cache_hits += 1
stats = UsageStats(
input_tokens=input_tokens,
cached_tokens=cached_tokens,
output_tokens=output_tokens,
cache_hit=cache_hit,
actual_cost_usd=actual_cost,
savings_usd=savings,
)
return response.text, stats
def get_metrics(self) -> CacheMetrics:
return self.metrics
Python Implementation: Explicit Caching
Explicit caching is where Gemini’s approach really differentiates itself. You create a cache object containing your large static content, then reference it across many requests. Here is a production implementation that manages cache lifecycle including creation, reuse, and cost tracking:
# gemini_explicit_cache_client.py
import os
import time
from datetime import timedelta
from dataclasses import dataclass
from typing import Optional
import google.generativeai as genai
from google.generativeai import caching
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
# Explicit cache storage pricing per million tokens per hour
STORAGE_PRICE_PER_MILLION_PER_HOUR = 1.00
INPUT_PRICE_PER_MILLION = 2.00
CACHED_READ_PRICE_PER_MILLION = 0.50
OUTPUT_PRICE_PER_MILLION = 18.00
@dataclass
class ExplicitCacheHandle:
cache_name: str
cached_token_count: int
ttl_hours: float
created_at: float
storage_cost_per_hour_usd: float
@property
def is_expired(self) -> bool:
age_hours = (time.time() - self.created_at) / 3600
return age_hours >= self.ttl_hours
@property
def total_storage_cost_usd(self) -> float:
age_hours = min(
(time.time() - self.created_at) / 3600,
self.ttl_hours
)
return self.storage_cost_per_hour_usd * age_hours
class GeminiExplicitCacheClient:
"""
Client for Gemini 3.1 Pro with explicit context caching.
Creates named cache objects for large shared content
(documents, knowledge bases, system context).
"""
def __init__(self, model_name: str = "gemini-3.1-pro"):
self.model_name = model_name
self._active_caches: dict[str, ExplicitCacheHandle] = {}
def create_document_cache(
self,
cache_key: str,
system_instruction: str,
documents: list[dict],
ttl_hours: float = 1.0,
) -> ExplicitCacheHandle:
"""
Create an explicit cache object for large shared documents.
cache_key is your logical identifier; the actual cache name
is returned in the handle for use in requests.
"""
# Build document content for the cache
document_parts = []
for doc in documents:
document_parts.append(
f"## Document: {doc['title']}\n\n{doc['content']}"
)
combined_content = "\n\n---\n\n".join(document_parts)
# Create the cache object with Gemini API
ttl_seconds = int(ttl_hours * 3600)
cache = caching.CachedContent.create(
model=self.model_name,
system_instruction=system_instruction,
contents=[combined_content],
ttl=timedelta(seconds=ttl_seconds),
display_name=cache_key,
)
cached_token_count = cache.usage_metadata.total_token_count
storage_cost_per_hour = (
cached_token_count / 1_000_000
) * STORAGE_PRICE_PER_MILLION_PER_HOUR
handle = ExplicitCacheHandle(
cache_name=cache.name,
cached_token_count=cached_token_count,
ttl_hours=ttl_hours,
created_at=time.time(),
storage_cost_per_hour_usd=storage_cost_per_hour,
)
self._active_caches[cache_key] = handle
print(
f"[Cache Created] key={cache_key} | "
f"tokens={cached_token_count:,} | "
f"ttl={ttl_hours}h | "
f"storage_cost/h=${storage_cost_per_hour:.6f}"
)
return handle
def query_with_cache(
self,
cache_key: str,
user_query: str,
        conversation_history: Optional[list[dict]] = None,
) -> tuple[str, dict]:
"""
Query using an existing explicit cache.
        Raises ValueError if the cache is missing or has expired.
"""
handle = self._active_caches.get(cache_key)
if not handle or handle.is_expired:
raise ValueError(
f"Cache '{cache_key}' does not exist or has expired. "
"Call create_document_cache first."
)
# Build model using cached content reference
model = genai.GenerativeModel.from_cached_content(
cached_content=handle.cache_name
)
# Build history
history = []
if conversation_history:
for turn in conversation_history:
history.append({
"role": turn["role"],
"parts": [turn["content"]],
})
chat = model.start_chat(history=history)
response = chat.send_message(user_query)
usage = response.usage_metadata
input_tokens = usage.prompt_token_count
cached_tokens = getattr(usage, "cached_content_token_count", 0) or 0
output_tokens = usage.candidates_token_count
# Calculate costs including storage
non_cached = input_tokens - cached_tokens
read_cost = (
(non_cached / 1_000_000) * INPUT_PRICE_PER_MILLION
+ (cached_tokens / 1_000_000) * CACHED_READ_PRICE_PER_MILLION
+ (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION
)
storage_so_far = handle.total_storage_cost_usd
return response.text, {
"input_tokens": input_tokens,
"cached_tokens": cached_tokens,
"output_tokens": output_tokens,
"cache_hit": cached_tokens > 0,
"read_cost_usd": round(read_cost, 8),
"storage_cost_so_far_usd": round(storage_so_far, 8),
"total_cost_usd": round(read_cost + storage_so_far, 8),
}
def delete_cache(self, cache_key: str) -> None:
"""Explicitly delete a cache to stop storage charges."""
handle = self._active_caches.pop(cache_key, None)
if handle:
            cache = caching.CachedContent.get(name=handle.cache_name)
cache.delete()
print(
f"[Cache Deleted] key={cache_key} | "
f"total_storage_cost=${handle.total_storage_cost_usd:.6f}"
)
def list_active_caches(self) -> list[dict]:
return [
{
"key": key,
"cache_name": h.cache_name,
"cached_tokens": h.cached_token_count,
"ttl_hours": h.ttl_hours,
"is_expired": h.is_expired,
"storage_cost_so_far_usd": round(h.total_storage_cost_usd, 6),
}
for key, h in self._active_caches.items()
]
Multi-User RAG with Shared Explicit Cache
This is the most compelling use case for Gemini’s explicit caching: a shared knowledge base queried by many users simultaneously. You create the cache once, and all users share the same cached document tokens.
sequenceDiagram
participant Admin as Admin Process
participant API as Gemini API
participant Store as Cache Store
participant U1 as User 1
participant U2 as User 2
participant U3 as User 3
Admin->>API: create_document_cache(docs, ttl=1h)
API->>Store: Store KV tensors for docs
API-->>Admin: cache_name + token_count
U1->>API: query_with_cache(cache_name, query1)
Store-->>API: Load cached doc tensors
API-->>U1: Response (75% discount on doc tokens)
U2->>API: query_with_cache(cache_name, query2)
Store-->>API: Load cached doc tensors
API-->>U2: Response (75% discount on doc tokens)
U3->>API: query_with_cache(cache_name, query3)
Store-->>API: Load cached doc tensors
API-->>U3: Response (75% discount on doc tokens)
Note over Store: Single storage cost shared across all users
# rag_shared_cache_example.py
import asyncio
from gemini_explicit_cache_client import GeminiExplicitCacheClient
SYSTEM_INSTRUCTION = """You are an enterprise technical support specialist.
Answer questions accurately based only on the provided documentation.
If information is not in the documents, say so clearly.
Format responses with clear structure and cite document sections."""
PRODUCT_DOCS = [
{
"title": "API Reference v4.2",
"content": """
## Authentication
All API requests require a Bearer token in the Authorization header.
Tokens expire after 24 hours. Use POST /auth/refresh to renew.
## Rate Limits
- Standard tier: 1,000 requests/minute
- Enterprise tier: 10,000 requests/minute
- Burst allowance: 2x the tier limit for up to 30 seconds
## Endpoints
### GET /v4/data
Returns paginated data records. Parameters: page, limit (max 500), filter.
### POST /v4/data
Creates a new record. Body: { type, payload, metadata }.
### DELETE /v4/data/{id}
Soft-deletes a record. Recoverable within 30 days via POST /v4/data/{id}/restore.
""",
},
{
"title": "Troubleshooting Guide v2.1",
"content": """
## Common Errors
### 429 Too Many Requests
You have exceeded your rate limit. Implement exponential backoff starting at 1 second.
Check your tier limits at /account/limits.
### 401 Unauthorized
Token is missing, expired, or malformed. Refresh using POST /auth/refresh.
New tokens are valid for 24 hours.
### 503 Service Unavailable
Temporary outage. Retry after 60 seconds. Check status.api.example.com for incidents.
## Performance Optimisation
Use field projection to request only needed fields: GET /v4/data?fields=id,type,created_at
Enable compression: set Accept-Encoding: gzip in request headers.
""",
},
]
async def simulate_multi_user_support():
client = GeminiExplicitCacheClient()
# Create shared cache once - all users benefit
print("Creating shared document cache...")
handle = client.create_document_cache(
cache_key="product_docs_v4",
system_instruction=SYSTEM_INSTRUCTION,
documents=PRODUCT_DOCS,
ttl_hours=1.0,
)
print(f"Cache ready: {handle.cached_token_count:,} tokens cached\n")
# Simulate concurrent user queries against the same cache
user_queries = [
("user_1", "How do I handle a 429 error in my client?"),
("user_2", "What is the maximum page size for GET /v4/data?"),
("user_3", "How long are authentication tokens valid?"),
("user_4", "Can I recover a deleted record?"),
]
total_savings = 0.0
for user_id, query in user_queries:
print(f"[{user_id}] Query: {query}")
answer, usage = client.query_with_cache("product_docs_v4", query)
print(f"[{user_id}] Answer: {answer[:150]}...")
print(
f"[{user_id}] Cache hit: {usage['cache_hit']} | "
f"Cached tokens: {usage['cached_tokens']:,} | "
f"Read cost: ${usage['read_cost_usd']:.8f}"
)
savings = (usage['cached_tokens'] / 1_000_000) * (2.00 - 0.50)
total_savings += savings
print(f"[{user_id}] Turn savings: ${savings:.6f}\n")
print(f"Total savings across {len(user_queries)} users: ${total_savings:.6f}")
print(f"Storage cost: ${handle.total_storage_cost_usd:.8f}")
print(f"Net savings: ${total_savings - handle.total_storage_cost_usd:.6f}")
# Clean up to stop storage charges
client.delete_cache("product_docs_v4")
if __name__ == "__main__":
    asyncio.run(simulate_multi_user_support())
Combining Implicit and Explicit Caching
For complex enterprise applications, using both modes together gives you the best of each. Explicit caching handles your large shared documents. Implicit caching handles the growing conversation history within each user session. Here is how the two layers interact:
flowchart TD
subgraph ExplicitLayer["Explicit Cache Layer (Admin-managed)"]
D1["Product Docs\n~50k tokens\nTTL: 1 hour\nShared across all users"]
D2["Policy Documents\n~30k tokens\nTTL: 24 hours\nUpdated daily"]
end
subgraph ImplicitLayer["Implicit Cache Layer (Auto-managed per session)"]
S1["User A conversation\nGrows per turn\nAuto-cached by Gemini"]
S2["User B conversation\nGrows per turn\nAuto-cached by Gemini"]
end
subgraph Request["Each API Request"]
R["Explicit cache ref\n+ conversation history\n+ new user message"]
end
ExplicitLayer --> Request
ImplicitLayer --> Request
style ExplicitLayer fill:#1e3a5f,color:#fff
style ImplicitLayer fill:#166534,color:#fff
Vertex AI Implementation for Enterprise
For enterprise deployments on Google Cloud, you will use Vertex AI rather than the direct Gemini API. The caching API is available on Vertex AI with the same semantics but authenticated through Google Cloud credentials:
# vertex_ai_cache_client.py
import os
from datetime import timedelta
import vertexai
from vertexai.generative_models import GenerativeModel, Content, Part
from vertexai.preview import caching as vertex_caching
PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]
LOCATION = "us-central1"
vertexai.init(project=PROJECT_ID, location=LOCATION)
class VertexAICacheClient:
"""
Explicit context caching client for Vertex AI deployments.
Uses service account credentials via Application Default Credentials.
"""
def __init__(self, model_name: str = "gemini-3.1-pro-002"):
self.model_name = model_name
self._caches: dict[str, vertex_caching.CachedContent] = {}
def create_cache(
self,
cache_key: str,
system_instruction: str,
content_text: str,
ttl_hours: float = 1.0,
) -> str:
"""Create a cached content object on Vertex AI."""
cached_content = vertex_caching.CachedContent.create(
model_name=self.model_name,
system_instruction=Content(
role="system",
parts=[Part.from_text(system_instruction)]
),
contents=[
Content(
role="user",
parts=[Part.from_text(content_text)]
)
],
ttl=timedelta(hours=ttl_hours),
)
self._caches[cache_key] = cached_content
print(f"[Vertex AI Cache] Created: {cached_content.name}")
return cached_content.name
def query(self, cache_key: str, user_message: str) -> tuple[str, dict]:
"""Query using a Vertex AI cached content object."""
cached_content = self._caches.get(cache_key)
if not cached_content:
raise ValueError(f"No cache found for key: {cache_key}")
model = GenerativeModel.from_cached_content(cached_content)
response = model.generate_content(user_message)
usage = response.usage_metadata
return response.text, {
"prompt_tokens": usage.prompt_token_count,
            "cached_tokens": getattr(usage, "cached_content_token_count", 0) or 0,
"output_tokens": usage.candidates_token_count,
}
def update_cache_ttl(self, cache_key: str, new_ttl_hours: float) -> None:
"""Extend or reduce TTL on an existing cache."""
cached_content = self._caches.get(cache_key)
if cached_content:
cached_content.update(ttl=timedelta(hours=new_ttl_hours))
print(f"[Vertex AI Cache] TTL updated to {new_ttl_hours}h for {cache_key}")
def delete_cache(self, cache_key: str) -> None:
"""Delete cache to stop storage charges."""
cached_content = self._caches.pop(cache_key, None)
if cached_content:
cached_content.delete()
print(f"[Vertex AI Cache] Deleted: {cache_key}")
TTL Strategy: Matching Cache Lifetime to Traffic Patterns
Because Gemini charges for storage, TTL selection directly affects cost. The right TTL is the shortest window that covers your typical request burst pattern.
For a customer support application that receives most queries during business hours, a 1-hour TTL set at the start of each business day works well. For a document analysis pipeline that runs in nightly batches, creating the cache at batch start and deleting it when the batch completes is more efficient than paying for a full hour TTL that sits idle afterward.
Gemini lets you update TTL on an existing cache without recreating it. If you discover a cache is being used longer than expected, extend the TTL rather than letting it expire and paying a write cost to recreate it.
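One way to operationalise the keep-or-delete decision (a sketch, using the Pro rates quoted earlier; the function name and traffic estimate are mine) is to extend a cache's TTL only when the reads you expect in the next hour are worth more than the storage you would pay to keep it alive:

```python
def worth_keeping(
    cached_tokens: int,
    expected_reads_next_hour: float,
    input_price_per_m: float = 2.00,         # standard input, USD per M tokens
    cached_price_per_m: float = 0.50,        # cached read, USD per M tokens
    storage_price_per_m_hour: float = 1.00,  # storage, USD per M tokens per hour
) -> bool:
    """True if expected read savings over the next hour exceed the storage fee."""
    per_million = cached_tokens / 1_000_000
    expected_savings = expected_reads_next_hour * per_million * (
        input_price_per_m - cached_price_per_m
    )
    hourly_storage = per_million * storage_price_per_m_hour
    return expected_savings > hourly_storage
```

At Pro pricing the threshold works out to two thirds of a read per hour for any cache size, which is why a cache that goes quiet, such as after a batch job, should be deleted rather than left to expire.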
Gemini 3.1 Pro vs Flash-Lite: Choosing the Right Model
For applications where caching is the primary cost driver, Flash-Lite deserves serious consideration. The 45 percent faster response time and substantially lower pricing make it the right choice for high-volume classification, summarisation, or retrieval tasks that do not require frontier reasoning depth.
| Criterion | Gemini 3.1 Pro | Gemini 3.1 Flash-Lite |
|---|---|---|
| Input price | $2.00/M tokens | $0.25/M tokens |
| Cached read price | $0.50/M tokens | $0.0625/M tokens |
| Storage price | $1.00/M/hour | $0.50/M/hour |
| Output price | $18.00/M tokens | $1.50/M tokens |
| Response speed | Baseline | 45% faster |
| Min cache tokens | 2,048 | 1,024 |
| Best for | Complex reasoning, long docs | High-volume, simpler tasks |
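To compare the two models on a concrete request, a helper using the prices from the table above (a sketch; re-check current pricing before relying on it):

```python
# Per-million-token prices from the comparison table above (USD).
MODEL_PRICING = {
    "gemini-3.1-pro":        {"input": 2.00, "cached": 0.50,   "output": 18.00},
    "gemini-3.1-flash-lite": {"input": 0.25, "cached": 0.0625, "output": 1.50},
}


def request_cost(
    model: str, input_tokens: int, cached_tokens: int, output_tokens: int
) -> float:
    """USD cost of one request, splitting input into cached and uncached tokens."""
    p = MODEL_PRICING[model]
    return (
        ((input_tokens - cached_tokens) / 1_000_000) * p["input"]
        + (cached_tokens / 1_000_000) * p["cached"]
        + (output_tokens / 1_000_000) * p["output"]
    )
```

For a request with 100k input tokens (90k cached) and 1k output tokens, Pro comes to roughly $0.083 and Flash-Lite to under a cent, so the model choice dominates the caching choice for high-volume workloads.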
Common Mistakes with Gemini Context Caching
Ignoring storage costs in ROI projections. Storage charges are small per cache but accumulate if you maintain many caches or set TTLs longer than needed. Always factor storage into your break-even calculation.
Not deleting caches after batch jobs complete. If you create a cache for a nightly batch and leave the TTL at 1 hour, you pay for unused storage. Delete explicitly when the job finishes.
Using explicit caching for small content. The storage overhead and management complexity of explicit caching only makes sense for large shared content above a few thousand tokens. For smaller prompts, rely on implicit caching.
Not reading cached_content_token_count. Implicit caching is silent by default. Without instrumenting this field you have no visibility into whether hits are occurring. Always log it.
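A defensive pattern for that last point: depending on SDK version the field can be absent or None, so normalise it before logging. The stub below stands in for a real response's usage_metadata; the helper name is mine:

```python
from types import SimpleNamespace


def cached_token_count(usage_metadata) -> int:
    """Read cached_content_token_count, treating a missing field or None as 0."""
    return getattr(usage_metadata, "cached_content_token_count", 0) or 0


# SimpleNamespace stands in for response.usage_metadata from the SDK
fake_hit = SimpleNamespace(cached_content_token_count=4096)
fake_miss = SimpleNamespace(cached_content_token_count=None)
```

Logging this value on every request is the only way to confirm implicit caching is actually working for your prompt structure.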
What Is Next
Part 5 moves up the stack to semantic caching with Redis 8.6. Unlike the prefix-based caching covered in Parts 2 through 4, semantic caching operates at the application layer using vector embeddings to match similar queries to previously computed responses. It can achieve cache hit rates above 80 percent for repetitive workloads and works alongside any of the provider-level caching mechanisms covered so far.
References
- Google DeepMind – “Gemini 3.1 Pro: A smarter model for your most complex tasks” (https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/)
- SiliconANGLE – “Google launches speedy Gemini 3.1 Flash-Lite model in preview” (https://siliconangle.com/2026/03/03/google-launches-speedy-gemini-3-1-flash-lite-model-preview/)
- Google AI for Developers – “Gemini API Models Documentation” (https://ai.google.dev/gemini-api/docs/models)
- DigitalOcean – “Prompt Caching Explained: OpenAI, Claude, and Gemini” (https://www.digitalocean.com/community/tutorials/prompt-caching-explained)
- PromptHub – “Prompt Caching with OpenAI, Anthropic, and Google Models” (https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models)
- OpenRouter – “Prompt Caching Best Practices” (https://openrouter.ai/docs/guides/best-practices/prompt-caching)
- Google Cloud – “Context Caching Overview – Vertex AI” (https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview)
