If you are running LLMs in production and you have not set up prompt caching, you are paying more than you need to and your users are waiting longer than they should. It is not a complex optimisation. It does not require a new architecture. But most teams skip it because they do not fully understand how it works or how much it actually saves.
This series covers prompt caching and context engineering end to end, with production-ready implementations for Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. Part 1 starts with the foundations: what prompt caching actually is, how it works at the model level, and why it matters more in 2026 than it did even a year ago.
The Problem Prompt Caching Solves
Every time you send a request to an LLM API, the model processes your entire prompt from scratch. That includes the system prompt, any instructions, conversation history, retrieved documents from RAG pipelines, tool definitions, and then finally the user’s actual question at the end.
For a typical enterprise chatbot with a detailed system prompt and a few turns of conversation history, you might be sending 4,000 to 10,000 tokens with every single request. The model reprocesses all of that every time, even though most of it has not changed at all.
At scale, this adds up fast. Consider an application making 50,000 requests per day with an average of 6,000 input tokens per request. That is 300 million tokens of input processed daily. If 70 percent of those tokens are repeated context (system prompt, static instructions, shared documents), you are paying to reprocess 210 million redundant tokens every single day.
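The arithmetic above is worth making concrete. A minimal sketch, using the request volumes from this example and assuming the $2.50 per million standard input price quoted later for GPT-5.4 (your provider's rate will differ):

```python
# Back-of-envelope cost of reprocessing repeated context.
# PRICE_PER_MILLION is an assumed rate for illustration only.
REQUESTS_PER_DAY = 50_000
AVG_INPUT_TOKENS = 6_000
REPEATED_FRACTION = 0.70
PRICE_PER_MILLION = 2.50  # USD per million input tokens (assumed)

total_tokens = REQUESTS_PER_DAY * AVG_INPUT_TOKENS        # 300M tokens/day
redundant_tokens = int(total_tokens * REPEATED_FRACTION)  # 210M tokens/day
daily_waste = redundant_tokens / 1_000_000 * PRICE_PER_MILLION

print(f"{total_tokens:,} tokens/day, {redundant_tokens:,} of them redundant")
print(f"~${daily_waste:,.2f}/day spent reprocessing unchanged context")
```

At that assumed rate, the redundant 210 million tokens alone cost over $500 every day before caching.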
Prompt caching solves this by storing the computed state of your prompt prefix so that repeated portions do not need to be reprocessed. The model recognises the prefix, skips the computation, and jumps straight to processing what is new.
How Prompt Caching Works Under the Hood
To understand caching, you need a basic picture of what actually happens during LLM inference. When a model processes your prompt, it runs through a mechanism called attention, where every token looks at every other token in the context. This produces a set of key-value (KV) tensors for each attention layer in the model. These tensors represent the model’s internal understanding of your input at every layer.
Without caching, these KV tensors are computed fresh for every request and then discarded. With prompt caching, the provider stores these tensors server-side. When a subsequent request starts with the same prefix, the provider detects the match, loads the cached tensors instead of recomputing them, and picks up inference from there.
The result is that the model only needs to process the new tokens at the end of your request, which is typically a fraction of the total input. This cuts both latency and cost significantly.
```mermaid
sequenceDiagram
participant Client
participant API as LLM API Gateway
participant Cache as KV Cache Store
participant Model as LLM Model
Note over Client,Model: First Request (Cache Miss)
Client->>API: Full prompt (system + history + user query)
API->>Cache: Check for cached prefix
Cache-->>API: Cache miss
API->>Model: Process full prompt
Model->>Cache: Store KV tensors for prefix
Model-->>API: Response + cache_creation_tokens
API-->>Client: Response
Note over Client,Model: Subsequent Request (Cache Hit)
Client->>API: Same prefix + new user query
API->>Cache: Check for cached prefix
Cache-->>API: Cache hit - return KV tensors
API->>Model: Process only new tokens
Model-->>API: Response + cache_read_tokens
API-->>Client: Response (faster, cheaper)
```
Two Types of Caching You Need to Know
Before going into provider implementations, it helps to understand that there are actually two distinct caching strategies that often get conflated.
Prompt Caching (Prefix Caching)
This is what all three major providers now offer. It caches the computed KV state of a stable prompt prefix. The match must be exact: the same bytes in the same order. Even a single character difference at any point in the cached prefix will cause a cache miss. This is why prompt structure matters so much, and why we will spend time in later parts discussing how to engineer your prompts for maximum cache hit rates.
The key rule is: static content first, dynamic content last. Your system prompt and any shared context should always come before the user’s query and any per-request variables.
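As a sketch of that rule, here is roughly what a static-first request body looks like using Anthropic-style `cache_control` breakpoints (covered in depth in Part 2). The prompt text and field values are hypothetical; the point is the ordering:

```python
# Static-first request layout: everything before the cache breakpoint is
# byte-identical across requests; only the user turn varies.
SYSTEM_PROMPT = "You are a support assistant for AcmeCo..."  # hypothetical
SHARED_DOCS = "<product manual contents>"                    # hypothetical

def build_request(user_query: str) -> dict:
    return {
        "system": [
            {"type": "text", "text": SYSTEM_PROMPT},
            {
                "type": "text",
                "text": SHARED_DOCS,
                # Anthropic-style breakpoint: cache everything up to here.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Dynamic content comes last so it never invalidates the prefix.
        "messages": [{"role": "user", "content": user_query}],
    }

req_a = build_request("How do I reset my password?")
req_b = build_request("What is the refund policy?")
# The cacheable portion is identical across both requests.
assert req_a["system"] == req_b["system"]
```

If the user query (or a timestamp, or a user ID) appeared anywhere inside `system`, every request would be a cache miss.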
Semantic Caching (Request-Response Caching)
This is a separate technique that operates at the application layer rather than the model layer. Instead of caching internal model state, you cache the final output for a given input. When a new request comes in, you compare it semantically (using vector embeddings) to previously seen requests. If it is similar enough, you return the cached response without calling the LLM at all.
Semantic caching can achieve hit rates above 80 percent for applications with repetitive queries, like customer support bots or internal knowledge tools. It works well alongside prompt caching since they operate at different layers and solve different problems. Part 5 of this series covers semantic caching with Redis in depth.
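The mechanics can be sketched in a few lines. This toy version uses a bag-of-words stand-in for embeddings; a production system would call a real embedding model and tune the similarity threshold per application:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Toy semantic cache: returns a stored response when a new query is
    similar enough to one seen before, skipping the LLM call entirely."""
    def __init__(self, embed, threshold=0.9):
        self.embed, self.threshold = embed, threshold
        self.entries = []  # list of (vector, response)

    def get(self, query):
        qv = self.embed(query)
        for vec, response in self.entries:
            if cosine(qv, vec) >= self.threshold:
                return response  # hit: no LLM call needed
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

# Stand-in embedding over a tiny vocabulary (illustrative only).
VOCAB = ["reset", "password", "refund", "policy", "how"]
def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

cache = SemanticCache(toy_embed, threshold=0.8)
cache.put("how reset password", "Visit the account page...")
print(cache.get("reset password how"))  # similar wording -> cache hit
```

Note the contrast with prefix caching: the match here is approximate and happens before the model is ever invoked.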
```mermaid
flowchart TD
A[Incoming User Request] --> B{Semantic Cache Check}
B -->|Hit - Similar Query Found| C[Return Cached Response]
B -->|Miss - New Query| D{Prompt Cache Check}
D -->|Hit - Prefix Cached| E[Process New Tokens Only]
D -->|Miss - Cold Start| F[Process Full Prompt]
E --> G[Generate Response]
F --> G
G --> H[Store in Semantic Cache]
G --> I[Response to User]
C --> I
style C fill:#22c55e,color:#fff
style E fill:#3b82f6,color:#fff
style F fill:#ef4444,color:#fff
```
The Numbers: What You Actually Save
The savings differ by provider, but they are substantial across the board. Here is a practical breakdown using current 2026 pricing.
Claude Sonnet 4.6
Cache writes cost 25 percent more than standard input tokens. Cache reads cost 90 percent less than standard input tokens. With a 5-minute default TTL (extendable to 1 hour at additional cost), repeated requests within that window benefit from the full discount. Anthropic reports up to 85 percent faster response times for long cached prompts.
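Those two multipliers (writes at 1.25x standard, reads at 0.10x) make the break-even easy to work out. A quick sketch of the relative cost of one cache write followed by N hits within the TTL, versus sending the same prefix uncached every time:

```python
# Break-even for Claude-style caching: writes at 1.25x standard input price,
# reads at 0.10x. How many hits within the TTL pay back the write premium?
WRITE_MULT, READ_MULT = 1.25, 0.10

def relative_cost(hits: int) -> float:
    """Cost of 1 cache write + `hits` cache reads, relative to sending
    the same prefix uncached (1 + hits) times."""
    cached = WRITE_MULT + hits * READ_MULT
    uncached = 1 + hits
    return cached / uncached

for hits in range(4):
    print(f"{hits} hits: {relative_cost(hits):.1%} of uncached cost")
```

With zero hits you pay a 25 percent penalty, but a single cache hit within the window already puts you ahead (67.5 percent of uncached cost), and savings compound with every further reuse.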
GPT-5.4
Prompt caching is automatic on GPT-5.4 with no configuration required. The model requires a minimum of 1,024 tokens to activate caching. Cache reads are priced at $0.625 per million tokens compared to $2.50 per million for standard input, a 75 percent reduction. Cache entries remain active for 5 to 10 minutes of inactivity, with longer persistence during off-peak periods. GPT-5.4 also introduces Tool Search, which reduces the token overhead of large tool sets by loading tool definitions on demand rather than upfront.
Gemini 3.1 Pro
Gemini uses the term context caching rather than prompt caching, but the mechanism is the same. The default TTL is 1 hour, which is more generous than other providers. Cache reads receive a 75 percent discount on input tokens. However, Gemini is the only major provider that charges for cache storage itself, so the cost calculation requires accounting for both read savings and storage fees. The minimum cacheable size is 2,048 tokens for Gemini 3.1 Pro.
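That storage fee changes the shape of the cost model: savings scale with hit rate, but storage accrues whether or not the cache is hit. A sketch of the trade-off, with both prices assumed for illustration (check current Gemini pricing before relying on these numbers):

```python
# Gemini-style cost model: reads are discounted 75%, but cached tokens also
# accrue a per-hour storage fee. Both rates below are assumptions.
INPUT_PER_M = 1.25         # assumed standard input price, USD per M tokens
READ_DISCOUNT = 0.75
STORAGE_PER_M_HOUR = 4.50  # assumed storage fee, USD per M tokens per hour

def net_saving(cached_tokens_m: float, hits_per_hour: float, hours: float = 1.0) -> float:
    """Read savings minus storage cost over the period, in USD."""
    saved = hits_per_hour * hours * cached_tokens_m * INPUT_PER_M * READ_DISCOUNT
    storage = cached_tokens_m * STORAGE_PER_M_HOUR * hours
    return saved - storage

# 50k cached tokens, 100 hits/hour: read savings dominate storage fees.
print(net_saving(cached_tokens_m=0.05, hits_per_hour=100))
# Same cache at 2 hits/hour: storage fees outweigh the read savings.
print(net_saving(cached_tokens_m=0.05, hits_per_hour=2))
```

The takeaway holds regardless of the exact rates: with Gemini, low-traffic caches can lose money, so the hit rate matters twice over.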
| Provider | Model | Cache Write Cost | Cache Read Cost | Min Tokens | Default TTL |
|---|---|---|---|---|---|
| Anthropic | Claude Sonnet 4.6 | +25% vs standard | -90% vs standard | 1,024 | 5 minutes |
| OpenAI | GPT-5.4 | No extra charge | $0.625/M (vs $2.50/M) | 1,024 | 5-10 minutes |
| Google | Gemini 3.1 Pro | Storage fee applies | -75% vs standard | 2,048 | 1 hour |
Where Prompt Caching Helps Most
Not every application benefits equally from prompt caching. The gains are proportional to how much of your prompt is static versus dynamic. Here are the scenarios where it has the biggest impact.
Long System Prompts
Enterprise applications often have detailed system prompts covering persona, output format, compliance rules, and domain-specific instructions. These can run into thousands of tokens and never change between requests. This is the simplest and most immediate caching win.
RAG Pipelines with Shared Documents
When multiple users query the same set of documents (a product manual, a policy document, a codebase), those documents get included in every prompt. If you structure your prompt so the documents appear before the user query, the document content can be cached and reused across all users making requests against the same source material.
Multi-Turn Conversations
Each turn of a conversation grows the context window. Without caching, you resend the entire conversation history with each new message. With caching, the earlier turns stay in the cache and only the new message needs to be processed. In long conversations, this alone can reduce input token costs by 60 to 80 percent.
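A quick simulation makes the growth visible. The turn sizes are illustrative, and the comparison counts tokens processed fresh (cached tokens are still billed at the discounted read rate, so the cost reduction is smaller than the processing reduction):

```python
# Cumulative input tokens over a conversation, with and without prefix
# caching. Assumes a 2,000-token system prompt, 300 tokens added per turn,
# and every earlier turn staying cached (illustrative numbers).
SYSTEM = 2_000
PER_TURN = 300

def conversation_cost(turns: int):
    uncached = cached = 0
    for t in range(1, turns + 1):
        context = SYSTEM + t * PER_TURN
        uncached += context   # resend and reprocess the full context
        cached += PER_TURN    # only the newest turn is processed fresh
    return uncached, cached

full, hit = conversation_cost(20)
print(f"uncached: {full:,} tokens, cached: {hit:,} tokens "
      f"({1 - hit / full:.0%} fewer tokens processed fresh)")
```

The longer the conversation runs, the larger the share of each request that is pure repetition, which is why long sessions benefit the most.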
Agentic Workflows
Agents often make dozens of LLM calls in a single workflow. The task description, available tools, and prior action history are repeated across many of those calls. Prompt caching is especially valuable here because the context grows with every step and the static portions (task description, tool definitions) are repeated throughout. Recent research using GPT-5.4 and Claude Sonnet 4.5 on long-horizon agentic benchmarks showed caching reduced input costs by 40 to 70 percent across multi-step workflows.
```mermaid
flowchart LR
subgraph WithoutCaching["Without Prompt Caching"]
direction TB
R1["Request 1\nSystem + History(0) + Query"] --> M1[Full Processing]
R2["Request 2\nSystem + History(1) + Query"] --> M2[Full Processing]
R3["Request 3\nSystem + History(2) + Query"] --> M3[Full Processing]
end
subgraph WithCaching["With Prompt Caching"]
direction TB
C1["Request 1\nSystem + History(0) + Query"] --> P1[Full Processing + Cache Write]
C2["Request 2\n[CACHED] + History(1) + Query"] --> P2[Partial Processing]
C3["Request 3\n[CACHED] + History(2) + Query"] --> P3[Partial Processing]
end
style WithoutCaching fill:#fee2e2,stroke:#ef4444
style WithCaching fill:#dcfce7,stroke:#22c55e
```
What Prompt Caching Does Not Do
It is worth being direct about the limitations before you build your architecture around caching assumptions.
Prompt caching does not change output quality. The model produces the same response whether it reads from cache or processes fresh. The KV tensors it loads are mathematically identical to what it would have computed, so there is no quality trade-off.
Prompt caching does not persist indefinitely. Cache entries expire. Claude’s default TTL is 5 minutes, GPT-5.4 caches for 5 to 10 minutes of inactivity, and Gemini’s default is 1 hour. If your application has low traffic or long gaps between requests, you will miss the cache frequently. Understanding your traffic patterns is essential for estimating real-world cache hit rates before making cost projections.
Prompt caching does not work with dynamic prefixes. If any part of your cached prefix changes between requests, you get a cache miss. This includes whitespace, punctuation, JSON key ordering, or anything else that alters the exact byte sequence. Dynamic content like timestamps, user IDs, or personalisation fields must always come after your cache breakpoints, not before.
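JSON key ordering is the classic silent offender here. If shared context is serialised from a dict, insertion order leaks into the bytes; serialising canonically keeps the prefix stable. A minimal sketch:

```python
import json

# Exact-byte matching means key order and whitespace matter. Serialising
# shared context canonically (sorted keys, fixed separators) keeps the
# cached prefix byte-stable across requests.
config_a = {"tone": "formal", "language": "en"}
config_b = {"language": "en", "tone": "formal"}  # same data, different order

naive_a, naive_b = json.dumps(config_a), json.dumps(config_b)
assert naive_a != naive_b  # insertion order leaks into the bytes -> cache miss

def canonical(obj) -> str:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

assert canonical(config_a) == canonical(config_b)  # identical prefix bytes
```

The same discipline applies to templating: render static sections from fixed strings, not from data structures whose serialisation can drift.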
Prompt caching is not free on writes. Claude charges a 25 percent premium on cache write tokens. Gemini charges storage fees. You need enough cache hits to offset the write cost before caching becomes net positive. For low-volume or highly variable prompts, the math may not work in your favour.
Context Engineering: The Bigger Picture
Prompt caching is one piece of a broader discipline called context engineering. As models have become more capable and context windows have grown (GPT-5.4 supports 1 million tokens via the API), the problem has shifted from “can the model handle this?” to “how do I structure and manage context efficiently at production scale?”
Context engineering covers how you organise information within your context window, what you include and exclude, how you cache and reuse stable portions, how you retrieve and inject dynamic information from external sources, and how you measure the cost and quality impact of all of these decisions.
A well-engineered context is one that gives the model exactly what it needs, nothing more, structured in a way that maximises cache reuse and minimises token waste. Getting this right is increasingly a core engineering competency for teams running AI in production.
What Is Coming in This Series
This series builds from here into full production implementations across all three major providers, with working code in Node.js, Python, and C#.
- Part 2: Prompt caching with Claude Sonnet 4.6 – explicit cache_control breakpoints, TTL configuration, multi-breakpoint strategies, and Node.js implementation
- Part 3: Prompt caching with GPT-5.4 – automatic caching, prefix structure rules, Tool Search integration, and C# implementation
- Part 4: Context caching with Gemini 3.1 Pro and Flash-Lite – implicit vs explicit caching, TTL management, storage cost accounting, and Python implementation
- Part 5: Semantic caching with Redis 8.6 – vector similarity matching, hit rate optimisation, and enterprise deployment patterns
- Part 6: Context engineering strategies – prompt structure design, static-first architecture, and cache-aware RAG patterns
- Part 7: Multi-provider caching with a unified AI gateway – routing, fallback, and cross-provider cache tracking in Node.js
- Part 8: Production monitoring and cost optimisation – cache hit rate dashboards, TTFT measurement, and ROI calculation frameworks
References
- DigitalOcean – “Prompt Caching Explained: OpenAI, Claude, and Gemini” (https://www.digitalocean.com/community/tutorials/prompt-caching-explained)
- OpenRouter Documentation – “Prompt Caching Best Practices” (https://openrouter.ai/docs/guides/best-practices/prompt-caching)
- arXiv – “Don’t Break the Cache: Prompt Caching for Long-Horizon Agentic Tasks” (https://arxiv.org/html/2601.06007v1)
- OpenAI – “Introducing GPT-5.4” (https://openai.com/index/introducing-gpt-5-4/)
- Google DeepMind – “Gemini 3.1 Pro: A smarter model for your most complex tasks” (https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/)
- DasRoot – “Caching Strategies for LLM Responses 2026” (https://dasroot.net/posts/2026/02/caching-strategies-for-llm-responses/)
- PromptHub – “Prompt Caching with OpenAI, Anthropic, and Google Models” (https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models)
