The previous four parts focused on the mechanics of caching: how each provider stores KV tensors, how to mark breakpoints, how to build semantic cache layers. This part steps back and asks a broader question: how should you design your entire context strategy so that caching, retrieval, and cost control all work together rather than against each other?
Context engineering is the discipline of deciding what goes into your LLM context window, in what order, how much of it, and how it changes across requests. As models support million-token context windows and production costs scale with every token processed, getting this right is no longer optional. It is a core competency for any team running AI at scale.
This part covers the principles and patterns that apply across all three providers: static-first architecture, cache-aware RAG pipeline design, prompt versioning as code, token budget management, and context hygiene.
The Static-First Principle
Every caching system in this series shares one requirement: stable content must come before dynamic content. This is a hard constraint imposed by prefix matching. Any token that varies between requests, placed before a stable section, breaks the cached prefix and forces everything after it to be reprocessed.
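To see why a single varying token is so destructive, it helps to model what prefix matching does. The sketch below is illustrative only, not any provider's actual tokenizer or matching logic; it approximates tokens as whitespace-separated words to show how a leading dynamic value collapses the shared prefix:

```javascript
// Illustrative sketch - NOT a provider API. Approximates tokens as
// whitespace-separated words to show how prefix matching behaves.
const sharedPrefixLength = (a, b) => {
  const ta = a.split(/\s+/);
  const tb = b.split(/\s+/);
  let n = 0;
  while (n < ta.length && n < tb.length && ta[n] === tb[n]) n++;
  return n;
};

// Static-first: only the final user message differs
const staticFirst1 = 'SYSTEM RULES ... DOCS ... user: hello';
const staticFirst2 = 'SYSTEM RULES ... DOCS ... user: goodbye';

// Dynamic-first: a date injected before the stable content
const dynamicFirst1 = 'date: 2026-03-01 SYSTEM RULES ... DOCS ...';
const dynamicFirst2 = 'date: 2026-03-02 SYSTEM RULES ... DOCS ...';

// Static-first shares a long prefix; dynamic-first shares almost nothing,
// so everything after the date is reprocessed on every request.
console.log(sharedPrefixLength(staticFirst1, staticFirst2));
console.log(sharedPrefixLength(dynamicFirst1, dynamicFirst2));
```

The real systems match cached prefixes at token granularity, but the failure mode is the same: one varying token near the front costs you every cached token behind it.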
Map every element of your prompt to a stability zone before deciding where to place cache breakpoints. Zone 1 and Zone 2 content should always be cached. Zone 3 can be cached with care as it grows. Zone 4 should never be cached.
flowchart TD
subgraph Z1["Zone 1 - Permanent Static"]
A["System role, tone, compliance rules\nOutput format, safety guidelines"]
end
subgraph Z2["Zone 2 - Session Static"]
B["Shared docs, product knowledge\nTool definitions, policy docs"]
end
subgraph Z3["Zone 3 - Session Dynamic"]
C["Conversation history\nUser preferences, session context"]
end
subgraph Z4["Zone 4 - Request Dynamic"]
D["Current user message\nRetrieved RAG chunks, real-time data"]
end
Z1 --> Z2 --> Z3 --> Z4
style Z1 fill:#166534,color:#fff
style Z2 fill:#1e3a5f,color:#fff
style Z3 fill:#713f12,color:#fff
style Z4 fill:#7f1d1d,color:#fff
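The ordering constraint the diagram encodes can be checked mechanically before a prompt ever reaches a provider. A minimal sketch, assuming blocks carry a `zone` number as in the diagram (the validator itself is ours, not part of any SDK):

```javascript
// Check that prompt blocks appear in non-decreasing zone order.
// A block whose zone is lower than any block before it sits behind
// dynamic content, so its tokens can never be served from cache.
function validateStaticFirst(blocks) {
  const violations = [];
  let highestSeen = 0;
  blocks.forEach((block, i) => {
    if (block.zone < highestSeen) {
      violations.push({
        index: i,
        zone: block.zone,
        issue: `Zone ${block.zone} block appears after Zone ${highestSeen} content`,
      });
    }
    highestSeen = Math.max(highestSeen, block.zone);
  });
  return { ok: violations.length === 0, violations };
}

// Example: a current date injected before the knowledge base
const result = validateStaticFirst([
  { zone: 1, label: 'system rules' },
  { zone: 4, label: 'current date' },   // dynamic content too early
  { zone: 2, label: 'knowledge base' }, // flagged - behind Zone 4
]);
```

Running a check like this in CI catches ordering regressions before they silently zero out your cache hit rate in production.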
Three Core Prompt Architecture Patterns
Pattern 1: Monolithic Static Prompt
All static content lives in one large system block with a single cache breakpoint at the end. This is simple and effective when your stable content exceeds 3,000 tokens and truly never changes between requests.
// Pattern 1: Monolithic - single cache breakpoint
// Best for: single product, fully stable instructions
const buildMonolithic = (systemInstructions, documents) => ({
system: [
{
type: 'text',
text: `${systemInstructions}\n\n## Reference Documents\n\n${
documents.map(d => `### ${d.title}\n${d.content}`).join('\n\n')
}`,
cache_control: { type: 'ephemeral' },
},
],
});
Pattern 2: Layered Multi-Breakpoint
Separate your stability zones into distinct blocks with individual cache breakpoints. When Zone 2 content changes (a document update) but Zone 1 stays the same, Zone 1 remains cached. Claude Sonnet 4.6 supports up to four breakpoints, making this pattern well-suited for complex enterprise prompts with multiple distinct static layers.
// Pattern 2: Layered - separate breakpoints per stability zone
// Best for: complex enterprise apps, documents that update independently
const buildLayered = (coreInstructions, knowledgeBase, userPreferences, userMessage) => ({
system: [
{
type: 'text',
text: coreInstructions, // Zone 1: permanent static
cache_control: { type: 'ephemeral' }, // Breakpoint 1
},
{
type: 'text',
text: knowledgeBase, // Zone 2: session static
cache_control: { type: 'ephemeral' }, // Breakpoint 2
},
],
messages: [
{
role: 'user',
content: [
{
type: 'text',
text: userPreferences, // Zone 3: session dynamic
cache_control: { type: 'ephemeral' }, // Breakpoint 3
},
{
type: 'text',
text: userMessage, // Zone 4: request dynamic - no cache_control
},
],
},
],
});
Pattern 3: Composable Prompt Assembly
Rather than building prompts as strings, treat them as typed objects assembled from discrete, versioned components. Each component knows its stability zone. The assembler constructs the final prompt in the correct order and places cache breakpoints automatically based on zone boundaries. This pattern makes prompt versioning, testing, and cache debugging significantly easier.
// Pattern 3: Composable prompt assembly
// Best for: large teams, multiple products, prompts treated as versioned code
class PromptComponent {
constructor({ zone, content, version }) {
this.zone = zone; // 1 | 2 | 3 | 4
this.content = content;
this.version = version; // for cache invalidation tracking
}
}
class PromptAssembler {
constructor() {
this.components = [];
}
add(component) {
this.components.push(component);
return this;
}
build(provider = 'claude') {
// Sort by zone to enforce static-first ordering
const sorted = [...this.components].sort((a, b) => a.zone - b.zone);
if (provider === 'claude') {
return this._buildClaude(sorted);
}
return this._buildOpenAI(sorted);
}
_buildClaude(components) {
const systemComponents = components.filter(c => c.zone <= 2);
const messageComponents = components.filter(c => c.zone >= 3);
const system = systemComponents.map(c => ({
type: 'text',
text: c.content,
// Cache all zone 1 and 2 content
cache_control: { type: 'ephemeral' },
}));
const userContent = messageComponents.map(c => ({
type: 'text',
text: c.content,
// Cache zone 3 (session dynamic) but not zone 4
...(c.zone === 3 && { cache_control: { type: 'ephemeral' } }),
}));
return { system, messages: [{ role: 'user', content: userContent }] };
}
_buildOpenAI(components) {
// OpenAI: static-first ordering only, no explicit markers needed
const systemText = components
.filter(c => c.zone <= 2)
.map(c => c.content)
.join('\n\n');
const messages = [{ role: 'system', content: systemText }];
components
.filter(c => c.zone >= 3) // zones 3 and 4 - the user query must not be dropped
.forEach(c => messages.push({ role: 'user', content: c.content }));
return { messages };
}
}
// Usage
const assembler = new PromptAssembler()
.add(new PromptComponent({
zone: 1,
content: 'You are an enterprise data analyst...',
version: 'v2.1.0',
}))
.add(new PromptComponent({
zone: 2,
content: 'Reference: Internal Data Dictionary v4.2...',
version: 'v4.2.0',
}))
.add(new PromptComponent({
zone: 3,
content: conversationHistory,
version: sessionId,
}))
.add(new PromptComponent({
zone: 4,
content: userQuery,
version: requestId,
}));
const claudePrompt = assembler.build('claude');
const openAIPrompt = assembler.build('openai');
Cache-Aware RAG Pipeline Design
RAG is where context engineering decisions have the highest cost impact. The naive implementation retrieves documents per query and injects them directly before the user message. This puts Zone 2 content (documents) after Zone 3 content (history) in practice, breaking the static-first requirement and preventing document caching entirely.
flowchart LR
subgraph NaiveRAG["Naive RAG - Cache Unfriendly"]
direction TB
N1["System Prompt"] --> N2["Conversation History"]
N2 --> N3["Retrieved Docs per query"]
N3 --> N4["User Query"]
N3 -.->|"Docs change every request\nCache miss every time"| X1["No caching possible\non documents"]
end
subgraph CacheRAG["Cache-Aware RAG - Cache Friendly"]
direction TB
C1["System Prompt\nCache BP1"] --> C2["Shared Base Docs\nCache BP2"]
C2 --> C3["Retrieved Docs\nfor this session\nCache BP3"]
C3 --> C4["User Query\nNo cache"]
C2 -.->|"Shared docs cached\nacross all users"| Y1["Cached once\nReused for all queries"]
end
style NaiveRAG fill:#fee2e2,stroke:#ef4444
style CacheRAG fill:#dcfce7,stroke:#22c55e
The cache-aware RAG pattern separates documents into two tiers. Shared base documents (product manuals, policy docs, common knowledge) go into Zone 2 before conversation history. Per-query retrieved chunks that are unique to each request go into Zone 4 alongside the user message. This way, the expensive shared documents are cached and reused across users, while only the small retrieved chunks vary per request.
# cache_aware_rag.py
import anthropic
client = anthropic.Anthropic()
class CacheAwareRAG:
"""
Separates documents into shared base docs (Zone 2, cached)
and per-query retrieved chunks (Zone 4, not cached).
"""
def __init__(self, system_prompt: str, base_documents: list[dict]):
self.system_prompt = system_prompt
self.base_documents = base_documents
self.conversation_history: list[dict] = []
def _build_base_doc_block(self) -> dict:
"""Zone 2: shared docs, always cached."""
combined = "\n\n".join(
f"## {doc['title']}\n{doc['content']}"
for doc in self.base_documents
)
return {
"type": "text",
"text": f"## Base Knowledge\n\n{combined}",
"cache_control": {"type": "ephemeral"},
}
def query(self, user_question: str, retrieved_chunks: list[str]) -> str:
"""
retrieved_chunks: small, query-specific docs from vector search.
These go in Zone 4 - no caching.
base_documents: large shared docs in Zone 2 - always cached.
"""
user_content = []
# Zone 2: shared base docs (cached via system)
# Zone 3: conversation history (cached on last assistant turn)
# Zone 4: retrieved chunks + user question (never cached)
if retrieved_chunks:
retrieved_text = "\n\n".join(
f"### Retrieved Context {i+1}\n{chunk}"
for i, chunk in enumerate(retrieved_chunks)
)
user_content.append({
"type": "text",
"text": retrieved_text,
# No cache_control - changes every request
})
user_content.append({
"type": "text",
"text": user_question,
})
# Build messages with cached conversation history
messages = list(self.conversation_history)
# Cache breakpoint on last assistant turn if history exists
if messages and messages[-1]["role"] == "assistant":
last = messages[-1]
if isinstance(last["content"], str):
messages[-1] = {
"role": "assistant",
"content": [
{
"type": "text",
"text": last["content"],
"cache_control": {"type": "ephemeral"},
}
],
}
messages.append({"role": "user", "content": user_content})
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
system=[
{
"type": "text",
"text": self.system_prompt,
"cache_control": {"type": "ephemeral"}, # Breakpoint 1
},
self._build_base_doc_block(), # Breakpoint 2
],
messages=messages,
)
answer = response.content[0].text
# Update history
self.conversation_history.append({"role": "user", "content": user_question})
self.conversation_history.append({"role": "assistant", "content": answer})
usage = response.usage
print(f"[RAG] cached={getattr(usage, 'cache_read_input_tokens', 0)} "
f"created={getattr(usage, 'cache_creation_input_tokens', 0)} "
f"input={usage.input_tokens}")
return answer
Prompt Versioning as Code
One of the most impactful practices for production prompt management is treating prompts as versioned code artifacts rather than strings embedded in application logic. This gives you change history, rollback capability, A/B testing, and the ability to trace cache invalidation back to specific prompt changes.
// prompt-registry.js
// Treat prompts as versioned, typed artifacts stored outside application code
const promptRegistry = {
'system.core': {
version: '2.3.1',
zone: 1,
lastModified: '2026-03-01',
content: `You are an enterprise technical assistant...`,
// Changing this invalidates ALL caches using this prompt
},
'knowledge.product-v4': {
version: '4.2.0',
zone: 2,
lastModified: '2026-02-15',
content: `## Product Knowledge Base v4.2\n...`,
// Changing this only invalidates Zone 2 caches
},
'knowledge.policy-2026': {
version: '1.0.0',
zone: 2,
lastModified: '2026-01-10',
content: `## Compliance Policies 2026\n...`,
},
};
// Cache key includes prompt versions - version change = automatic cache invalidation
function buildCacheKey(promptKeys) {
const versions = promptKeys.map(k => `${k}@${promptRegistry[k].version}`);
return versions.join('|');
}
// Detect when a prompt change requires cache warming
function detectCacheImpact(changedPromptKey) {
const prompt = promptRegistry[changedPromptKey];
return {
promptKey: changedPromptKey,
zone: prompt.zone,
impact: prompt.zone === 1
? 'ALL caches will miss on first request after deployment'
: `Zone ${prompt.zone} caches will miss, Zone 1 remains warm`,
recommendation: prompt.zone === 1
? 'Consider cache pre-warming after deployment'
: 'Incremental cache rebuild will occur naturally',
};
}
Token Budget Management
With GPT-5.4 supporting 1 million token context windows and Gemini 3.1 Pro matching this, the temptation is to put everything into context and let the model figure it out. This is expensive, slow, and often produces worse results than a well-curated context. Token budget management means treating your context window as a finite resource allocated deliberately across your stability zones.
| Zone | Suggested Budget | Rationale |
|---|---|---|
| Zone 1: Permanent Static | 1,000 – 3,000 tokens | Instructions do not need to be long to be effective |
| Zone 2: Session Static | 5,000 – 50,000 tokens | Documents and knowledge, cached, amortised across requests |
| Zone 3: Session Dynamic | 2,000 – 10,000 tokens | Conversation history – trim oldest turns when limit approached |
| Zone 4: Request Dynamic | 500 – 3,000 tokens | Retrieved chunks and user message – keep focused and relevant |
// token-budget.js
// Simple token budget enforcer for conversation history trimming
const TOKEN_BUDGETS = {
zone1: 3000,
zone2: 30000,
zone3: 8000,
zone4: 2000,
};
// Rough token estimator (4 chars per token average)
function estimateTokens(text) {
return Math.ceil(text.length / 4);
}
function trimConversationHistory(history, maxTokens = TOKEN_BUDGETS.zone3) {
let totalTokens = 0;
const trimmed = [];
// Walk from most recent backwards, keep until budget exhausted
for (let i = history.length - 1; i >= 0; i--) {
const turnTokens = estimateTokens(
typeof history[i].content === 'string'
? history[i].content
: JSON.stringify(history[i].content)
);
if (totalTokens + turnTokens > maxTokens) break;
trimmed.unshift(history[i]);
totalTokens += turnTokens;
}
// Always keep pairs (user + assistant) to avoid orphaned turns
if (trimmed.length > 0 && trimmed[0].role === 'assistant') {
trimmed.shift();
}
return trimmed;
}
// RAG chunk selector - choose most relevant chunks within budget
function selectRAGChunks(chunks, maxTokens = TOKEN_BUDGETS.zone4 - 500) {
let totalTokens = 0;
const selected = [];
// Chunks should already be sorted by relevance score
for (const chunk of chunks) {
const tokens = estimateTokens(chunk.content);
if (totalTokens + tokens > maxTokens) break;
selected.push(chunk);
totalTokens += tokens;
}
return selected;
}
Context Hygiene
Context hygiene is the practice of auditing what is actually in your context window versus what should be there. Over time, prompts accumulate instructions added to fix specific edge cases, old caveats that are no longer relevant, and redundant phrasing. Each of these costs tokens on every request and can reduce model quality by diluting the signal in your instructions.
Run a context audit periodically. For each element in your Zone 1 and Zone 2 content, ask: is this still relevant, does it overlap with something else, and is it expressed as concisely as it could be? A 20 percent reduction in your system prompt length is 20 percent fewer tokens on every single request, forever.
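The payoff compounds with traffic. A back-of-the-envelope calculation makes it concrete; the prompt size, request volume, and per-token price below are illustrative assumptions, not figures from this series:

```javascript
// Hypothetical numbers for illustration only.
const systemPromptTokens = 5000;   // assumed Zone 1 size before the audit
const reduction = 0.20;            // 20% trim from removing stale instructions
const requestsPerDay = 100000;     // assumed traffic
const pricePerMTok = 0.30;         // assumed $/1M cached input tokens

// Every request carries the system prompt, so the trim applies each time.
const tokensSavedPerDay = systemPromptTokens * reduction * requestsPerDay;
const dollarsSavedPerDay = (tokensSavedPerDay / 1000000) * pricePerMTok;

console.log(`${tokensSavedPerDay} tokens/day, $${dollarsSavedPerDay}/day`);
```

Under these assumptions a one-time audit saves 100 million tokens a day, and the saving persists for the life of the prompt; the same trim also shrinks cache writes and time-to-first-token.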
Also audit for dynamic content that has crept into static zones. Common culprits include formatted dates, environment names, user tier labels, and feature flags embedded in the system message. Each of these breaks caching silently and is easy to miss during code review.
// context-auditor.js
// Detect dynamic content in static zones before deployment
const DYNAMIC_PATTERNS = [
{ pattern: /\d{4}-\d{2}-\d{2}/, label: 'Date (YYYY-MM-DD)' },
{ pattern: /\d{2}:\d{2}:\d{2}/, label: 'Time (HH:MM:SS)' },
{ pattern: /user_id[:\s]+\w+/i, label: 'User ID' },
{ pattern: /session[:\s]+[a-f0-9-]{8,}/i, label: 'Session ID' },
{ pattern: /request[:\s]+[a-f0-9-]{8,}/i, label: 'Request ID' },
{ pattern: /env(ironment)?[:\s]+(prod|staging|dev)/i, label: 'Environment name' },
{ pattern: /tier[:\s]+(free|pro|enterprise)/i, label: 'User tier' },
];
export function auditStaticContent(promptText, zone = 'unknown') {
const warnings = [];
for (const { pattern, label } of DYNAMIC_PATTERNS) {
if (pattern.test(promptText)) {
warnings.push({
zone,
issue: `Potential dynamic content detected: ${label}`,
pattern: pattern.toString(),
recommendation: 'Move this to Zone 4 or remove it from static content',
});
}
}
const tokenEstimate = Math.ceil(promptText.length / 4);
return {
tokenEstimate,
warnings,
clean: warnings.length === 0,
summary: warnings.length === 0
? `Zone ${zone}: Clean (${tokenEstimate} tokens)`
: `Zone ${zone}: ${warnings.length} warning(s) found (${tokenEstimate} tokens)`,
};
}
Cache Pre-Warming on Deployment
When you deploy a new version of your application with updated Zone 1 or Zone 2 content, all existing cache entries for those zones are invalidated. The first request from each user after deployment pays full processing cost. For high-traffic applications, this deployment spike can be significant.
Cache pre-warming solves this by sending synthetic requests immediately after deployment that prime the cache before real user traffic arrives. You only need to do this once per cache TTL window per provider.
// cache-warmer.js
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
export async function warmCache({ systemPrompt, sharedDocuments }) {
console.log('[CacheWarmer] Priming cache after deployment...');
// A minimal synthetic request - just enough to trigger cache write
// The actual response does not matter; only the cache write matters
const warmupRequest = await client.messages.create({
model: 'claude-sonnet-4-6',
max_tokens: 10,
system: [
{
type: 'text',
text: systemPrompt,
cache_control: { type: 'ephemeral' },
},
...sharedDocuments.map((doc, i) => ({
type: 'text',
text: doc.content,
...(i === sharedDocuments.length - 1 && {
cache_control: { type: 'ephemeral' },
}),
})),
],
messages: [{ role: 'user', content: 'Ready.' }],
});
const created = warmupRequest.usage.cache_creation_input_tokens || 0;
console.log(`[CacheWarmer] Done. Cached ${created.toLocaleString()} tokens.`);
return { cachedTokens: created };
}
// Run as part of your deployment pipeline
// e.g., in a post-deploy script or health check endpoint
Provider-Specific Considerations
The static-first principle applies universally, but each provider has nuances worth accounting for in your architecture decisions.
With Claude Sonnet 4.6, you have explicit control and up to four breakpoints. Use this to create separate cache boundaries between your Zone 1 instructions, Zone 2 documents, and growing conversation history. The explicit control makes debugging easier: if your hit rate drops, you know exactly which breakpoint to investigate.
With GPT-5.4, caching is automatic and requires no structural changes. Your only lever is prompt ordering. Keep your system message fully static, never personalise it per request, and ensure your tool definitions are consistent across calls. Tool Search reduces the token overhead of large tool sets automatically.
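Because byte-for-byte stability is the whole game with implicit caching, it can be worth asserting it at runtime. A minimal sketch of such a guard (the function and its warning format are ours, not part of any SDK; a production version might hash messages rather than store them):

```javascript
// Warn when a named system message drifts between requests - any
// variation (a timestamp, a user name) silently defeats implicit
// prefix caching even though every request still succeeds.
const lastSeen = new Map();

function checkSystemStability(promptName, systemMessage) {
  const previous = lastSeen.get(promptName);
  lastSeen.set(promptName, systemMessage);
  if (previous !== undefined && previous !== systemMessage) {
    console.warn(`[CacheGuard] System message "${promptName}" changed between requests`);
    return false; // the cached prefix will miss from this point on
  }
  return true;
}
```

Wiring this into your request path turns a silent cost regression into a visible log line the first time someone personalises the system message.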
With Gemini 3.1 Pro, you have two modes. For shared documents, create explicit cache objects that all users share. For conversation history, rely on implicit caching. Match your TTL to your actual traffic pattern and delete explicit caches promptly when batch jobs complete to avoid unnecessary storage charges.
What Is Next
Part 7 builds a unified multi-provider AI gateway in Node.js that abstracts over all three caching approaches. You define your prompts once and the gateway handles provider-specific caching configuration, routing, fallback, and cross-provider cost tracking transparently. This is the pattern enterprise teams use when they want flexibility to route between providers without rewriting application logic.
References
- Anthropic – “Prompt Caching Documentation” (https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching)
- OpenAI – “Introducing GPT-5.4” (https://openai.com/index/introducing-gpt-5-4/)
- DigitalOcean – “Prompt Caching Explained: OpenAI, Claude, and Gemini” (https://www.digitalocean.com/community/tutorials/prompt-caching-explained)
- The AI Corner – “Context Engineering Guide 2026” (https://www.the-ai-corner.com/p/context-engineering-guide-2026)
- arXiv – “Don’t Break the Cache: Prompt Caching for Long-Horizon Agentic Tasks” (https://arxiv.org/html/2601.06007v1)
- OpenRouter – “Prompt Caching Best Practices” (https://openrouter.ai/docs/guides/best-practices/prompt-caching)
