Prompt Caching with GPT-5.4: Automatic Caching, Tool Search, and C# Production Implementation


Part 2 covered Claude Sonnet 4.6’s explicit cache_control approach where you decide exactly what gets cached. GPT-5.4 takes the opposite philosophy: caching is fully automatic. You do not add any markers, you do not configure TTL, and you do not change your request structure. The API detects repeated prefixes and serves them from cache on its own.

The trade-off is less developer control but simpler setup. The upside is that any application already using GPT-5.4 is potentially benefiting from caching right now without doing anything extra. The challenge is understanding how to structure your prompts so the automatic system actually finds those repeated prefixes and delivers real savings.

This part covers how GPT-5.4’s automatic caching works, what the new Tool Search feature means for agent developers, the practical rules for maximising cache hit rates, and a complete production C# implementation with cost tracking and monitoring.

How GPT-5.4 Automatic Caching Works

When you send a request to GPT-5.4, OpenAI’s infrastructure checks whether the beginning of your prompt matches a recently cached entry. If it does, the model loads the stored KV tensors for that prefix and only processes the new tokens at the end. The cache check and loading happen transparently with no change to your API call.

Three conditions must be met for a cache hit to occur. First, your prompt must contain at least 1,024 tokens. Shorter prompts are never cached. Second, the matching prefix must be exact, byte for byte, with no differences in whitespace, punctuation, or encoding. Third, the cache entry must still be active. GPT-5.4 cache entries typically remain warm for 5 to 10 minutes of inactivity, with longer persistence possible during off-peak periods.

The API response signals cache activity through the usage object. A cached_tokens field under prompt_tokens_details tells you how many of your input tokens were served from cache. If this value is greater than zero, you got a cache hit and paid the reduced rate of $0.625 per million tokens instead of the standard $2.50 per million.
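To make the discount concrete, here is the arithmetic for a hypothetical request with a 10,000-token input of which 8,000 tokens are served from cache, using the rates above (the token counts are illustrative):

```csharp
// Worked example: cost of a single request with a partial cache hit.
// Rates from the article: $2.50/M standard input, $0.625/M cached input.
const decimal inputRate = 2.50m;    // USD per million input tokens
const decimal cachedRate = 0.625m;  // USD per million cached input tokens

int inputTokens = 10_000;
int cachedTokens = 8_000;
int freshTokens = inputTokens - cachedTokens; // 2,000 tokens at the full rate

decimal withCache = freshTokens / 1_000_000m * inputRate
                  + cachedTokens / 1_000_000m * cachedRate;
decimal withoutCache = inputTokens / 1_000_000m * inputRate;

// 2,000 * $2.50/M = $0.005 plus 8,000 * $0.625/M = $0.005, versus $0.025 uncached
Console.WriteLine($"With cache: ${withCache}  Without: ${withoutCache}");
```

At 80% cached tokens the input cost drops by 60%, which is why the prefix-structuring rules later in this part matter so much.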

sequenceDiagram
    participant App as Your Application
    participant GW as OpenAI Gateway
    participant Cache as Prefix Cache
    participant Model as GPT-5.4

    App->>GW: POST /v1/chat/completions
    GW->>Cache: Hash prompt prefix, check cache
    alt Cache Hit
        Cache-->>GW: KV tensors for prefix
        GW->>Model: Process only new tokens
        Model-->>GW: Response
        GW-->>App: Response with cached_tokens N
    else Cache Miss
        Cache-->>GW: No match
        GW->>Model: Process full prompt
        Model->>Cache: Store KV tensors
        Model-->>GW: Response
        GW-->>App: Response with cached_tokens 0
    end

Tool Search: GPT-5.4’s Answer to Agent Token Bloat

One of the most practically significant features in GPT-5.4 for enterprise developers is Tool Search. In earlier models, every API call in an agentic workflow required you to send the full definition of every available tool in the request. For systems with dozens of tools, this could add thousands of tokens to every single request, making caching harder and costs much higher.

Tool Search lets the model look up tool definitions on demand rather than loading them all upfront. You register your tools with the API, and GPT-5.4 fetches only the definitions it needs for each step of a workflow. The result is smaller requests, better cache hit rates on the static portion of your prompt, and lower per-request costs for complex agent systems.

flowchart LR
    subgraph Before["Before Tool Search (GPT-5.2)"]
        direction TB
        P1["System Prompt\n+ ALL 30 Tool Definitions\n+ Conversation History\n+ User Query"]
        P1 --> T1["~8,000 tokens per request\nPoor cache hit rate"]
    end

    subgraph After["After Tool Search (GPT-5.4)"]
        direction TB
        P2["System Prompt\n+ Conversation History\n+ User Query"]
        P2 --> L1["Model looks up\nonly needed tools"]
        L1 --> T2["~2,000 tokens per request\nHigh cache hit rate"]
    end

    style Before fill:#fee2e2,stroke:#ef4444
    style After fill:#dcfce7,stroke:#22c55e

Structuring Prompts for Maximum Cache Hit Rates

Because GPT-5.4’s caching is automatic and based on exact prefix matching, prompt structure is everything. The system can only cache what repeats. If your system message changes between requests, the cache will never warm up. Here are the rules that matter most in practice.

Keep your system message completely static. No timestamps, request IDs, user names, or session variables in the system message. If you need to personalise behaviour, put that in the first user message instead.

Order messages consistently. The cache matches prefixes in the order messages appear in your array. System message first, then any shared context, then conversation history, then the current user message. Never reorder existing messages.

Use consistent serialisation. If you build message content programmatically, make sure the output is deterministic. Avoid serialising objects whose key order might vary between runs or environments.
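For example, with System.Text.Json a Dictionary serialises in insertion order, so the same logical content built from differently ordered sources produces different bytes and a broken prefix match. One way to make it deterministic is to sort keys before serialising (a sketch; the key names are illustrative):

```csharp
using System.Text.Json;

// Same logical content, different insertion order
var a = new Dictionary<string, string> { ["region"] = "eu", ["tier"] = "pro" };
var b = new Dictionary<string, string> { ["tier"] = "pro", ["region"] = "eu" };

// Non-deterministic: output follows insertion order, so these differ
Console.WriteLine(JsonSerializer.Serialize(a) == JsonSerializer.Serialize(b)); // False

// Deterministic: SortedDictionary serialises keys in ordinal order,
// so identical content always yields byte-identical prompt text
static string SerializeStable(Dictionary<string, string> data) =>
    JsonSerializer.Serialize(
        new SortedDictionary<string, string>(data, StringComparer.Ordinal));

Console.WriteLine(SerializeStable(a) == SerializeStable(b)); // True
```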

Keep conversation history append-only. Add new turns to the end of the history array. Never insert, reorder, or modify earlier turns once they have been sent. Any change to the history before the current message breaks the cached prefix.
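Putting the first two rules together: keep the system prompt a compile-time constant and fold all per-request values into the user message, after the cacheable prefix. A minimal sketch (the helper and field names are illustrative, not part of any SDK):

```csharp
// Identical bytes on every request, so the cached prefix stays valid
// across users and sessions
const string SystemPrompt = "You are a support assistant for Contoso products.";

// Per-request personalisation lives in the user message, which sits
// after the cacheable prefix and therefore never invalidates it
static string BuildUserMessage(string userName, string locale, string question) =>
    $"[user: {userName}; locale: {locale}]\n{question}";

Console.WriteLine(SystemPrompt);
Console.WriteLine(BuildUserMessage("dana", "en-GB", "How do I reset my device?"));
```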

C# Production Implementation

Install the OpenAI .NET SDK (the code below uses the official OpenAI package directly; Azure.AI.OpenAI is only needed if you route requests through Azure OpenAI):

dotnet add package OpenAI

Here is the core caching client with full cost tracking:

// GptCacheClient.cs
using OpenAI;
using OpenAI.Chat;
using System.ClientModel;

public class GptCacheClient
{
    private readonly ChatClient _client;

    // GPT-5.4 pricing per million tokens (USD)
    private const decimal InputPricePerMillion = 2.50m;
    private const decimal CachedInputPricePerMillion = 0.625m;
    private const decimal OutputPricePerMillion = 20.00m;

    private readonly CacheMetrics _metrics = new();

    public GptCacheClient(string apiKey)
    {
        var openAiClient = new OpenAIClient(new ApiKeyCredential(apiKey));
        _client = openAiClient.GetChatClient("gpt-5.4");
    }

    public async Task<CacheAwareResponse> CompleteAsync(
        string systemPrompt,
        IEnumerable<ConversationTurn> history,
        string userMessage)
    {
        var messages = BuildMessages(systemPrompt, history, userMessage);

        var response = await _client.CompleteChatAsync(messages);
        var result = response.Value;

        var usage = result.Usage;
        var cachedTokens = usage.InputTokenDetails?.CachedTokenCount ?? 0;
        var inputTokens = usage.InputTokenCount;
        var outputTokens = usage.OutputTokenCount;
        var nonCachedInput = inputTokens - cachedTokens;

        var cost = CalculateCost(nonCachedInput, cachedTokens, outputTokens);
        var costWithoutCache = CalculateCostWithoutCache(inputTokens, outputTokens);

        _metrics.Record(cachedTokens, inputTokens, outputTokens, cost.Total);

        return new CacheAwareResponse
        {
            Content = result.Content[0].Text,
            Usage = new TokenUsage
            {
                InputTokens = inputTokens,
                CachedTokens = cachedTokens,
                NonCachedInputTokens = nonCachedInput,
                OutputTokens = outputTokens,
                CacheHit = cachedTokens > 0,
                CacheHitRate = inputTokens > 0
                    ? (double)cachedTokens / inputTokens * 100
                    : 0
            },
            Cost = new RequestCost
            {
                ActualUsd = cost.Total,
                WithoutCacheUsd = costWithoutCache,
                SavingsUsd = costWithoutCache - cost.Total
            },
            CumulativeMetrics = _metrics.GetSummary()
        };
    }

    private static List<ChatMessage> BuildMessages(
        string systemPrompt,
        IEnumerable<ConversationTurn> history,
        string userMessage)
    {
        var messages = new List<ChatMessage>
        {
            // System message must be fully static for cache hits
            new SystemChatMessage(systemPrompt)
        };

        // Append history in original order - never reorder
        foreach (var turn in history)
        {
            if (turn.Role == "user")
                messages.Add(new UserChatMessage(turn.Content));
            else
                messages.Add(new AssistantChatMessage(turn.Content));
        }

        // Current user message always last
        messages.Add(new UserChatMessage(userMessage));

        return messages;
    }

    private static (decimal Total, decimal Input, decimal Cached, decimal Output) CalculateCost(
        int nonCachedInput, int cachedTokens, int outputTokens)
    {
        var inputCost = nonCachedInput / 1_000_000m * InputPricePerMillion;
        var cachedCost = cachedTokens / 1_000_000m * CachedInputPricePerMillion;
        var outputCost = outputTokens / 1_000_000m * OutputPricePerMillion;
        return (inputCost + cachedCost + outputCost, inputCost, cachedCost, outputCost);
    }

    private static decimal CalculateCostWithoutCache(int inputTokens, int outputTokens)
    {
        return inputTokens / 1_000_000m * InputPricePerMillion
             + outputTokens / 1_000_000m * OutputPricePerMillion;
    }

    public CacheMetricsSummary GetMetrics() => _metrics.GetSummary();
}

public record ConversationTurn(string Role, string Content);

public class CacheAwareResponse
{
    public string Content { get; init; } = string.Empty;
    public TokenUsage Usage { get; init; } = new();
    public RequestCost Cost { get; init; } = new();
    public CacheMetricsSummary CumulativeMetrics { get; init; } = new();
}

public class TokenUsage
{
    public int InputTokens { get; init; }
    public int CachedTokens { get; init; }
    public int NonCachedInputTokens { get; init; }
    public int OutputTokens { get; init; }
    public bool CacheHit { get; init; }
    public double CacheHitRate { get; init; }
}

public class RequestCost
{
    public decimal ActualUsd { get; init; }
    public decimal WithoutCacheUsd { get; init; }
    public decimal SavingsUsd { get; init; }
}

Cache Metrics Tracking

// CacheMetrics.cs
public class CacheMetrics
{
    private int _totalRequests;
    private int _cacheHits;
    private long _totalInputTokens;
    private long _totalCachedTokens;
    private long _totalOutputTokens;
    private decimal _totalCostUsd;
    private decimal _totalSavingsUsd;
    private readonly object _lock = new();

    public void Record(int cachedTokens, int inputTokens, int outputTokens, decimal costUsd)
    {
        lock (_lock)
        {
            _totalRequests++;
            _totalInputTokens += inputTokens;
            _totalCachedTokens += cachedTokens;
            _totalOutputTokens += outputTokens;
            _totalCostUsd += costUsd;

            if (cachedTokens > 0)
                _cacheHits++;

            // Savings = cached tokens billed at the cached rate instead of the full rate
            var cachedPortionAtFull = cachedTokens / 1_000_000m * 2.50m;
            var cachedPortionActual = cachedTokens / 1_000_000m * 0.625m;
            _totalSavingsUsd += cachedPortionAtFull - cachedPortionActual;
        }
    }

    public CacheMetricsSummary GetSummary()
    {
        lock (_lock)
        {
            return new CacheMetricsSummary
            {
                TotalRequests = _totalRequests,
                CacheHits = _cacheHits,
                HitRatePercent = _totalRequests > 0
                    ? Math.Round((double)_cacheHits / _totalRequests * 100, 1)
                    : 0,
                TokenCacheEfficiencyPercent = _totalInputTokens > 0
                    ? Math.Round((double)_totalCachedTokens / _totalInputTokens * 100, 1)
                    : 0,
                TotalCachedTokens = _totalCachedTokens,
                TotalInputTokens = _totalInputTokens,
                TotalCostUsd = Math.Round(_totalCostUsd, 6),
                EstimatedSavingsUsd = Math.Round(_totalSavingsUsd, 6)
            };
        }
    }
}

public class CacheMetricsSummary
{
    public int TotalRequests { get; init; }
    public int CacheHits { get; init; }
    public double HitRatePercent { get; init; }
    public double TokenCacheEfficiencyPercent { get; init; }
    public long TotalCachedTokens { get; init; }
    public long TotalInputTokens { get; init; }
    public decimal TotalCostUsd { get; init; }
    public decimal EstimatedSavingsUsd { get; init; }
}

Multi-Turn Conversation Manager

// CachedConversation.cs
public class CachedConversation
{
    private readonly GptCacheClient _client;
    private readonly string _systemPrompt;
    private readonly List<ConversationTurn> _history = new();
    private decimal _totalSavings;

    public CachedConversation(GptCacheClient client, string systemPrompt)
    {
        _client = client;
        _systemPrompt = systemPrompt;
    }

    public async Task<TurnResult> SendAsync(string userMessage)
    {
        var response = await _client.CompleteAsync(
            _systemPrompt,
            _history,
            userMessage);

        // Append to history for next turn - order must never change
        _history.Add(new ConversationTurn("user", userMessage));
        _history.Add(new ConversationTurn("assistant", response.Content));

        _totalSavings += response.Cost.SavingsUsd;

        return new TurnResult
        {
            Reply = response.Content,
            CacheHit = response.Usage.CacheHit,
            CachedTokens = response.Usage.CachedTokens,
            TurnNumber = _history.Count / 2,
            TurnSavingsUsd = response.Cost.SavingsUsd,
            CumulativeSavingsUsd = _totalSavings,
            Metrics = response.CumulativeMetrics
        };
    }

    public void Reset() => _history.Clear();

    public int TurnCount => _history.Count / 2;
}

public class TurnResult
{
    public string Reply { get; init; } = string.Empty;
    public bool CacheHit { get; init; }
    public int CachedTokens { get; init; }
    public int TurnNumber { get; init; }
    public decimal TurnSavingsUsd { get; init; }
    public decimal CumulativeSavingsUsd { get; init; }
    public CacheMetricsSummary Metrics { get; init; } = new();
}

Tool Search Integration for Agentic Workflows

With Tool Search enabled, you no longer need to include all tool definitions in every request. Here is how to structure an agent that takes advantage of this in C#:

// ToolSearchAgent.cs
using OpenAI;
using OpenAI.Chat;
using System.ClientModel;
using System.Text.Json;

public class ToolSearchAgent
{
    private readonly ChatClient _client;

    // Static system prompt - must never change for cache hits
    private const string SystemPrompt = @"You are an enterprise data analyst assistant.
You have access to tools for querying databases, generating reports, and sending notifications.
Always verify data before presenting results.
Use tools sequentially and explain each step.";

    public ToolSearchAgent(string apiKey)
    {
        var openAi = new OpenAIClient(new ApiKeyCredential(apiKey));
        _client = openAi.GetChatClient("gpt-5.4");
    }

    public async Task<string> RunAsync(string userTask)
    {
        var messages = new List<ChatMessage>
        {
            new SystemChatMessage(SystemPrompt),
            new UserChatMessage(userTask)
        };

        // Tool Search: pass tools but GPT-5.4 only loads
        // definitions for tools it actually needs each step
        var tools = GetToolDefinitions();

        var options = new ChatCompletionOptions();
        foreach (var tool in tools)
            options.Tools.Add(tool);

        // Agentic loop
        while (true)
        {
            var response = await _client.CompleteChatAsync(messages, options);
            var result = response.Value;

            // Log cache efficiency on each agent step
            var cached = result.Usage.InputTokenDetails?.CachedTokenCount ?? 0;
            Console.WriteLine($"[Agent Step] Cached tokens: {cached} / {result.Usage.InputTokenCount}");

            if (result.FinishReason == ChatFinishReason.Stop)
                return result.Content[0].Text;

            if (result.FinishReason == ChatFinishReason.ToolCalls)
            {
                messages.Add(new AssistantChatMessage(result));

                foreach (var toolCall in result.ToolCalls)
                {
                    var toolResult = ExecuteTool(toolCall.FunctionName, toolCall.FunctionArguments);
                    messages.Add(new ToolChatMessage(toolCall.Id, toolResult));
                }
            }
            else break;
        }

        return "Agent completed without a final response.";
    }

    private string ExecuteTool(string name, BinaryData args)
    {
        // In production, dispatch to real implementations
        return name switch
        {
            "query_database" => JsonSerializer.Serialize(new { rows = 142, status = "ok" }),
            "generate_report" => JsonSerializer.Serialize(new { reportId = "RPT-2026-001", url = "https://reports.internal/RPT-2026-001" }),
            "send_notification" => JsonSerializer.Serialize(new { sent = true, recipients = 3 }),
            _ => JsonSerializer.Serialize(new { error = $"Unknown tool: {name}" })
        };
    }

    private static List<ChatTool> GetToolDefinitions() =>
    [
        ChatTool.CreateFunctionTool(
            "query_database",
            "Query the enterprise data warehouse with a SQL-like filter",
            BinaryData.FromString("""
            {
                "type": "object",
                "properties": {
                    "table": { "type": "string", "description": "Table name to query" },
                    "filter": { "type": "string", "description": "Filter condition" },
                    "limit": { "type": "integer", "description": "Max rows to return" }
                },
                "required": ["table"]
            }
            """)),
        ChatTool.CreateFunctionTool(
            "generate_report",
            "Generate a formatted PDF report from query results",
            BinaryData.FromString("""
            {
                "type": "object",
                "properties": {
                    "title": { "type": "string" },
                    "data_source": { "type": "string" },
                    "format": { "type": "string", "enum": ["pdf", "xlsx", "csv"] }
                },
                "required": ["title", "data_source"]
            }
            """)),
        ChatTool.CreateFunctionTool(
            "send_notification",
            "Send a notification to specified recipients",
            BinaryData.FromString("""
            {
                "type": "object",
                "properties": {
                    "message": { "type": "string" },
                    "channel": { "type": "string", "enum": ["email", "slack", "teams"] },
                    "recipients": { "type": "array", "items": { "type": "string" } }
                },
                "required": ["message", "channel", "recipients"]
            }
            """))
    ];
}

Complete Usage Example

// Program.cs
var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY")
    ?? throw new InvalidOperationException("OPENAI_API_KEY not set");

const string SystemPrompt = @"You are a senior cloud architect specialising in Azure and AWS.
Provide precise, production-focused guidance with working code examples.
Always consider security, cost, and operational concerns in your recommendations.
Format responses with clear headings. Be direct and concise.";

var client = new GptCacheClient(apiKey);
var conversation = new CachedConversation(client, SystemPrompt);

var questions = new[]
{
    "What is the best approach for zero-downtime deployments on Azure AKS?",
    "How should I handle secrets rotation in that setup?",
    "What monitoring should I put in place for the deployment pipeline?"
};

foreach (var question in questions)
{
    Console.WriteLine($"\nUser: {question}");
    var result = await conversation.SendAsync(question);

    Console.WriteLine($"Assistant: {result.Reply[..Math.Min(200, result.Reply.Length)]}...");
    Console.WriteLine($"Turn {result.TurnNumber} | Cache hit: {result.CacheHit} | " +
                      $"Cached tokens: {result.CachedTokens} | " +
                      $"Savings: ${result.TurnSavingsUsd:F6}");
}

var metrics = client.GetMetrics();
Console.WriteLine($"\n--- Session Summary ---");
Console.WriteLine($"Hit rate: {metrics.HitRatePercent}%");
Console.WriteLine($"Token cache efficiency: {metrics.TokenCacheEfficiencyPercent}%");
Console.WriteLine($"Total estimated savings: ${metrics.EstimatedSavingsUsd:F4}");

GPT-5.4 vs Claude Sonnet 4.6 Caching: Key Differences

Now that both implementations are on the table, here is where they differ in ways that affect your architecture decisions.

| Feature | GPT-5.4 | Claude Sonnet 4.6 |
| --- | --- | --- |
| Configuration required | None (fully automatic) | Explicit cache_control markers |
| Cache write cost | No extra charge | +25% on cached tokens |
| Cache read discount | 75% off ($0.625 vs $2.50/M) | 90% off |
| Default TTL | 5-10 minutes inactivity | 5 minutes (reset on hit) |
| Extended TTL option | No explicit control | Yes, 1 hour at extra cost |
| Max breakpoints | N/A (automatic) | 4 explicit breakpoints |
| Min prompt size | 1,024 tokens | 1,024 tokens |
| Tool overhead reduction | Yes, via Tool Search | Manual tool list management |
| Developer control | Low (automatic) | High (explicit) |

In practice, GPT-5.4’s automatic approach is easier to adopt and requires no changes to existing codebases. Claude’s explicit approach gives you more predictable cache behaviour and higher read discounts, which matters more at very high token volumes. If you are running a multi-provider architecture, Part 7 of this series covers how to build a unified gateway that handles both approaches transparently.
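The read-discount trade-off can be sketched numerically. The break-even sketch below uses the GPT-5.4 rates from this article and an assumed $3.00/M base input rate for Claude Sonnet 4.6 (an assumption for illustration only; substitute your actual contract pricing):

```csharp
// Per-million-token input cost for a prompt that is written to cache once
// and then read R times. GPT-5.4 rates are from this article; the Claude
// base rate is an ASSUMPTION for illustration - plug in your real pricing.
static decimal GptCost(int reads) =>
    2.50m + reads * 0.625m;                       // no write surcharge, 75% read discount

static decimal ClaudeCost(int reads, decimal baseRate = 3.00m) =>
    baseRate * 1.25m + reads * baseRate * 0.10m;  // +25% write cost, 90% read discount

for (int reads = 1; reads <= 16; reads *= 2)
    Console.WriteLine(
        $"{reads,3} reads  GPT-5.4: ${GptCost(reads):F3}  Claude: ${ClaudeCost(reads):F3}");
```

Under these assumed rates GPT-5.4 is cheaper for the first few reads (no write surcharge), while Claude's deeper read discount wins once a prefix is reused more than roughly four times, which is why read volume drives the architecture decision.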

Common Pitfalls with GPT-5.4 Caching

Injecting dynamic values into the system message. This is the most common mistake. Anything that varies per request, including user locale, feature flags, or A/B test variants, will break the cache if placed in the system message. Move all dynamic personalisation to the user message.

Low traffic volume. If your application makes fewer than one request per minute, cache entries will expire before the next request arrives. Automatic caching only delivers real savings at sufficient request rates. Calculate your expected requests per TTL window before projecting cost reductions.
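A quick way to run that sanity check: with evenly spaced requests, every request after the first lands on a warm cache as long as the gap between requests stays under the TTL. A rough model (real traffic is bursty, so treat this as an upper bound):

```csharp
// Rough check: does the cache stay warm at a given steady request rate?
// Assumes evenly spaced requests; real traffic is bursty, so this is optimistic.
static bool StaysWarm(double requestsPerMinute, double ttlMinutes) =>
    requestsPerMinute >= 1.0 / ttlMinutes;

Console.WriteLine(StaysWarm(0.5, 5)); // True: one request every 2 min, 5 min TTL
Console.WriteLine(StaysWarm(0.1, 5)); // False: one request every 10 min, cache expires
```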

Not reading the cached_tokens field. Without instrumenting the response, you have no visibility into whether caching is actually working. Always log prompt_tokens_details.cached_tokens and track the ratio over time.

Assuming Tool Search is enabled by default for all workloads. Tool Search changes how tool definitions are loaded, which affects how you register and maintain tools in your application. Verify that your tool registration approach is compatible with the API version you are targeting.

What Is Next

Part 4 covers Gemini 3.1 Pro and Flash-Lite context caching. Google’s approach sits between Claude’s explicit control and OpenAI’s automatic model, offering both implicit caching and explicit cache objects with configurable TTLs. It is also the only provider that charges for cache storage, so the cost calculation is more involved. The implementation will be in Python.
