Azure AI Foundry deployments of Claude can quickly become expensive at scale without proper cost management. Understanding the pricing model, implementing intelligent caching, choosing appropriate models, and monitoring usage patterns are essential for sustainable production deployments.
This guide provides actionable strategies for optimizing Claude costs in Azure AI Foundry while maintaining quality and performance.
Understanding Claude Pricing in Azure
Claude models in Azure AI Foundry use token-based pricing, with separate rates for input and output tokens; extended thinking tokens are billed at the output rate. Sonnet 4.5 offers the best balance of capability and cost for most workloads, Haiku 4.5 provides faster, cheaper inference for simpler tasks, and Opus 4.5 delivers maximum capability at premium pricing.
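As a rough worked example, a request with a 2,000-token prompt and a 500-token answer on Sonnet 4.5 costs about a cent and a half at list prices ($3 per million input tokens, $15 per million output tokens; confirm current rates on the Azure pricing page):

```javascript
// Back-of-the-envelope cost estimate at Sonnet 4.5 list prices.
// The rates here are assumptions; check current Azure AI Foundry pricing.
const estimateCost = (inputTokens, outputTokens) =>
  (inputTokens / 1e6) * 3.0 + (outputTokens / 1e6) * 15.0;

console.log(estimateCost(2000, 500)); // ~$0.0135 per request
```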
Cost Optimization Strategies
1. Prompt Caching
Prompt caching can reduce costs by up to 90% and latency by up to 85% for repeated context blocks. Cached prefixes persist for at least 5 minutes (the TTL refreshes on each use), making this ideal for interactive applications with consistent context. Note that cache writes are billed at a 25% premium over base input tokens, while cache reads cost about 10% of the base rate, so caching pays for itself after the first reuse.
```javascript
const optimizedQuery = async (question, documentationContext) => {
  // Cache the large documentation context across calls
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1000,
    system: [
      {
        type: "text",
        text: `You are a technical support agent. Use this documentation:
${documentationContext}`,
        cache_control: { type: "ephemeral" }
      }
    ],
    messages: [{
      role: "user",
      content: question
    }]
  });

  // Subsequent calls within the cache TTL reuse the cached context
  return response;
};
```
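To verify that caching is actually engaging, inspect the usage block on the response. The field names below come from the Anthropic Messages API and are an assumption for Foundry deployments:

```javascript
// Check cache activity on a response (field names per the Anthropic Messages API).
const { usage } = response;
console.log({
  cacheWrites: usage.cache_creation_input_tokens, // tokens written to cache (billed at a premium)
  cacheReads: usage.cache_read_input_tokens,      // tokens read from cache (~10% of base input price)
  freshInput: usage.input_tokens                  // uncached input billed at the full rate
});
```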
2. Model Selection Strategy
Route requests to the appropriate model based on complexity. Use Haiku for simple tasks, Sonnet for standard work, and Opus only when necessary.
```csharp
// Complexity buckets used for routing (assumed enum, not shown in the original).
public enum TaskComplexity { Simple, Standard, Complex }

public class IntelligentModelRouter
{
    public string SelectModel(string prompt, TaskComplexity complexity)
    {
        return complexity switch
        {
            TaskComplexity.Simple => "claude-haiku-4-5",    // ~$1 per MTok input
            TaskComplexity.Standard => "claude-sonnet-4-5", // ~$3 per MTok input
            TaskComplexity.Complex => "claude-opus-4-5",    // ~$5 per MTok input
            _ => "claude-sonnet-4-5"
        };
    }

    public TaskComplexity AnalyzeComplexity(string prompt)
    {
        // Naive heuristics; replace with a classifier or routing model in production.
        if (prompt.Length < 100) return TaskComplexity.Simple;
        if (prompt.Contains("analyze") || prompt.Contains("complex"))
            return TaskComplexity.Complex;
        return TaskComplexity.Standard;
    }
}
```
3. Thinking Budget Control
Extended thinking tokens are billed at the output-token rate and can add significant cost. Set budgets that match task requirements instead of defaulting to the maximum.
```javascript
const costAwareThinking = (taskType) => {
  const budgets = {
    extraction: 1000,      // minimal thinking for data extraction
    classification: 2000,  // light thinking for classification
    analysis: 10000,       // moderate for business analysis
    coding: 50000,         // heavy for complex coding
    research: 128000       // maximum for deep research
  };

  return {
    type: "enabled",
    budget_tokens: budgets[taskType] || 5000
  };
};
```
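A sketch of how these budgets plug into a request; note that budget_tokens must be smaller than max_tokens, so leave headroom for the visible answer (the helper call and prompt here are illustrative):

```javascript
// Illustrative usage of costAwareThinking with headroom for the final answer.
const analyzeReport = async (reportText) => {
  const thinking = costAwareThinking("analysis");
  return client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: thinking.budget_tokens + 4000, // thinking budget + room for output
    thinking,
    messages: [{ role: "user", content: `Analyze this report:\n${reportText}` }]
  });
};
```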
4. Context Window Management
Large context windows increase input token costs. Implement intelligent summarization and context pruning for long conversations.
```python
async def manage_conversation_context(messages, max_tokens=100000):
    current_tokens = estimate_tokens(messages)  # assumes a token-estimation helper
    # Nothing to prune if we are under budget or the history is too short to split
    if current_tokens <= max_tokens or len(messages) <= 11:
        return messages

    # Keep the system prompt and the ten most recent messages
    system_msg = messages[0]
    recent_msgs = messages[-10:]
    middle_msgs = messages[1:-10]

    # Summarize the middle section with Haiku (cheaper than Sonnet)
    summary = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation: {middle_msgs}"
        }]
    )

    return [
        system_msg,
        {"role": "user", "content": f"[Previous context: {summary.content[0].text}]"},
        *recent_msgs
    ]
```
5. Batch Processing
Process multiple items in a single request when possible to reduce per-call overhead. For asynchronous, non-interactive workloads, the Message Batches API (where available) offers discounted token pricing on top of this.
```javascript
// Instead of processing items individually (N API calls)...
for (const item of items) {
  await processItem(item);
}

// ...batch them into a single request
const batchedResults = await batchProcess(items);

async function batchProcess(items) {
  const prompt = `Process these items and return JSON array:
${items.map((item, idx) => `Item ${idx}: ${item}`).join('\n')}
Return: [{"id": 0, "result": "..."}, ...]`;

  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 4000,
    messages: [{ role: "user", content: prompt }]
  });

  // Assumes the model returns bare JSON; add error handling in production
  return JSON.parse(response.content[0].text);
}
```

Cost Monitoring Implementation
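Per-request cost tracking ties the strategies above together. The C# sketch below is one way to do it: the Usage and CostMetrics types are assumed, the pricing table reflects list rates at the time of writing, and totals flow to Application Insights for dashboards and alerts.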
```csharp
using System.Collections.Generic;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;

// Usage and CostMetrics are assumed DTOs exposing token counts and cost fields.
public class CostTracker
{
    private readonly TelemetryClient _telemetryClient;

    // USD per million tokens; verify against current Azure pricing for your region.
    private readonly Dictionary<string, decimal> _modelPricing = new()
    {
        ["claude-haiku-4-5-input"] = 1.00m,
        ["claude-haiku-4-5-output"] = 5.00m,
        ["claude-sonnet-4-5-input"] = 3.00m,
        ["claude-sonnet-4-5-output"] = 15.00m,
        ["claude-opus-4-5-input"] = 5.00m,
        ["claude-opus-4-5-output"] = 25.00m
    };

    public CostTracker(TelemetryClient telemetryClient) => _telemetryClient = telemetryClient;

    public CostMetrics CalculateCost(string model, Usage usage)
    {
        var inputCost = (usage.InputTokens / 1_000_000m) *
            _modelPricing[$"{model}-input"];
        var outputCost = (usage.OutputTokens / 1_000_000m) *
            _modelPricing[$"{model}-output"];
        // Thinking tokens are billed at the output rate
        var thinkingCost = (usage.ThinkingTokens / 1_000_000m) *
            _modelPricing[$"{model}-output"];

        return new CostMetrics
        {
            Model = model,
            InputCost = inputCost,
            OutputCost = outputCost,
            ThinkingCost = thinkingCost,
            TotalCost = inputCost + outputCost + thinkingCost,
            // Cache reads are billed at roughly 10% of the input rate
            CacheReadCost = (usage.CacheReadTokens / 1_000_000m) *
                (_modelPricing[$"{model}-input"] * 0.1m)
        };
    }

    public void LogMetrics(string requestId, CostMetrics metrics)
    {
        // Log to Application Insights
        _telemetryClient.TrackMetric(new MetricTelemetry
        {
            Name = "AI.Cost.Total",
            Sum = (double)metrics.TotalCost,
            Properties = {
                ["RequestId"] = requestId,
                ["Model"] = metrics.Model
            }
        });
    }
}
```

Best Practices Summary
- Enable prompt caching for repeated context
- Route to appropriate models based on complexity
- Set thinking budgets based on task requirements
- Implement context window management
- Batch process when possible
- Monitor costs per request type
- Use Haiku for high-volume, simple tasks
- Reserve Opus for truly complex scenarios
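As a closing sketch, these practices combine naturally into a single request path. Everything here is illustrative: selectModel, analyzeComplexity, costAwareThinking, and logCost stand in for the earlier examples.

```javascript
// Illustrative request path combining routing, caching, and thinking budgets.
const costOptimizedRequest = async (prompt, sharedContext) => {
  const complexity = analyzeComplexity(prompt);   // heuristic from the routing section
  const model = selectModel(complexity);          // Haiku / Sonnet / Opus
  const thinking = costAwareThinking("analysis"); // task-appropriate budget

  const response = await client.messages.create({
    model,
    max_tokens: thinking.budget_tokens + 2000,
    thinking,
    system: [{
      type: "text",
      text: sharedContext,
      cache_control: { type: "ephemeral" }        // cache shared context across calls
    }],
    messages: [{ role: "user", content: prompt }]
  });

  logCost(model, response.usage);                 // feed the cost tracker
  return response;
};
```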
Conclusion
Cost optimization for Claude in Azure AI Foundry requires strategic use of caching, intelligent model selection, careful thinking-budget management, and continuous monitoring. Applied together, these techniques can reduce costs substantially, in cache-heavy and well-routed workloads often by 70-90%, while maintaining quality.
