Training costs dominated the AI conversation in 2024. In 2026, inference is the bill that lands on the CFO’s desk. By most enterprise estimates, inference now accounts for 85% of the total AI operations budget — and unlike training, inference scales with every user request, every agentic loop iteration, and every RAG context window that gets stuffed with retrieved documents.
The problem compounds with agentic architectures. A single autonomous agent executing a complex task can consume 1.5 million tokens in one run because every reasoning turn re-injects the entire conversation history. Deploy a fleet of agents and that cost scales across every concurrent session. A runaway agent stuck in an error correction loop can burn through a monthly token budget in hours, and without real-time cost governance, the first signal you get is the end-of-month invoice.
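To put that in dollars: at the gpt-4o list rates used later in this post ($2.50 per million input tokens, $10.00 per million output), a single 1.5-million-token agent run costs nearly five dollars. The 90/10 input/output split below is an illustrative assumption — agentic loops skew heavily toward input because the history is re-injected on every turn:

```python
# Back-of-envelope cost of one 1.5M-token agent run at gpt-4o list rates
# ($2.50 / 1M input, $10.00 / 1M output). The 90/10 input/output split is an
# illustrative assumption, not a measured figure.
INPUT_RATE_PER_TOKEN = 2.50 / 1_000_000
OUTPUT_RATE_PER_TOKEN = 10.00 / 1_000_000

def run_cost(total_tokens: int, input_share: float = 0.9) -> float:
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return input_tokens * INPUT_RATE_PER_TOKEN + output_tokens * OUTPUT_RATE_PER_TOKEN

single_run = run_cost(1_500_000)  # ~ $4.88 per run, before retries
```

Multiply that by a fleet of concurrent sessions and the need for per-run cost visibility becomes obvious.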
This post builds the cost governance layer: per-feature and per-tenant cost attribution, real-time token budget enforcement with hard limits, spend anomaly detection, and the key optimization levers — semantic caching, model routing, and prompt compression — that reduce spend without degrading the user experience.
The Three FinOps Principles Applied to LLMs
The FinOps Foundation defines three core principles: visibility, accountability, and optimization. Each translates directly to LLM cost governance.
Visibility means you can answer “how much did feature X cost today?” down to the model, the prompt version, and the tenant. Without tagging every API call with structured metadata, your only view is the monthly provider invoice split by model — which tells you almost nothing about where to intervene.
Accountability means cost is attributed to the team or product area that drove it. When a new feature launches and token spend spikes, the feature team should be the first to know — not finance three weeks later. This requires cost allocation by feature label, not just by model.
Optimization means continuously reducing cost-per-unit-of-value, not just total spend. The right question is not “how do we spend less on LLMs?” but “what is the cost of resolving one customer support ticket, and is that cost sustainable?” Shifting that metric to business units moves the conversation from infrastructure cost-cutting to product economics.
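The optimization principle reduces to one small calculation, shown here with illustrative numbers:

```python
# Unit economics in miniature: cost per business outcome rather than total
# spend. The dollar figures are illustrative assumptions.
def cost_per_outcome(total_llm_cost_usd: float, outcomes: int) -> float:
    """E.g. cost per resolved support ticket or per completed agent task."""
    if outcomes == 0:
        return float("inf")
    return total_llm_cost_usd / outcomes

# A feature that spent $420 to resolve 1,400 tickets runs at $0.30 per ticket.
# Whether that is sustainable is a product question, not an infrastructure one.
ticket_cost = cost_per_outcome(420.0, 1400)
```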
Cost Governance Architecture
flowchart TD
A[LLM API Call] --> B[Cost Middleware\nTag: feature, tenant, model, version]
B --> C[Real-time Cost Counter\nPrometheus + Redis]
B --> D[Span Attributes\nOpenTelemetry]
C --> E{Budget Check\nPer-tenant or per-feature}
E -->|Under budget| F[Allow Request]
E -->|Over soft limit| G[Throttle + Alert]
E -->|Over hard limit| H[Block Request\nReturn 429]
C --> I[Cost Aggregator\nRolling 1h / 24h / 30d]
I --> J{Anomaly Detection\nSpend rate spike?}
J -->|Normal| K[Dashboard\nGrafana cost panels]
J -->|Spike detected| L[Alert\nPagerDuty / Slack]
D --> M[Langfuse\nCost per trace]
M --> N[Cost Attribution Report\nPer feature, tenant, prompt version]
F --> O[LLM Provider\nOpenAI / Anthropic / Azure]
O --> P[Response + Usage\ninput_tokens + output_tokens]
P --> C
style E fill:#1e3a5f,color:#ffffff
style J fill:#1e3a5f,color:#ffffff
style H fill:#5a1e1e,color:#ffffff
style L fill:#5a3a1e,color:#ffffff
Token Cost Reference Table (2026)
Accurate cost tracking requires per-model pricing. These are the rates used in the code examples below — verify against current provider pricing pages as rates change.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
|---|---|---|---|
| gpt-4o | $2.50 | $10.00 | Complex reasoning, high-stakes outputs |
| gpt-4o-mini | $0.15 | $0.60 | High-volume classification, routing, simple tasks |
| claude-sonnet-4-6 | $3.00 | $15.00 | Long-context, instruction-following |
| claude-haiku-4-5 | $0.80 | $4.00 | Fast, cost-effective completion tasks |
| text-embedding-3-small | $0.02 | n/a | High-volume RAG query embedding |
| text-embedding-3-large | $0.13 | n/a | High-precision semantic search |
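The table becomes concrete when you price the same request on each model. The sketch below assumes an illustrative RAG-style workload of 4,000 input tokens of retrieved context plus 500 output tokens, at the table's rates:

```python
# Price one RAG-style request (4,000 input + 500 output tokens) on each
# chat model from the table above. Rates mirror the table; verify against
# current provider pricing pages before relying on them.
PRICING_PER_1M = {
    "gpt-4o":            {"input": 2.50, "output": 10.00},
    "gpt-4o-mini":       {"input": 0.15, "output": 0.60},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-haiku-4-5":  {"input": 0.80, "output": 4.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING_PER_1M[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

for model in PRICING_PER_1M:
    print(f"{model}: ${request_cost(model, 4000, 500):.6f}")
# gpt-4o comes to $0.015000 per request; gpt-4o-mini to $0.000900
```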
Node.js Implementation: Cost Middleware
// cost-middleware.js
const client = require('prom-client');
const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL);
// Model pricing map (per token)
const MODEL_PRICING = {
'gpt-4o': { input: 0.0000025, output: 0.00001 },
'gpt-4o-mini': { input: 0.00000015, output: 0.0000006 },
'claude-sonnet-4-6': { input: 0.000003, output: 0.000015 },
'claude-haiku-4-5': { input: 0.0000008, output: 0.000004 },
};
// Prometheus metrics
const tokenCostCounter = new client.Counter({
name: 'llm_cost_usd_total',
help: 'Cumulative LLM cost in USD',
labelNames: ['model', 'feature', 'tenant', 'token_type'],
});
const tokenUsageCounter = new client.Counter({
name: 'llm_tokens_total',
help: 'Cumulative token usage',
labelNames: ['model', 'feature', 'tenant', 'token_type'],
});
const budgetExceededCounter = new client.Counter({
name: 'llm_budget_exceeded_total',
help: 'Number of requests blocked or throttled by budget enforcement',
labelNames: ['tenant', 'feature', 'limit_type'],
});
/**
* Calculate cost for a completed LLM call and record metrics.
*/
function recordCost({ model, feature, tenant, inputTokens, outputTokens }) {
const pricing = MODEL_PRICING[model];
if (!pricing) return 0;
const inputCost = inputTokens * pricing.input;
const outputCost = outputTokens * pricing.output;
const totalCost = inputCost + outputCost;
// Prometheus counters
tokenCostCounter.inc({ model, feature, tenant, token_type: 'input' }, inputCost);
tokenCostCounter.inc({ model, feature, tenant, token_type: 'output' }, outputCost);
tokenUsageCounter.inc({ model, feature, tenant, token_type: 'input' }, inputTokens);
tokenUsageCounter.inc({ model, feature, tenant, token_type: 'output' }, outputTokens);
// Rolling spend in Redis for real-time budget enforcement
const hourKey = `spend:${tenant}:${feature}:${new Date().toISOString().slice(0, 13)}`;
const dayKey = `spend:${tenant}:${feature}:${new Date().toISOString().slice(0, 10)}`;
// Fire-and-forget writes; hourly keys must outlive the 7-day anomaly-detection
// baseline (8 days), daily keys only need to cover today plus one day
redis.incrbyfloat(hourKey, totalCost).then(() => redis.expire(hourKey, 8 * 86400));
redis.incrbyfloat(dayKey, totalCost).then(() => redis.expire(dayKey, 172800));
return totalCost;
}
/**
* Budget enforcement gate -- call BEFORE making the LLM API request.
* Returns { allowed: bool, reason: string }
*/
async function checkBudget(tenant, feature, budgets) {
const dayKey = `spend:${tenant}:${feature}:${new Date().toISOString().slice(0, 10)}`;
const spent = parseFloat(await redis.get(dayKey) || '0');
const tenantBudget = budgets[tenant] || budgets['default'] || { soft: 5, hard: 10 };
if (spent >= tenantBudget.hard) {
budgetExceededCounter.inc({ tenant, feature, limit_type: 'hard' });
return { allowed: false, reason: `Daily hard limit of $${tenantBudget.hard} reached for tenant ${tenant}` };
}
if (spent >= tenantBudget.soft) {
budgetExceededCounter.inc({ tenant, feature, limit_type: 'soft' });
// Throttle: still allow but log warning -- optionally downgrade model here
console.warn(`[BudgetSoft] tenant=${tenant} feature=${feature} spent=$${spent.toFixed(4)} limit=$${tenantBudget.soft}`);
}
return { allowed: true, reason: null };
}
module.exports = { recordCost, checkBudget };
Using the middleware in a request handler:
// chat-handler.js
const { checkBudget, recordCost } = require('./cost-middleware');
const OpenAI = require('openai');
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
// Tenant budget config -- load from database or config service in production
const BUDGETS = {
'tenant-free': { soft: 1.00, hard: 2.00 },
'tenant-pro': { soft: 20.00, hard: 50.00 },
'tenant-enterprise': { soft: 200.00, hard: 500.00 },
'default': { soft: 5.00, hard: 10.00 },
};
async function handleChat(req, res) {
const { message } = req.body;
const tenant = req.headers['x-tenant-id'] || 'default';
const feature = 'customer-support';
const model = 'gpt-4o';
// Check budget BEFORE the API call
const { allowed, reason } = await checkBudget(tenant, feature, BUDGETS);
if (!allowed) {
return res.status(429).json({ error: reason });
}
const completion = await openai.chat.completions.create({
model,
messages: [{ role: 'user', content: message }],
});
// Record actual cost AFTER the API call returns usage
const cost = recordCost({
model,
feature,
tenant,
inputTokens: completion.usage.prompt_tokens,
outputTokens: completion.usage.completion_tokens,
});
res.json({
response: completion.choices[0].message.content,
cost_usd: cost.toFixed(6),
});
}
Python Implementation
# cost_middleware.py
import os
from datetime import datetime, timezone
from prometheus_client import Counter
from redis import asyncio as aioredis
MODEL_PRICING = {
"gpt-4o": {"input": 0.0000025, "output": 0.00001},
"gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
"claude-sonnet-4-6": {"input": 0.000003, "output": 0.000015},
"claude-haiku-4-5": {"input": 0.0000008, "output": 0.000004},
}
token_cost_counter = Counter(
"llm_cost_usd_total", "Cumulative LLM cost in USD",
["model", "feature", "tenant", "token_type"],
)
token_usage_counter = Counter(
"llm_tokens_total", "Cumulative token usage",
["model", "feature", "tenant", "token_type"],
)
budget_exceeded_counter = Counter(
"llm_budget_exceeded_total", "Requests blocked by budget enforcement",
["tenant", "feature", "limit_type"],
)
redis_client = aioredis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))
def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
pricing = MODEL_PRICING.get(model)
if not pricing:
return 0.0
return (input_tokens * pricing["input"]) + (output_tokens * pricing["output"])
async def record_cost(model: str, feature: str, tenant: str,
input_tokens: int, output_tokens: int) -> float:
pricing = MODEL_PRICING.get(model)
if not pricing:
return 0.0
input_cost = input_tokens * pricing["input"]
output_cost = output_tokens * pricing["output"]
total = input_cost + output_cost
token_cost_counter.labels(model=model, feature=feature, tenant=tenant, token_type="input").inc(input_cost)
token_cost_counter.labels(model=model, feature=feature, tenant=tenant, token_type="output").inc(output_cost)
token_usage_counter.labels(model=model, feature=feature, tenant=tenant, token_type="input").inc(input_tokens)
token_usage_counter.labels(model=model, feature=feature, tenant=tenant, token_type="output").inc(output_tokens)
now = datetime.now(timezone.utc)
hour_key = f"spend:{tenant}:{feature}:{now.strftime('%Y-%m-%dT%H')}"
day_key = f"spend:{tenant}:{feature}:{now.strftime('%Y-%m-%d')}"
pipe = redis_client.pipeline()
pipe.incrbyfloat(hour_key, total)
# Hourly keys must outlive the 7-day anomaly-detection baseline below
pipe.expire(hour_key, 8 * 86400)
pipe.incrbyfloat(day_key, total)
pipe.expire(day_key, 172800)
await pipe.execute()
return total
async def check_budget(tenant: str, feature: str, budgets: dict) -> tuple[bool, str | None]:
now = datetime.now(timezone.utc)
day_key = f"spend:{tenant}:{feature}:{now.strftime('%Y-%m-%d')}"
spent = float(await redis_client.get(day_key) or 0)
limits = budgets.get(tenant, budgets.get("default", {"soft": 5.0, "hard": 10.0}))
if spent >= limits["hard"]:
budget_exceeded_counter.labels(tenant=tenant, feature=feature, limit_type="hard").inc()
return False, f"Daily hard limit of ${limits['hard']} reached for tenant {tenant}"
if spent >= limits["soft"]:
budget_exceeded_counter.labels(tenant=tenant, feature=feature, limit_type="soft").inc()
print(f"[BudgetSoft] tenant={tenant} feature={feature} spent=${spent:.4f} limit=${limits['soft']}")
return True, None
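The soft-limit branch above only logs a warning; the Node version notes you can "optionally downgrade model here". A minimal sketch of that policy follows — the fallback mapping is an illustrative assumption, not a fixed recommendation:

```python
# Soft-limit model downgrade: once a tenant crosses its soft budget, serve a
# cheaper model instead of blocking. The DOWNGRADE mapping is illustrative.
DOWNGRADE = {
    "gpt-4o": "gpt-4o-mini",
    "claude-sonnet-4-6": "claude-haiku-4-5",
}

def choose_model(requested: str, spent_today_usd: float, soft_limit_usd: float) -> str:
    """Serve the requested model while under budget; downgrade past the soft limit."""
    if spent_today_usd >= soft_limit_usd and requested in DOWNGRADE:
        return DOWNGRADE[requested]
    return requested
```

Wiring this into `check_budget` gives a graceful degradation path between "full quality" and a hard 429.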
C# Implementation
// CostMiddleware.cs
using Prometheus;
using StackExchange.Redis;
public class CostMiddleware
{
private static readonly Dictionary<string, (double Input, double Output)> ModelPricing = new()
{
["gpt-4o"] = (0.0000025, 0.00001),
["gpt-4o-mini"] = (0.00000015, 0.0000006),
["claude-sonnet-4-6"] = (0.000003, 0.000015),
["claude-haiku-4-5"] = (0.0000008, 0.000004),
};
private static readonly Counter TokenCostCounter = Metrics.CreateCounter(
"llm_cost_usd_total", "Cumulative LLM cost in USD",
new CounterConfiguration { LabelNames = ["model", "feature", "tenant", "token_type"] });
private static readonly Counter TokenUsageCounter = Metrics.CreateCounter(
"llm_tokens_total", "Cumulative token usage",
new CounterConfiguration { LabelNames = ["model", "feature", "tenant", "token_type"] });
private static readonly Counter BudgetExceededCounter = Metrics.CreateCounter(
"llm_budget_exceeded_total", "Requests blocked by budget enforcement",
new CounterConfiguration { LabelNames = ["tenant", "feature", "limit_type"] });
private readonly IDatabase _redis;
public CostMiddleware(IConnectionMultiplexer redis) => _redis = redis.GetDatabase();
public async Task<double> RecordCostAsync(
string model, string feature, string tenant, int inputTokens, int outputTokens)
{
if (!ModelPricing.TryGetValue(model, out var pricing)) return 0;
var inputCost = inputTokens * pricing.Input;
var outputCost = outputTokens * pricing.Output;
var total = inputCost + outputCost;
TokenCostCounter.WithLabels(model, feature, tenant, "input").Inc(inputCost);
TokenCostCounter.WithLabels(model, feature, tenant, "output").Inc(outputCost);
TokenUsageCounter.WithLabels(model, feature, tenant, "input").Inc(inputTokens);
TokenUsageCounter.WithLabels(model, feature, tenant, "output").Inc(outputTokens);
var now = DateTime.UtcNow;
var hourKey = $"spend:{tenant}:{feature}:{now:yyyy-MM-ddTHH}";
var dayKey = $"spend:{tenant}:{feature}:{now:yyyy-MM-dd}";
await _redis.StringIncrementAsync(hourKey, total);
// Hourly keys must outlive the 7-day anomaly-detection baseline
await _redis.KeyExpireAsync(hourKey, TimeSpan.FromDays(8));
await _redis.StringIncrementAsync(dayKey, total);
await _redis.KeyExpireAsync(dayKey, TimeSpan.FromDays(2));
return total;
}
public async Task<(bool Allowed, string? Reason)> CheckBudgetAsync(
string tenant, string feature,
Dictionary<string, (double Soft, double Hard)> budgets)
{
var dayKey = $"spend:{tenant}:{feature}:{DateTime.UtcNow:yyyy-MM-dd}";
var spentStr = await _redis.StringGetAsync(dayKey);
// Parse with invariant culture -- Redis stores the float in invariant format
var spent = spentStr.HasValue
? double.Parse(spentStr!, System.Globalization.CultureInfo.InvariantCulture)
: 0;
var limits = budgets.TryGetValue(tenant, out var t) ? t
: budgets.TryGetValue("default", out var d) ? d
: (Soft: 5.0, Hard: 10.0);
if (spent >= limits.Hard)
{
BudgetExceededCounter.WithLabels(tenant, feature, "hard").Inc();
return (false, $"Daily hard limit of ${limits.Hard} reached for tenant {tenant}");
}
if (spent >= limits.Soft)
{
BudgetExceededCounter.WithLabels(tenant, feature, "soft").Inc();
Console.WriteLine($"[BudgetSoft] tenant={tenant} feature={feature} spent={spent:F4} limit={limits.Soft}");
}
return (true, null);
}
}
Spend Anomaly Detection
A hard budget limit stops runaway spend but does not catch gradual drift — a feature that doubles its cost week over week without triggering any single hard limit. Anomaly detection on rolling spend rate catches this pattern early.
# anomaly_detector.py
import asyncio
from redis import asyncio as aioredis
from datetime import datetime, timezone, timedelta
redis_client = aioredis.from_url("redis://localhost:6379")
async def detect_spend_anomalies(tenant: str, feature: str, baseline_days: int = 7, spike_multiplier: float = 2.5):
"""
Compare today's hourly spend rate against the rolling baseline.
Fire an alert if today's rate is more than spike_multiplier x the baseline.
"""
now = datetime.now(timezone.utc)
# Collect hourly spend for the past baseline_days
baseline_hourly = []
for day_offset in range(1, baseline_days + 1):
day = now - timedelta(days=day_offset)
for hour in range(24):
key = f"spend:{tenant}:{feature}:{day.strftime('%Y-%m-%d')}T{hour:02d}"
val = await redis_client.get(key)
if val:
baseline_hourly.append(float(val))
if not baseline_hourly:
return # Not enough history yet
avg_hourly = sum(baseline_hourly) / len(baseline_hourly)
# Current hour spend
current_key = f"spend:{tenant}:{feature}:{now.strftime('%Y-%m-%dT%H')}"
current_val = float(await redis_client.get(current_key) or 0)
if avg_hourly > 0 and current_val > avg_hourly * spike_multiplier:
alert = {
"tenant": tenant,
"feature": feature,
"current_hour_spend_usd": round(current_val, 4),
"avg_hourly_baseline_usd": round(avg_hourly, 4),
"multiplier": round(current_val / avg_hourly, 2),
"severity": "critical" if current_val > avg_hourly * 5 else "warning",
}
print(f"[CostAnomaly] {alert}")
# In production: emit to PagerDuty, Slack webhook, or Prometheus alertmanager
# Run as a background job every 5 minutes
async def run_anomaly_loop(tenants: list[str], features: list[str]):
while True:
for tenant in tenants:
for feature in features:
await detect_spend_anomalies(tenant, feature)
await asyncio.sleep(300)
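One weakness of the mean-based baseline above: a single past spike inflates the average and can mask new anomalies. A median baseline is more robust. This is a drop-in variant of the comparison logic, offered as an alternative rather than a replacement:

```python
# Median-based variant of the spike check: resistant to one-off outliers
# in the baseline window.
import statistics

def is_spend_spike(baseline_hourly: list[float], current: float,
                   spike_multiplier: float = 2.5) -> bool:
    if not baseline_hourly:
        return False  # not enough history yet
    baseline = statistics.median(baseline_hourly)
    return baseline > 0 and current > baseline * spike_multiplier

# One old $5.00 hour in an otherwise ~$0.10/hour history:
history = [0.10] * 20 + [5.00]
# The median baseline is $0.10, so a $0.50 hour is flagged (5x the median).
# A mean baseline (~$0.33 here) times 2.5 would be ~$0.83 and miss it.
```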
The Three Cost Reduction Levers
Once you have attribution and anomaly detection in place, you have the data to act on. These three levers reduce cost without reducing capability.
Model routing by task complexity is the highest-leverage change most teams can make. Not every request needs a frontier model. A classification task that routes a support ticket to the correct department costs roughly $0.0002 on gpt-4o-mini and about $0.0035 on gpt-4o at the rates in the table above — more than a 15x difference. Implementing a lightweight classifier that scores query complexity and routes simple requests to cheaper models can cut total spend by 40 to 60% with zero user-facing quality change. The cost data you have collected by feature now tells you which features have the most to gain.
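A deliberately simple version of such a router can be heuristic: short prompts containing classification-style verbs go to the mid-tier model. The keyword list and length threshold below are illustrative assumptions — production routers often use a small classifier model (or gpt-4o-mini itself) to score complexity:

```python
# Heuristic complexity router sketch. Keyword list and length cutoff are
# illustrative assumptions, not a validated routing policy.
SIMPLE_TASK_WORDS = {"classify", "categorize", "route", "label", "extract", "tag"}

def route_model(prompt: str) -> str:
    words = prompt.lower().split()
    is_short = len(words) < 50
    looks_simple = bool(SIMPLE_TASK_WORDS & set(words))
    return "gpt-4o-mini" if (is_short and looks_simple) else "gpt-4o"
```

Even a crude router like this is safe to ship behind a feature flag: log its decisions alongside the cost middleware's attribution data, then compare quality metrics per route before widening it.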
Semantic caching is the most effective way to reduce spend on repeated or near-identical queries. Unlike exact-match caching, semantic caching uses embedding similarity to recognize that “what are your opening hours?” and “when are you open?” are the same question and returns the cached response. Production implementations report 20 to 50% reduction in token volume on high-traffic features, translating directly to a proportional cost reduction at zero quality cost since cached responses were already validated.
Prompt compression reduces input token count without changing output quality. Techniques include removing redundant whitespace and boilerplate from system prompts, compressing retrieved RAG context to the most relevant sentences rather than full document chunks, and using few-shot examples sparingly. A 30% reduction in average input token count on a high-volume feature running at $500/month saves $150/month with no other changes.
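Two of those techniques can be sketched directly — collapsing redundant whitespace, and keeping only the retrieved-context sentences that share vocabulary with the user query. Scoring by word overlap is an illustrative stand-in for a real relevance model:

```python
# Prompt-compression sketch: whitespace collapse plus naive sentence
# selection. Word-overlap scoring is an illustrative stand-in for a real
# relevance model or reranker.
import re

def collapse_whitespace(prompt: str) -> str:
    return re.sub(r"\s+", " ", prompt).strip()

def compress_context(query: str, context: str, keep: int = 3) -> str:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    qwords = set(query.lower().split())
    # Rank sentences by overlap with the query, keep the top `keep`
    scored = sorted(sentences,
                    key=lambda s: len(qwords & set(s.lower().split())),
                    reverse=True)
    kept = set(scored[:keep])
    # Re-emit the kept sentences in their original order
    return " ".join(s for s in sentences if s in kept)
```

Measure the token count before and after with your tokenizer of choice; the savings compound on every request that carries the compressed context.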
Grafana Cost Dashboard Panels
| Panel | PromQL | Alert threshold |
|---|---|---|
| Hourly cost by feature | increase(llm_cost_usd_total[1h]) | 2.5x vs 7-day baseline |
| Daily cost by tenant | increase(llm_cost_usd_total{tenant="$tenant"}[24h]) | 80% of daily hard limit |
| Cost per 1000 requests | increase(llm_cost_usd_total[1h]) / increase(llm_requests_total[1h]) * 1000 | 50% increase vs prior week |
| Budget exceeded rate | rate(llm_budget_exceeded_total[1h]) | Any hard limit hits |
| Model cost distribution | sum by (model) (increase(llm_cost_usd_total[24h])) | Informational |
| Input vs output cost ratio | sum(llm_cost_usd_total{token_type="output"}) / sum(llm_cost_usd_total{token_type="input"}) | Ratio above 5 — output heavy, check max_tokens |
What Comes Next
Cost governance is now a first-class signal alongside quality and latency. In the final post of this series, Part 8, we put every piece together — tracing, metrics, evaluation, prompt management, RAG observability, and cost governance — into a complete LLMOps stack with a reference architecture and a checklist for going from zero to production-grade observability.
Key Takeaways
- Inference now accounts for roughly 85% of enterprise AI operations budgets by most estimates — cost governance is not optional at production scale
- Tag every LLM API call with feature, tenant, model, and prompt version — without attribution you cannot allocate or optimize cost
- Enforce soft limits with throttling and hard limits with request blocking — both should trigger alerts, not just the hard limit
- Store rolling spend in Redis for sub-millisecond budget checks before each API call
- Run spend anomaly detection on hourly rolling averages — a 2.5x spike over the rolling hourly baseline is a sensible starting alert threshold, catching drift before the bill arrives
- Model routing by task complexity is the highest-leverage cost reduction lever — a 15x price difference between frontier and mid-tier models makes even a basic complexity classifier worth building
- Semantic caching can eliminate 20 to 50% of token volume on high-traffic features at zero quality cost
- Measure cost per business outcome (cost per resolved ticket, cost per completed task) not just total token spend
References
- FinOps Foundation – “FinOps for AI Overview” (https://www.finops.org/wg/finops-for-ai-overview/)
- Analytics Week – “Inference Economics: Solving 2026 Enterprise AI Cost Crisis” (https://analyticsweek.com/inference-economics-finops-ai-roi-2026/)
- Finout – “FinOps in the Age of AI: A CPO Guide to LLM Workflows, RAG, and Agentic Systems” (https://www.finout.io/blog/finops-in-the-age-of-ai-a-cpos-guide-to-llm-workflows-rag-ai-agents-and-agentic-systems)
- OneUptime – “How to Build Cost Management for LLM Operations” (https://oneuptime.com/blog/post/2026-01-30-llmops-cost-management/view)
- Traceloop – “From Bills to Budgets: How to Track LLM Token Usage and Cost Per User” (https://www.traceloop.com/blog/from-bills-to-budgets-how-to-track-llm-token-usage-and-cost-per-user)
- LangGuard – “LLM Gateways Are the Critical Infrastructure for AI Agents in 2026” (https://langguard.ai/2026/01/13/invisible-backbone.html)
