In Part 2, we wired up distributed tracing so every step in your LLM pipeline has a traceable identity. Now we need to talk about what to measure across those traces. Because the default metrics your infrastructure hands you — request count, error rate, CPU usage — will all look perfectly normal while your LLM quietly returns wrong answers to thousands of users.
This post covers the metrics layer: what to collect, how to calculate it, and how to build dashboards in Prometheus and Grafana that surface problems before your users report them.
The Two Categories of LLM Metrics
LLM metrics split cleanly into two categories, and you need both to have meaningful production visibility.
The first category is operational metrics. These measure the technical health of your system — latency, throughput, token usage, cost, and error rates. They are fast to collect, deterministic, and work well with standard alerting. Think of these as your vital signs. They tell you when something is wrong with the infrastructure or the pipeline mechanics.
The second category is quality metrics. These measure whether your LLM is actually doing its job — hallucination rates, output relevance, faithfulness to retrieved context, and response drift over time. They are slower to compute, often probabilistic, and require either automated evaluation models or human feedback. These are what catch silent failures — situations where the infrastructure looks healthy but the output is broken.
Most teams instrument operational metrics first and add quality metrics later. That is the right order. But the gap between the two should be measured in weeks, not months.
Operational Metrics: The Foundation Layer
1. Latency — Beyond Simple Response Time
For LLMs, latency has two components that behave differently and matter for different reasons.
Time-to-first-token (TTFT) is the gap between when the request is sent and when the first token of the response arrives. For streaming applications, this is what the user feels as “responsiveness.” A TTFT above 1-2 seconds is noticeable. Above 5 seconds it degrades user experience significantly regardless of how fast the rest of the response streams.
Total generation latency is the full time from request to complete response. For non-streaming use cases this is the only number users see. For batch or backend pipelines it drives throughput calculations.
Always track both at p50, p95, and p99 percentiles. A p50 of 800ms with a p99 of 12 seconds tells a very different story than a p50 of 1.5 seconds with a p99 of 2 seconds. The tail matters because it represents your worst-case user experience and often reveals specific input patterns that stress the system.
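To make the tail concrete, here is a minimal sketch of nearest-rank percentiles in plain Python (no Prometheus dependency; the `percentile` helper and the simulated sample data are illustrative, not part of any library):

```python
def percentile(samples, p):
    """Nearest-rank percentile; p is in [0, 100]."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# 100 simulated total-latency samples (ms): a fast median with a slow tail.
latencies = [800] * 95 + [12_000] * 5
print(percentile(latencies, 50))  # 800
print(percentile(latencies, 95))  # 800
print(percentile(latencies, 99))  # 12000
```

Note how p50 and even p95 look healthy while p99 is fifteen times worse: exactly the situation where tracking only an average or median hides your worst-case user experience.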
2. Token Usage and Cost Per Request
Token cost is your LLM’s equivalent of compute cost in traditional systems, but it has a property that makes it more dangerous to ignore: it compounds silently. A single verbose system prompt, an over-eager retrieval pipeline returning 20 chunks when 5 would do, or an agent loop that runs 3 extra iterations can multiply your cost per request by 2x to 5x with no visible error signal.
Track these token metrics on every request:
- Input tokens — prompt plus context. This is usually your largest cost driver
- Output tokens — completion length. More variable, influenced by max_tokens settings and model behavior
- Total tokens per request — input plus output
- Cost per request — calculated from token counts multiplied by model pricing
- Cost per feature or workflow — aggregated by tagging requests with their origin in your application
Set alerts on average cost per request and on p95 cost. A sudden jump in either usually means a prompt change inflated your context, retrieval is returning too many chunks, or an agent is looping unexpectedly.
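The compounding effect is easy to see in a small sketch. The per-token prices below are illustrative placeholders; check your provider's current pricing page before using real numbers:

```python
# Illustrative per-token prices in USD; real pricing varies by provider and date.
PRICING = {'gpt-4o': {'input': 2.5e-6, 'output': 1e-5}}

def cost_per_request(model, input_tokens, output_tokens):
    p = PRICING.get(model, {'input': 0.0, 'output': 0.0})
    return input_tokens * p['input'] + output_tokens * p['output']

# A retrieval pipeline returning 20 chunks instead of 5 inflates input tokens:
lean = cost_per_request('gpt-4o', 2_000, 500)     # ~5 chunks of context
bloated = cost_per_request('gpt-4o', 8_000, 500)  # ~20 chunks of context
print(f'{lean:.4f} -> {bloated:.4f} USD')  # 0.0100 -> 0.0250 USD
```

Quadrupling the context multiplied the cost 2.5x with no error, no latency alarm, and no visible failure, which is why cost needs its own alerts.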
3. Error Rate and Finish Reason Distribution
Standard error rates (HTTP 4xx, 5xx) catch provider outages and authentication failures. But LLM-specific errors are often hidden inside successful HTTP responses. Track the finish_reason field on every completion:
- stop — normal completion. This is what you want
- length — the model hit your max_tokens limit and was cut off. If this appears frequently, your max_tokens is too low or your prompts are too long
- content_filter — the model or provider safety system blocked the response. Track this rate carefully in production; spikes often indicate prompt injection attempts or policy edge cases
- tool_calls — the model stopped to call a tool. Expected in agentic workflows, but track the ratio to understand agent behavior patterns
A rising length finish reason rate is one of the most useful early warning signals in production. It means the model is consistently running out of space, which degrades output quality in ways users notice but your error rate dashboard does not.
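Computing the rate over a window of recent completions is a one-liner; a minimal sketch (the window data and the 5% threshold are illustrative):

```python
from collections import Counter

def finish_reason_rates(finish_reasons):
    """Share of each finish_reason over a window of requests."""
    total = len(finish_reasons)
    return {reason: n / total for reason, n in Counter(finish_reasons).items()}

# A sliding window of the last 100 completions:
window = ['stop'] * 92 + ['length'] * 6 + ['content_filter'] * 2
rates = finish_reason_rates(window)
print(rates['length'])  # 0.06, above a 5% alert threshold
```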
Quality Metrics: Catching Silent Failures
4. Faithfulness and Groundedness
For RAG applications, faithfulness measures whether the model’s response is actually supported by the retrieved documents. An unfaithful response means the model made up information that was not in the context — the most common form of hallucination in grounded systems.
You cannot measure faithfulness with a simple rule. It requires either an evaluation model (LLM-as-judge, covered in depth in Part 4) or a dedicated evaluation library. The fastest approach in production is to run a lightweight evaluation model on a sample of your traffic — typically 5-10% — and track the faithfulness score distribution over time. A downward trend in average faithfulness score is your earliest signal that something has degraded in your retrieval pipeline or system prompt.
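A practical detail of traffic sampling: hash the request ID rather than calling a random number generator, so the same request always gets the same decision and replays or retries stay consistent. A minimal sketch (the `should_evaluate` helper is illustrative, not a library function):

```python
import hashlib

def should_evaluate(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministic sampling: a given request ID always gets the same decision."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], 'big') / 2**64
    return bucket < sample_rate

sampled = sum(should_evaluate(f'req-{i}') for i in range(10_000))
print(sampled)  # close to 500 (5% of 10,000); the exact count depends on the hash
```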
5. Output Drift Detection
Output drift is what happens when your LLM’s response distribution shifts over time without any deliberate change on your part. It can be caused by a model version update from your provider, changes in user input patterns, database content changes that affect retrieval, or subtle prompt interactions you did not anticipate.
Detecting drift without ground truth labels is the hard part. The practical approaches are:
- Response length distribution tracking — a shift in average or median response length often signals a behavioral change. Easy to compute, surprisingly useful as an early warning
- Embedding similarity to a baseline — embed a sample of recent responses and compare their centroid to a baseline. Distance above a threshold indicates drift
- Sentiment and tone distribution — if your application has a defined tone (professional, friendly, concise), track a simple tone classifier over time
- Refusal rate tracking — track the percentage of responses that are refusals or non-answers. A spike often means the model provider updated safety policies
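The embedding-centroid approach can be sketched in a few lines of plain Python. The toy 2-dimensional vectors and the 0.3 threshold are illustrative only; in practice you would embed real responses with your embedding model and tune the threshold against historical data:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

baseline = centroid([[1.0, 0.0], [0.9, 0.1]])  # embeddings of past responses
recent = centroid([[0.2, 0.9], [0.1, 1.0]])    # embeddings of this week's sample
drift = cosine_distance(baseline, recent)
print(drift > 0.3)  # True -> flag for investigation
```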
The Metrics Architecture
The diagram below shows how operational and quality metrics flow from your LLM pipeline through to dashboards and alerts.
flowchart TD
A[LLM Pipeline Request] --> B[OpenTelemetry Instrumentation]
B --> C[Operational Metrics]
B --> D[Quality Metrics]
C --> C1[TTFT / Total Latency\np50 p95 p99]
C --> C2[Token Usage\nInput / Output / Total]
C --> C3[Cost Per Request\nBy model / feature / env]
C --> C4[Finish Reason Distribution\nstop / length / content_filter]
C --> C5[Error Rate\nHTTP + provider errors]
D --> D1[Faithfulness Score\nSampled eval pipeline]
D --> D2[Relevance Score\nQuery vs response]
D --> D3[Response Length Drift\nDistribution shift]
D --> D4[Refusal Rate\nNon-answer detection]
C1 --> E[Prometheus\nMetrics Store]
C2 --> E
C3 --> E
C4 --> E
C5 --> E
D1 --> F[Eval Backend\nLangfuse / Arize]
D2 --> F
D3 --> F
D4 --> F
E --> G[Grafana Dashboards]
F --> G
G --> H[Alerts\nPagerDuty / Slack]
G --> I[Cost Reports\nWeekly Budget Review]
G --> J[Quality Trends\nWeekly Model Review]
style E fill:#1e3a5f,color:#ffffff
style F fill:#1e3a5f,color:#ffffff
style G fill:#2d4a1e,color:#ffffff
Implementation: Custom Prometheus Metrics in Node.js, Python, and C#
Node.js
npm install prom-client openai
// metrics.js
const promClient = require('prom-client');
// Register default Node.js metrics
promClient.collectDefaultMetrics({ prefix: 'llm_app_' });
// Latency histograms
const ttftHistogram = new promClient.Histogram({
name: 'llm_time_to_first_token_ms',
help: 'Time to first token in milliseconds',
labelNames: ['model', 'feature', 'environment'],
buckets: [100, 250, 500, 1000, 2000, 5000, 10000],
});
const totalLatencyHistogram = new promClient.Histogram({
name: 'llm_total_latency_ms',
help: 'Total LLM response latency in milliseconds',
labelNames: ['model', 'feature', 'environment'],
buckets: [500, 1000, 2000, 5000, 10000, 30000],
});
// Token and cost counters
const inputTokensCounter = new promClient.Counter({
name: 'llm_input_tokens_total',
help: 'Total input tokens consumed',
labelNames: ['model', 'feature'],
});
const outputTokensCounter = new promClient.Counter({
name: 'llm_output_tokens_total',
help: 'Total output tokens generated',
labelNames: ['model', 'feature'],
});
const costCounter = new promClient.Counter({
name: 'llm_cost_usd_total',
help: 'Total LLM cost in USD',
labelNames: ['model', 'feature'],
});
// Finish reason counter
const finishReasonCounter = new promClient.Counter({
name: 'llm_finish_reason_total',
help: 'LLM finish reason distribution',
labelNames: ['model', 'finish_reason'],
});
// Quality score histogram
const faithfulnessHistogram = new promClient.Histogram({
name: 'llm_faithfulness_score',
help: 'Faithfulness score from evaluation pipeline (0-1)',
labelNames: ['model', 'feature'],
buckets: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
});
// Token cost calculation (update pricing per model as needed)
const MODEL_PRICING = {
'gpt-4o': { input: 0.0000025, output: 0.00001 },
'gpt-4o-mini': { input: 0.00000015, output: 0.0000006 },
'claude-sonnet-4-6': { input: 0.000003, output: 0.000015 },
};
function calculateCost(model, inputTokens, outputTokens) {
const pricing = MODEL_PRICING[model] || { input: 0, output: 0 };
return (inputTokens * pricing.input) + (outputTokens * pricing.output);
}
async function recordLLMCall({ model, feature, inputTokens, outputTokens, ttftMs, totalMs, finishReason }) {
const labels = { model, feature, environment: process.env.NODE_ENV || 'production' };
ttftHistogram.observe(labels, ttftMs);
totalLatencyHistogram.observe(labels, totalMs);
inputTokensCounter.inc({ model, feature }, inputTokens);
outputTokensCounter.inc({ model, feature }, outputTokens);
costCounter.inc({ model, feature }, calculateCost(model, inputTokens, outputTokens));
finishReasonCounter.inc({ model, finish_reason: finishReason });
}
function recordFaithfulnessScore(model, feature, score) {
faithfulnessHistogram.observe({ model, feature }, score);
}
module.exports = {
registry: promClient.register,
recordLLMCall,
recordFaithfulnessScore,
};
Expose the metrics endpoint in your Express app:
// app.js
const express = require('express');
const OpenAI = require('openai');
const { registry, recordLLMCall } = require('./metrics');
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const app = express();
// Prometheus scrape endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', registry.contentType);
res.end(await registry.metrics());
});
// Example: wrapping an LLM call with metric recording
app.post('/api/chat', async (req, res) => {
const start = Date.now();
let ttft = null;
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
stream: true,
stream_options: { include_usage: true }, // ask the API to send usage on the final chunk
messages: [{ role: 'user', content: req.body.message }],
});
let fullResponse = '';
let usage = null;
let finishReason = 'unknown';
for await (const chunk of stream) {
if (!ttft && chunk.choices[0]?.delta?.content) {
ttft = Date.now() - start; // capture TTFT on first content chunk
}
fullResponse += chunk.choices[0]?.delta?.content || '';
if (chunk.choices[0]?.finish_reason) {
finishReason = chunk.choices[0].finish_reason;
}
if (chunk.usage) {
usage = chunk.usage; // only present on the final chunk when include_usage is set
}
}
const totalMs = Date.now() - start;
await recordLLMCall({
model: 'gpt-4o',
feature: 'chat',
inputTokens: usage?.prompt_tokens || 0,
outputTokens: usage?.completion_tokens || 0,
ttftMs: ttft || totalMs,
totalMs,
finishReason,
});
res.json({ response: fullResponse });
});
Python
pip install prometheus-client openai
# metrics.py
import os
import time
from prometheus_client import Counter, Histogram, CollectorRegistry, generate_latest, CONTENT_TYPE_LATEST
registry = CollectorRegistry()
ttft_histogram = Histogram(
'llm_time_to_first_token_ms',
'Time to first token in milliseconds',
['model', 'feature', 'environment'],
buckets=[100, 250, 500, 1000, 2000, 5000, 10000],
registry=registry,
)
total_latency_histogram = Histogram(
'llm_total_latency_ms',
'Total LLM response latency in milliseconds',
['model', 'feature', 'environment'],
buckets=[500, 1000, 2000, 5000, 10000, 30000],
registry=registry,
)
input_tokens_counter = Counter(
'llm_input_tokens_total',
'Total input tokens consumed',
['model', 'feature'],
registry=registry,
)
output_tokens_counter = Counter(
'llm_output_tokens_total',
'Total output tokens generated',
['model', 'feature'],
registry=registry,
)
cost_counter = Counter(
'llm_cost_usd_total',
'Total LLM cost in USD',
['model', 'feature'],
registry=registry,
)
finish_reason_counter = Counter(
'llm_finish_reason_total',
'LLM finish reason distribution',
['model', 'finish_reason'],
registry=registry,
)
faithfulness_histogram = Histogram(
'llm_faithfulness_score',
'Faithfulness score from evaluation pipeline (0-1)',
['model', 'feature'],
buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
registry=registry,
)
MODEL_PRICING = {
'gpt-4o': {'input': 0.0000025, 'output': 0.00001},
'gpt-4o-mini': {'input': 0.00000015, 'output': 0.0000006},
}
def record_llm_call(model: str, feature: str, input_tokens: int, output_tokens: int,
ttft_ms: float, total_ms: float, finish_reason: str):
env = os.getenv('ENVIRONMENT', 'production')
labels = {'model': model, 'feature': feature, 'environment': env}
ttft_histogram.labels(**labels).observe(ttft_ms)
total_latency_histogram.labels(**labels).observe(total_ms)
input_tokens_counter.labels(model=model, feature=feature).inc(input_tokens)
output_tokens_counter.labels(model=model, feature=feature).inc(output_tokens)
pricing = MODEL_PRICING.get(model, {'input': 0, 'output': 0})
cost = (input_tokens * pricing['input']) + (output_tokens * pricing['output'])
cost_counter.labels(model=model, feature=feature).inc(cost)
finish_reason_counter.labels(model=model, finish_reason=finish_reason).inc()
def record_faithfulness_score(model: str, feature: str, score: float):
faithfulness_histogram.labels(model=model, feature=feature).observe(score)
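The Python module imports generate_latest but still needs an endpoint that serves it. A minimal sketch of the scrape endpoint using prometheus_client's built-in server; it is shown with an inline registry so it runs standalone, but in the real app you would import `registry` from metrics.py instead:

```python
# expose_metrics.py — serve the custom registry for Prometheus to scrape.
from prometheus_client import CollectorRegistry, Counter, generate_latest, start_http_server

registry = CollectorRegistry()
input_tokens = Counter('llm_input_tokens_total', 'Total input tokens consumed',
                       ['model', 'feature'], registry=registry)
input_tokens.labels(model='gpt-4o', feature='chat').inc(1200)

# generate_latest renders the Prometheus text exposition format:
print(generate_latest(registry).decode())

# In production, serve the registry on its own port instead of printing:
# start_http_server(9100, registry=registry)
```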
C#
dotnet add package prometheus-net
dotnet add package prometheus-net.AspNetCore
// LlmMetrics.cs
using Prometheus;
public static class LlmMetrics
{
private static readonly string[] LabelNames = ["model", "feature", "environment"];
public static readonly Histogram TtftHistogram = Metrics.CreateHistogram(
"llm_time_to_first_token_ms",
"Time to first token in milliseconds",
new HistogramConfiguration
{
LabelNames = LabelNames,
Buckets = [100, 250, 500, 1000, 2000, 5000, 10000]
});
public static readonly Histogram TotalLatencyHistogram = Metrics.CreateHistogram(
"llm_total_latency_ms",
"Total LLM response latency in milliseconds",
new HistogramConfiguration
{
LabelNames = LabelNames,
Buckets = [500, 1000, 2000, 5000, 10000, 30000]
});
public static readonly Counter InputTokensCounter = Metrics.CreateCounter(
"llm_input_tokens_total",
"Total input tokens consumed",
new CounterConfiguration { LabelNames = ["model", "feature"] });
public static readonly Counter OutputTokensCounter = Metrics.CreateCounter(
"llm_output_tokens_total",
"Total output tokens generated",
new CounterConfiguration { LabelNames = ["model", "feature"] });
public static readonly Counter CostCounter = Metrics.CreateCounter(
"llm_cost_usd_total",
"Total LLM cost in USD",
new CounterConfiguration { LabelNames = ["model", "feature"] });
public static readonly Counter FinishReasonCounter = Metrics.CreateCounter(
"llm_finish_reason_total",
"LLM finish reason distribution",
new CounterConfiguration { LabelNames = ["model", "finish_reason"] });
public static readonly Histogram FaithfulnessHistogram = Metrics.CreateHistogram(
"llm_faithfulness_score",
"Faithfulness score from evaluation pipeline (0-1)",
new HistogramConfiguration
{
LabelNames = ["model", "feature"],
Buckets = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
});
private static readonly Dictionary<string, (double Input, double Output)> ModelPricing = new()
{
["gpt-4o"] = (0.0000025, 0.00001),
["gpt-4o-mini"] = (0.00000015, 0.0000006),
};
public static void RecordLlmCall(string model, string feature,
long inputTokens, long outputTokens,
double ttftMs, double totalMs, string finishReason)
{
var env = Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT") ?? "Production";
var labelValues = new[] { model, feature, env };
TtftHistogram.WithLabels(labelValues).Observe(ttftMs);
TotalLatencyHistogram.WithLabels(labelValues).Observe(totalMs);
InputTokensCounter.WithLabels(model, feature).Inc(inputTokens);
OutputTokensCounter.WithLabels(model, feature).Inc(outputTokens);
if (ModelPricing.TryGetValue(model, out var pricing))
{
var cost = (inputTokens * pricing.Input) + (outputTokens * pricing.Output);
CostCounter.WithLabels(model, feature).Inc(cost);
}
FinishReasonCounter.WithLabels(model, finishReason).Inc();
}
}
Register the Prometheus endpoint in Program.cs:
// Program.cs - add to middleware pipeline
app.UseRouting();
app.UseHttpMetrics(); // auto-instruments HTTP request metrics
app.MapMetrics(); // exposes /metrics endpoint for Prometheus scraping
Grafana Dashboard: Essential Panels
Once Prometheus is scraping your metrics endpoint, create a Grafana dashboard with these panels as your baseline. These are the minimum panels that give you meaningful production visibility.
| Panel | Query | Alert Threshold |
|---|---|---|
| TTFT p95 | histogram_quantile(0.95, rate(llm_time_to_first_token_ms_bucket[5m])) | > 3000ms |
| Total latency p99 | histogram_quantile(0.99, rate(llm_total_latency_ms_bucket[5m])) | > 15000ms |
| Cost per hour | increase(llm_cost_usd_total[1h]) | > budget threshold |
| Tokens per request (avg) | sum(rate(llm_input_tokens_total[5m])) / sum(rate(llm_finish_reason_total[5m])) | > 2x baseline |
| Length finish rate | sum(rate(llm_finish_reason_total{finish_reason="length"}[5m])) / sum(rate(llm_finish_reason_total[5m])) | > 5% |
| Faithfulness p10 | histogram_quantile(0.10, rate(llm_faithfulness_score_bucket[30m])) | < 0.6 |
Alert Runbooks: What to Do When Metrics Spike
Alerts without runbooks create on-call fatigue. Here are the first steps to take for each of the key alerts above.
TTFT p95 spike: Check if the provider status page shows degradation. Then look at whether input token count increased around the same time — longer prompts push TTFT up. If neither, check your retrieval layer latency using the traces from Part 2.
Cost per hour spike: Pull the cost breakdown by feature label first. Isolate which feature is driving the increase. Then check that feature’s average input token count — a prompt change or retrieval configuration change is usually the cause.
Length finish rate above 5%: The model is being cut off before it finishes. Increase max_tokens on the affected feature, or investigate whether recent prompt changes added too much context and are consuming the token budget before the response is complete.
Faithfulness score degradation: This almost always points to the retrieval layer. Check whether the embedding model or vector index was recently changed. Run a manual test with a known query and inspect what documents are being retrieved. Part 6 of this series covers RAG pipeline observability in depth.
What Comes Next
You now have a full operational and quality metrics layer feeding into dashboards and alerts. In Part 4, we go deeper into the quality side — building a proper LLM-as-judge evaluation pipeline that runs continuously in production and scores your outputs for faithfulness, relevance, and safety at scale.
Key Takeaways
- LLM metrics split into operational (latency, tokens, cost, error rates) and quality (faithfulness, drift, refusal rate) — you need both
- Track TTFT separately from total latency — they reveal different problems and matter to different stakeholders
- Always track finish reason distribution — a rising length rate is an early warning for output quality degradation
- Cost per request must be a first-class metric with alerting — token costs compound silently and invisibly
- Response length distribution is the cheapest drift detector available — measure it from day one
- Faithfulness score at the p10 percentile is more useful than the average — the tail of low-quality responses is where your worst user experiences live
