In Part 2, we wired up distributed tracing so every step in your LLM pipeline has a traceable identity. Now we need to talk about what to measure across those traces. Because the default metrics your infrastructure hands you — request count, error rate, CPU usage — will all look perfectly normal while your LLM quietly returns wrong answers to thousands of users.
This post covers the metrics layer: what to collect, how to calculate it, and how to build dashboards in Prometheus and Grafana that surface problems before your users report them.
The Two Categories of LLM Metrics
LLM metrics split cleanly into two categories, and you need both to have meaningful production visibility.
The first category is operational metrics. These measure the technical health of your system — latency, throughput, token usage, cost, and error rates. They are fast to collect, deterministic, and work well with standard alerting. Think of these as your vital signs. They tell you when something is wrong with the infrastructure or the pipeline mechanics.
The second category is quality metrics. These measure whether your LLM is actually doing its job — hallucination rates, output relevance, faithfulness to retrieved context, and response drift over time. They are slower to compute, often probabilistic, and require either automated evaluation models or human feedback. These are what catch silent failures — situations where the infrastructure looks healthy but the output is broken.
Most teams instrument operational metrics first and add quality metrics later. That is the right order. But the gap between the two should be measured in weeks, not months.
Operational Metrics: The Foundation Layer
1. Latency — Beyond Simple Response Time
For LLMs, latency has two components that behave differently and matter for different reasons.
Time-to-first-token (TTFT) is the gap between when the request is sent and when the first token of the response arrives. For streaming applications, this is what the user feels as “responsiveness.” A TTFT above 1-2 seconds is noticeable. Above 5 seconds it degrades user experience significantly regardless of how fast the rest of the response streams.
Total generation latency is the full time from request to complete response. For non-streaming use cases this is the only number users see. For batch or backend pipelines it drives throughput calculations.
Always track both at p50, p95, and p99 percentiles. A p50 of 800ms with a p99 of 12 seconds tells a very different story than a p50 of 1.5 seconds with a p99 of 2 seconds. The tail matters because it represents your worst-case user experience and often reveals specific input patterns that stress the system.
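To make the tail concrete, here is a minimal sketch of nearest-rank percentiles in plain Python (no Prometheus dependency; the `percentile` helper and the simulated sample data are illustrative, not part of any library):

```python
def percentile(samples, p):
    """Nearest-rank percentile; p is in [0, 100]."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# 100 simulated total-latency samples (ms): a fast median with a slow tail.
latencies = [800] * 95 + [12_000] * 5
print(percentile(latencies, 50))  # 800
print(percentile(latencies, 95))  # 800
print(percentile(latencies, 99))  # 12000
```

Note how p50 and even p95 look healthy while p99 is fifteen times worse: exactly the situation where tracking only an average or median hides your worst-case user experience.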
2. Token Usage and Cost Per Request
Token cost is your LLM’s equivalent of compute cost in traditional systems, but it has a property that makes it more dangerous to ignore: it compounds silently. A single verbose system prompt, an over-eager retrieval pipeline returning 20 chunks when 5 would do, or an agent loop that runs 3 extra iterations can multiply your cost per request by 2x to 5x with no visible error signal.
Track these token metrics on every request:
- Input tokens — prompt plus context. This is usually your largest cost driver
- Output tokens — completion length. More variable, influenced by max_tokens settings and model behavior
- Total tokens per request — input plus output
- Cost per request — calculated from token counts multiplied by model pricing
- Cost per feature or workflow — aggregated by tagging requests with their origin in your application
Set alerts on average cost per request and on p95 cost. A sudden jump in either usually means a prompt change inflated your context, retrieval is returning too many chunks, or an agent is looping unexpectedly.
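The compounding effect is easy to see in a small sketch. The per-token prices below are illustrative placeholders; check your provider's current pricing page before using real numbers:

```python
# Illustrative per-token prices in USD; real pricing varies by provider and date.
PRICING = {'gpt-4o': {'input': 2.5e-6, 'output': 1e-5}}

def cost_per_request(model, input_tokens, output_tokens):
    p = PRICING.get(model, {'input': 0.0, 'output': 0.0})
    return input_tokens * p['input'] + output_tokens * p['output']

# A retrieval pipeline returning 20 chunks instead of 5 inflates input tokens:
lean = cost_per_request('gpt-4o', 2_000, 500)     # ~5 chunks of context
bloated = cost_per_request('gpt-4o', 8_000, 500)  # ~20 chunks of context
print(f'{lean:.4f} -> {bloated:.4f} USD')  # 0.0100 -> 0.0250 USD
```

Quadrupling the context multiplied the cost 2.5x with no error, no latency alarm, and no visible failure, which is why cost needs its own alerts.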
3. Error Rate and Finish Reason Distribution
Standard error rates (HTTP 4xx, 5xx) catch provider outages and authentication failures. But LLM-specific errors are often hidden inside successful HTTP responses. Track the finish_reason field on every completion:
- stop — normal completion. This is what you want
- length — the model hit your max_tokens limit and was cut off. If this appears frequently, your max_tokens is too low or your prompts are too long
- content_filter — the model or provider safety system blocked the response. Track this rate carefully in production; spikes often indicate prompt injection attempts or policy edge cases
- tool_calls — the model stopped to call a tool. Expected in agentic workflows, but track the ratio to understand agent behavior patterns
A rising length finish reason rate is one of the most useful early warning signals in production. It means the model is consistently running out of space, which degrades output quality in ways users notice but your error rate dashboard does not.
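Computing the rate over a window of recent completions is a one-liner; a minimal sketch (the window data and the 5% threshold are illustrative):

```python
from collections import Counter

def finish_reason_rates(finish_reasons):
    """Share of each finish_reason over a window of requests."""
    total = len(finish_reasons)
    return {reason: n / total for reason, n in Counter(finish_reasons).items()}

# A sliding window of the last 100 completions:
window = ['stop'] * 92 + ['length'] * 6 + ['content_filter'] * 2
rates = finish_reason_rates(window)
print(rates['length'])  # 0.06, above a 5% alert threshold
```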
Quality Metrics: Catching Silent Failures
4. Faithfulness and Groundedness
For RAG applications, faithfulness measures whether the model’s response is actually supported by the retrieved documents. An unfaithful response means the model made up information that was not in the context — the most common form of hallucination in grounded systems.
You cannot measure faithfulness with a simple rule. It requires either an evaluation model (LLM-as-judge, covered in depth in Part 4) or a dedicated evaluation library. The fastest approach in production is to run a lightweight evaluation model on a sample of your traffic — typically 5-10% — and track the faithfulness score distribution over time. A downward trend in average faithfulness score is your earliest signal that something has degraded in your retrieval pipeline or system prompt.
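A practical detail of traffic sampling: hash the request ID rather than calling a random number generator, so the same request always gets the same decision and replays or retries stay consistent. A minimal sketch (the `should_evaluate` helper is illustrative, not a library function):

```python
import hashlib

def should_evaluate(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministic sampling: a given request ID always gets the same decision."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], 'big') / 2**64
    return bucket < sample_rate

sampled = sum(should_evaluate(f'req-{i}') for i in range(10_000))
print(sampled)  # close to 500 (5% of 10,000); the exact count depends on the hash
```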
5. Output Drift Detection
Output drift is what happens when your LLM’s response distribution shifts over time without any deliberate change on your part. It can be caused by a model version update from your provider, changes in user input patterns, database content changes that affect retrieval, or subtle prompt interactions you did not anticipate.
Detecting drift without ground truth labels is the hard part. The practical approaches are:
- Response length distribution tracking — a shift in average or median response length often signals a behavioral change. Easy to compute, surprisingly useful as an early warning
- Embedding similarity to a baseline — embed a sample of recent responses and compare their centroid to a baseline. Distance above a threshold indicates drift
- Sentiment and tone distribution — if your application has a defined tone (professional, friendly, concise), track a simple tone classifier over time
- Refusal rate tracking — track the percentage of responses that are refusals or non-answers. A spike often means the model provider updated safety policies
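The embedding-centroid approach can be sketched in a few lines of plain Python. The toy 2-dimensional vectors and the 0.3 threshold are illustrative only; in practice you would embed real responses with your embedding model and tune the threshold against historical data:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

baseline = centroid([[1.0, 0.0], [0.9, 0.1]])  # embeddings of past responses
recent = centroid([[0.2, 0.9], [0.1, 1.0]])    # embeddings of this week's sample
drift = cosine_distance(baseline, recent)
print(drift > 0.3)  # True -> flag for investigation
```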
The Metrics Architecture
The diagram below shows how operational and quality metrics flow from your LLM pipeline through to dashboards and alerts.
flowchart TD
A[LLM Pipeline Request] --> B[OpenTelemetry Instrumentation]
B --> C[Operational Metrics]
B --> D[Quality Metrics]
C --> C1[TTFT / Total Latency\np50 p95 p99]
C --> C2[Token Usage\nInput / Output / Total]
C --> C3[Cost Per Request\nBy model / feature / env]
C --> C4[Finish Reason Distribution\nstop / length / content_filter]
C --> C5[Error Rate\nHTTP + provider errors]
D --> D1[Faithfulness Score\nSampled eval pipeline]
D --> D2[Relevance Score\nQuery vs response]
D --> D3[Response Length Drift\nDistribution shift]
D --> D4[Refusal Rate\nNon-answer detection]
C1 --> E[Prometheus\nMetrics Store]
C2 --> E
C3 --> E
C4 --> E
C5 --> E
D1 --> F[Eval Backend\nLangfuse / Arize]
D2 --> F
D3 --> F
D4 --> F
E --> G[Grafana Dashboards]
F --> G
G --> H[Alerts\nPagerDuty / Slack]
G --> I[Cost Reports\nWeekly Budget Review]
G --> J[Quality Trends\nWeekly Model Review]
style E fill:#1e3a5f,color:#ffffff
style F fill:#1e3a5f,color:#ffffff
style G fill:#2d4a1e,color:#ffffff
Implementation: Custom Prometheus Metrics in Node.js, Python, and C#
Node.js
npm install prom-client openai
// metrics.js
const promClient = require('prom-client');
// Register default Node.js metrics
promClient.collectDefaultMetrics({ prefix: 'llm_app_' });
// Latency histograms
const ttftHistogram = new promClient.Histogram({
name: 'llm_time_to_first_token_ms',
help: 'Time to first token in milliseconds',
labelNames: ['model', 'feature', 'environment'],
buckets: [100, 250, 500, 1000, 2000, 5000, 10000],
});
const totalLatencyHistogram = new promClient.Histogram({
name: 'llm_total_latency_ms',
help: 'Total LLM response latency in milliseconds',
labelNames: ['model', 'feature', 'environment'],
buckets: [500, 1000, 2000, 5000, 10000, 30000],
});
// Token and cost counters
const inputTokensCounter = new promClient.Counter({
name: 'llm_input_tokens_total',
help: 'Total input tokens consumed',
labelNames: ['model', 'feature'],
});
const outputTokensCounter = new promClient.Counter({
name: 'llm_output_tokens_total',
help: 'Total output tokens generated',
labelNames: ['model', 'feature'],
});
const costCounter = new promClient.Counter({
name: 'llm_cost_usd_total',
help: 'Total LLM cost in USD',
labelNames: ['model', 'feature'],
});
// Finish reason counter
const finishReasonCounter = new promClient.Counter({
name: 'llm_finish_reason_total',
help: 'LLM finish reason distribution',
labelNames: ['model', 'finish_reason'],
});
// Quality score histogram
const faithfulnessHistogram = new promClient.Histogram({
name: 'llm_faithfulness_score',
help: 'Faithfulness score from evaluation pipeline (0-1)',
labelNames: ['model', 'feature'],
buckets: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
});
// Token cost calculation (update pricing per model as needed)
const MODEL_PRICING = {
'gpt-4o': { input: 0.0000025, output: 0.00001 },
'gpt-4o-mini': { input: 0.00000015, output: 0.0000006 },
'claude-sonnet-4-6': { input: 0.000003, output: 0.000015 },
};
function calculateCost(model, inputTokens, outputTokens) {
const pricing = MODEL_PRICING[model] || { input: 0, output: 0 };
return (inputTokens * pricing.input) + (outputTokens * pricing.output);
}
async function recordLLMCall({ model, feature, inputTokens, outputTokens, ttftMs, totalMs, finishReason }) {
const labels = { model, feature, environment: process.env.NODE_ENV || 'production' };
ttftHistogram.observe(labels, ttftMs);
totalLatencyHistogram.observe(labels, totalMs);
inputTokensCounter.inc({ model, feature }, inputTokens);
outputTokensCounter.inc({ model, feature }, outputTokens);
costCounter.inc({ model, feature }, calculateCost(model, inputTokens, outputTokens));
finishReasonCounter.inc({ model, finish_reason: finishReason });
}
function recordFaithfulnessScore(model, feature, score) {
faithfulnessHistogram.observe({ model, feature }, score);
}
module.exports = {
registry: promClient.register,
recordLLMCall,
recordFaithfulnessScore,
};
Expose the metrics endpoint in your Express app:
// app.js
const express = require('express');
const OpenAI = require('openai');
const { registry, recordLLMCall } = require('./metrics');
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const app = express();
// Prometheus scrape endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', registry.contentType);
res.end(await registry.metrics());
});
// Example: wrapping an LLM call with metric recording
app.post('/api/chat', async (req, res) => {
const start = Date.now();
let ttft = null;
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
stream: true,
stream_options: { include_usage: true }, // ask the API to send usage on the final chunk
messages: [{ role: 'user', content: req.body.message }],
});
let fullResponse = '';
let usage = null;
let finishReason = 'unknown';
for await (const chunk of stream) {
if (!ttft && chunk.choices[0]?.delta?.content) {
ttft = Date.now() - start; // capture TTFT on first content chunk
}
fullResponse += chunk.choices[0]?.delta?.content || '';
if (chunk.choices[0]?.finish_reason) {
finishReason = chunk.choices[0].finish_reason;
}
if (chunk.usage) {
usage = chunk.usage; // only present on the final chunk when include_usage is set
}
}
const totalMs = Date.now() - start;
await recordLLMCall({
model: 'gpt-4o',
feature: 'chat',
inputTokens: usage?.prompt_tokens || 0,
outputTokens: usage?.completion_tokens || 0,
ttftMs: ttft || totalMs,
totalMs,
finishReason,
});
res.json({ response: fullResponse });
});
Python
pip install prometheus-client openai
# metrics.py
import os
import time
from prometheus_client import Counter, Histogram, CollectorRegistry, generate_latest, CONTENT_TYPE_LATEST
registry = CollectorRegistry()
ttft_histogram = Histogram(
'llm_time_to_first_token_ms',
'Time to first token in milliseconds',
['model', 'feature', 'environment'],
buckets=[100, 250, 500, 1000, 2000, 5000, 10000],
registry=registry,
)
total_latency_histogram = Histogram(
'llm_total_latency_ms',
'Total LLM response latency in milliseconds',
['model', 'feature', 'environment'],
buckets=[500, 1000, 2000, 5000, 10000, 30000],
registry=registry,
)
input_tokens_counter = Counter(
'llm_input_tokens_total',
'Total input tokens consumed',
['model', 'feature'],
registry=registry,
)
output_tokens_counter = Counter(
'llm_output_tokens_total',
'Total output tokens generated',
['model', 'feature'],
registry=registry,
)
cost_counter = Counter(
'llm_cost_usd_total',
'Total LLM cost in USD',
['model', 'feature'],
registry=registry,
)
finish_reason_counter = Counter(
'llm_finish_reason_total',
'LLM finish reason distribution',
['model', 'finish_reason'],
registry=registry,
)
faithfulness_histogram = Histogram(
'llm_faithfulness_score',
'Faithfulness score from evaluation pipeline (0-1)',
['model', 'feature'],
buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
registry=registry,
)
MODEL_PRICING = {
'gpt-4o': {'input': 0.0000025, 'output': 0.00001},
'gpt-4o-mini': {'input': 0.00000015, 'output': 0.0000006},
}
def record_llm_call(model: str, feature: str, input_tokens: int, output_tokens: int,
ttft_ms: float, total_ms: float, finish_reason: str):
env = os.getenv('ENVIRONMENT', 'production')
labels = {'model': model, 'feature': feature, 'environment': env}
ttft_histogram.labels(**labels).observe(ttft_ms)
total_latency_histogram.labels(**labels).observe(total_ms)
input_tokens_counter.labels(model=model, feature=feature).inc(input_tokens)
output_tokens_counter.labels(model=model, feature=feature).inc(output_tokens)
pricing = MODEL_PRICING.get(model, {'input': 0, 'output': 0})
cost = (input_tokens * pricing['input']) + (output_tokens * pricing['output'])
cost_counter.labels(model=model, feature=feature).inc(cost)
finish_reason_counter.labels(model=model, finish_reason=finish_reason).inc()
def record_faithfulness_score(model: str, feature: str, score: float):
faithfulness_histogram.labels(model=model, feature=feature).observe(score)
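The Python module imports generate_latest but still needs an endpoint that serves it. A minimal sketch of the scrape endpoint using prometheus_client's built-in server; it is shown with an inline registry so it runs standalone, but in the real app you would import `registry` from metrics.py instead:

```python
# expose_metrics.py — serve the custom registry for Prometheus to scrape.
from prometheus_client import CollectorRegistry, Counter, generate_latest, start_http_server

registry = CollectorRegistry()
input_tokens = Counter('llm_input_tokens_total', 'Total input tokens consumed',
                       ['model', 'feature'], registry=registry)
input_tokens.labels(model='gpt-4o', feature='chat').inc(1200)

# generate_latest renders the Prometheus text exposition format:
print(generate_latest(registry).decode())

# In production, serve the registry on its own port instead of printing:
# start_http_server(9100, registry=registry)
```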
C#
dotnet add package prometheus-net
dotnet add package prometheus-net.AspNetCore
// LlmMetrics.cs
using Prometheus;
public static class LlmMetrics
{
private static readonly string[] LabelNames = ["model", "feature", "environment"];
public static readonly Histogram TtftHistogram = Metrics.CreateHistogram(
"llm_time_to_first_token_ms",
"Time to first token in milliseconds",
new HistogramConfiguration
{
LabelNames = LabelNames,
Buckets = [100, 250, 500, 1000, 2000, 5000, 10000]
});
public static readonly Histogram TotalLatencyHistogram = Metrics.CreateHistogram(
"llm_total_latency_ms",
"Total LLM response latency in milliseconds",
new HistogramConfiguration
{
LabelNames = LabelNames,
Buckets = [500, 1000, 2000, 5000, 10000, 30000]
});
public static readonly Counter InputTokensCounter = Metrics.CreateCounter(
"llm_input_tokens_total",
"Total input tokens consumed",
new CounterConfiguration { LabelNames = ["model", "feature"] });
public static readonly Counter OutputTokensCounter = Metrics.CreateCounter(
"llm_output_tokens_total",
"Total output tokens generated",
new CounterConfiguration { LabelNames = ["model", "feature"] });
public static readonly Counter CostCounter = Metrics.CreateCounter(
"llm_cost_usd_total",
"Total LLM cost in USD",
new CounterConfiguration { LabelNames = ["model", "feature"] });
public static readonly Counter FinishReasonCounter = Metrics.CreateCounter(
"llm_finish_reason_total",
"LLM finish reason distribution",
new CounterConfiguration { LabelNames = ["model", "finish_reason"] });
public static readonly Histogram FaithfulnessHistogram = Metrics.CreateHistogram(
"llm_faithfulness_score",
"Faithfulness score from evaluation pipeline (0-1)",
new HistogramConfiguration
{
LabelNames = ["model", "feature"],
Buckets = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
});
private static readonly Dictionary<string, (double Input, double Output)> ModelPricing = new()
{
["gpt-4o"] = (0.0000025, 0.00001),
["gpt-4o-mini"] = (0.00000015, 0.0000006),
};
public static void RecordLlmCall(string model, string feature,
long inputTokens, long outputTokens,
double ttftMs, double totalMs, string finishReason)
{
var env = Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT") ?? "Production";
var labelValues = new[] { model, feature, env };
TtftHistogram.WithLabels(labelValues).Observe(ttftMs);
TotalLatencyHistogram.WithLabels(labelValues).Observe(totalMs);
InputTokensCounter.WithLabels(model, feature).Inc(inputTokens);
OutputTokensCounter.WithLabels(model, feature).Inc(outputTokens);
if (ModelPricing.TryGetValue(model, out var pricing))
{
var cost = (inputTokens * pricing.Input) + (outputTokens * pricing.Output);
CostCounter.WithLabels(model, feature).Inc(cost);
}
FinishReasonCounter.WithLabels(model, finishReason).Inc();
}
}
Register the Prometheus endpoint in Program.cs:
// Program.cs - add to middleware pipeline
app.UseRouting();
app.UseHttpMetrics(); // auto-instruments HTTP request metrics
app.MapMetrics(); // exposes /metrics endpoint for Prometheus scraping
Grafana Dashboard: Essential Panels
Once Prometheus is scraping your metrics endpoint, create a Grafana dashboard with these panels as your baseline. These are the minimum panels that give you meaningful production visibility.
| Panel | Query | Alert Threshold |
|---|---|---|
| TTFT p95 | histogram_quantile(0.95, rate(llm_time_to_first_token_ms_bucket[5m])) | > 3000ms |
| Total latency p99 | histogram_quantile(0.99, rate(llm_total_latency_ms_bucket[5m])) | > 15000ms |
| Cost per hour | increase(llm_cost_usd_total[1h]) | > budget threshold |
| Tokens per request (avg) | sum(rate(llm_input_tokens_total[5m])) / sum(rate(llm_finish_reason_total[5m])) | > 2x baseline |
| Length finish rate | sum(rate(llm_finish_reason_total{finish_reason="length"}[5m])) / sum(rate(llm_finish_reason_total[5m])) | > 5% |
| Faithfulness p10 | histogram_quantile(0.10, rate(llm_faithfulness_score_bucket[30m])) | < 0.6 |
Alert Runbooks: What to Do When Metrics Spike
Alerts without runbooks create on-call fatigue. Here are the first steps to take for each of the key alerts above.
TTFT p95 spike: Check if the provider status page shows degradation. Then look at whether input token count increased around the same time — longer prompts push TTFT up. If neither, check your retrieval layer latency using the traces from Part 2.
Cost per hour spike: Pull the cost breakdown by feature label first. Isolate which feature is driving the increase. Then check that feature’s average input token count — a prompt change or retrieval configuration change is usually the cause.
Length finish rate above 5%: The model is being cut off before it finishes. Increase max_tokens on the affected feature, or investigate whether recent prompt changes added too much context and are consuming the token budget before the response is complete.
Faithfulness score degradation: This almost always points to the retrieval layer. Check whether the embedding model or vector index was recently changed. Run a manual test with a known query and inspect what documents are being retrieved. Part 6 of this series covers RAG pipeline observability in depth.
What Comes Next
You now have a full operational and quality metrics layer feeding into dashboards and alerts. In Part 4, we go deeper into the quality side — building a proper LLM-as-judge evaluation pipeline that runs continuously in production and scores your outputs for faithfulness, relevance, and safety at scale.
Key Takeaways
- LLM metrics split into operational (latency, tokens, cost, error rates) and quality (faithfulness, drift, refusal rate) — you need both
- Track TTFT separately from total latency — they reveal different problems and matter to different stakeholders
- Always track finish reason distribution — a rising length rate is an early warning for output quality degradation
- Cost per request must be a first-class metric with alerting — token costs compound silently and invisibly
- Response length distribution is the cheapest drift detector available — measure it from day one
- Faithfulness score at the p10 percentile is more useful than the average — the tail of low-quality responses is where your worst user experiences live
