Evaluating LLM Output Quality in Production: LLM-as-Judge and Human Feedback Loops

Parts 2 and 3 of this series gave you distributed tracing and a metrics layer. Those two things together tell you a lot: where time is spent, what responses cost, whether the model is being cut off, whether latency is drifting. But none of that tells you whether the response was actually any good.

That is the job of evaluation. And in production, evaluation cannot be a quarterly manual review or a pre-launch QA pass. It needs to run continuously, score a meaningful sample of live traffic, and feed its findings back into the same dashboards your team watches every day.

This post builds that pipeline. We start with LLM-as-judge for automated scoring at scale, then layer in a human feedback loop for the cases automation cannot handle reliably.

Why Traditional Metrics Are Not Enough

BLEU and ROUGE scores measure n-gram overlap between a generated response and a reference answer. They were designed for machine translation and summarization tasks where a reference exists. For open-ended LLM applications — RAG assistants, code generation, customer support bots — there is rarely a single correct reference answer, and surface-level overlap with one that exists tells you almost nothing about whether the response was useful, accurate, or safe.

A response that says “Based on our documentation, the refund window is 30 days” scores zero on BLEU if the reference says “Customers have 30 days to request a refund.” The meaning is identical. BLEU disagrees. Meanwhile a response that hallucinates a plausible-sounding policy in fluent, on-brand language scores well on perplexity metrics while being completely wrong.
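To make that failure concrete, here is a toy n-gram precision calculation on roughly that pair of sentences. This is a simplified stand-in for BLEU (no clipping, no brevity penalty, no geometric mean), just enough to show how little surface overlap survives a paraphrase:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = set(ngrams(reference.lower().split(), n))
    return sum(1 for g in cand if g in ref) / len(cand)

candidate = "based on our documentation the refund window is 30 days"
reference = "customers have 30 days to request a refund"

print(ngram_precision(candidate, reference, 1))            # 0.3 -- only "refund", "30", "days" match
print(round(ngram_precision(candidate, reference, 2), 3))  # 0.111 -- only ("30", "days")
print(ngram_precision(candidate, reference, 3))            # 0.0 -- no shared trigram at all
```

Full BLEU takes a geometric mean over n-grams up to 4, so the zero at n=3 alone drives the score toward zero for two sentences a human would call equivalent.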

LLM-as-judge solves this by using a capable model to evaluate quality the way a human would — reading the output, understanding its context, and scoring it against criteria that matter: faithfulness to retrieved context, relevance to the user query, completeness, tone, and safety.

What LLM-as-Judge Can and Cannot Do

Before building the pipeline, it is worth being honest about the limits. A well-calibrated LLM judge typically reaches roughly 80% agreement with human evaluators, comparable to human-to-human inter-annotator agreement on the same tasks. At 10,000 evaluations per month, that can translate into cost savings on the order of $50,000 to $100,000 compared to human review at scale.

But LLM judges have known failure modes you need to design around:

  • Position bias — judges tend to prefer responses that appear first in pairwise comparisons. Mitigate by running both orderings and averaging
  • Self-serving bias — a GPT-based judge tends to favor GPT-based responses. Use a judge from a different model family than the one being evaluated when possible
  • Score compression — judges cluster scores in the middle of a wide scale. Use narrower scales (1-5 or binary pass/fail) with calibration examples for each level
  • Inability to verify external facts — a judge cannot reliably detect hallucinations about specific private data it was not trained on. Combine with faithfulness checks against retrieved context instead
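As a concrete sketch of the position-bias mitigation, the function below runs a pairwise comparison in both presentation orders and averages the result. `judge_prefers_first` is a hypothetical stand-in for a real judge call, not an API from any library:

```python
def debiased_preference(response_a, response_b, judge_prefers_first):
    """Score for response_a in [0, 1], averaged over both presentation orders.

    judge_prefers_first(first, second) -> bool is a stand-in for an LLM
    judge call; True means the judge picked the response it saw first.
    """
    a_first = 1.0 if judge_prefers_first(response_a, response_b) else 0.0
    # Swap the order; a win for response_a now means the judge said "no"
    b_first = 0.0 if judge_prefers_first(response_b, response_a) else 1.0
    return (a_first + b_first) / 2

# A pathologically position-biased judge that always picks whatever it
# saw first cancels out to a neutral 0.5 under this scheme:
always_first = lambda first, second: True
print(debiased_preference("resp A", "resp B", always_first))  # 0.5
```

A genuinely preferred response still wins both orderings and scores 1.0, so the averaging only suppresses the positional component of the signal.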

With those constraints in mind, LLM-as-judge is the right tool for: relevance, coherence, tone adherence, faithfulness to provided context, format compliance, and safety screening. It is the wrong tool for: verifying claims against proprietary databases, high-stakes legal or medical decisions, and cases where 80% accuracy is not acceptable.

The Evaluation Pipeline Architecture

flowchart TD
    A[Production LLM Response] --> B{Sampling Gate\n5-10% of traffic}
    B -->|Sampled| C[Evaluation Queue\nAsync processing]
    B -->|Not sampled| Z[Pass through]

    C --> D[Automated Evaluators]
    D --> D1[Faithfulness Judge\nContext vs Response]
    D --> D2[Relevance Judge\nQuery vs Response]
    D --> D3[Safety Classifier\nToxicity / PII]
    D --> D4[Format Validator\nSchema / Length rules]

    D1 --> E[Score Aggregator]
    D2 --> E
    D3 --> E
    D4 --> E

    E --> F{Score Below\nThreshold?}
    F -->|Yes - Low score| G[Human Review Queue]
    F -->|No - Acceptable| H[Metrics Store\nPrometheus / Langfuse]

    G --> I[Human Reviewer\nAnnotation UI]
    I --> J[Human Label\nPass / Fail + Reason]
    J --> H
    J --> K[Calibration Dataset\nGolden Set Updates]

    K --> L[Judge Prompt\nRefinement]
    L --> D

    H --> M[Grafana Dashboard\nQuality Trends]
    H --> N[Alerts\nScore Degradation]

    style C fill:#1e3a5f,color:#ffffff
    style E fill:#1e3a5f,color:#ffffff
    style K fill:#2d4a1e,color:#ffffff

The key design decisions in this architecture are worth calling out explicitly. Evaluation runs asynchronously and never blocks the user response path. Only a sample of traffic is evaluated — typically 5 to 10 percent — because evaluating every request is expensive and unnecessary once you have statistical confidence in your quality distribution. Low-scoring responses are routed to a human review queue rather than discarded, and human labels flow back into a calibration dataset that improves the automated evaluators over time. This is the feedback loop that makes the system self-improving.
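The claim that a 5 to 10 percent sample is enough can be made precise with the normal-approximation margin of error for a proportion. The traffic numbers below are illustrative:

```python
import math

def margin_of_error(pass_rate, n, z=1.96):
    """95% confidence margin of error (normal approximation) for an
    observed pass rate over n evaluated responses."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

# At 100k requests/day, a 5% sample yields ~5,000 evaluations.
# With an observed 90% pass rate, the estimate is already tight:
print(round(margin_of_error(0.90, 5000), 4))  # 0.0083, i.e. +/- 0.83%
```

Past that point, evaluating more traffic narrows the interval only slowly (the error shrinks with the square root of n), which is why full-coverage evaluation buys little beyond what the sample already tells you.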

Building the Judge Prompt

The quality of your evaluation pipeline lives or dies on the judge prompt. A vague prompt produces vague, inconsistent scores. A well-structured prompt with calibration examples produces reliable, actionable scores.

Here is a production-tested faithfulness judge prompt for a RAG application:

FAITHFULNESS_JUDGE_PROMPT = """
You are an expert evaluator assessing whether an AI assistant's response is faithful 
to the provided context documents. Faithfulness means the response only makes claims 
that are directly supported by the context. A response that adds information not in 
the context -- even if that information is factually correct -- is unfaithful.

## Context Documents
{context}

## User Query
{query}

## AI Response
{response}

## Evaluation Instructions
1. Read the context documents carefully
2. Identify every factual claim in the AI response
3. For each claim, determine if it is directly supported by the context
4. Score the response on the following scale:

Score 1 (FAIL): The response contains one or more major claims not supported by context
Score 2 (FAIL): The response mixes supported claims with minor unsupported additions  
Score 3 (PASS): All major claims are supported; minor elaboration acceptable
Score 4 (PASS): All claims are directly grounded in the context with good specificity
Score 5 (PASS): Perfect faithfulness -- every claim traces directly to a context passage

## Calibration Examples
Score 1 example: Context says "warranty is 1 year." Response says "warranty is 2 years and covers accidental damage."
Score 3 example: Context says "warranty is 1 year." Response says "your product has a one-year warranty from the purchase date."
Score 5 example: Context says "warranty is 1 year from purchase date per section 4.2." Response references the exact same terms.

## Output Format
Respond with valid JSON only. No preamble or explanation outside the JSON object.
{
  "score": <1-5>,
  "pass": <true|false>,
  "reasoning": "<one or two sentences explaining the score>",
  "unsupported_claims": ["<claim 1>", "<claim 2>"] // empty array if none
}
"""

Node.js Implementation

// evaluator.js
const OpenAI = require('openai');
const { recordFaithfulnessScore } = require('./metrics'); // from Part 3

const judgeClient = new OpenAI({ apiKey: process.env.JUDGE_API_KEY });

const FAITHFULNESS_PROMPT = `
You are an expert evaluator assessing whether an AI assistant's response is faithful
to the provided context documents.

## Context Documents
{context}

## User Query
{query}

## AI Response
{response}

Score on a 1-5 scale where 1=completely unfaithful, 5=perfectly faithful.
Respond ONLY with valid JSON: {"score": <1-5>, "pass": <true|false>, "reasoning": "<brief explanation>", "unsupported_claims": []}
`;

async function evaluateFaithfulness({ query, context, response, model, feature }) {
  const prompt = FAITHFULNESS_PROMPT
    .replace('{context}', context.slice(0, 3000)) // limit context to avoid judge token explosion
    .replace('{query}', query)
    .replace('{response}', response);

  try {
    const evaluation = await judgeClient.chat.completions.create({
      model: 'gpt-4o', // use different family from evaluated model when possible
      temperature: 0,  // deterministic scoring
      response_format: { type: 'json_object' },
      messages: [{ role: 'user', content: prompt }],
    });

    const result = JSON.parse(evaluation.choices[0].message.content);

    // Record to Prometheus metrics (from Part 3)
    recordFaithfulnessScore(model, feature, result.score / 5);

    return {
      score: result.score,
      pass: result.pass,
      reasoning: result.reasoning,
      unsupportedClaims: result.unsupported_claims || [],
      needsHumanReview: result.score <= 2,
    };

  } catch (err) {
    console.error('Evaluation failed:', err.message);
    return null;
  }
}

// Sampling gate -- only evaluate a percentage of traffic
function shouldEvaluate(sampleRate = 0.05) {
  return Math.random() < sampleRate;
}

// Async evaluation wrapper -- never blocks the response path
async function evaluateAsync(evalPayload) {
  if (!shouldEvaluate()) return;

  setImmediate(async () => {
    const result = await evaluateFaithfulness(evalPayload);

    if (result?.needsHumanReview) {
      await enqueueForHumanReview({ ...evalPayload, autoScore: result });
    }
  });
}

async function enqueueForHumanReview(payload) {
  // Implement with your queue of choice: SQS, Redis, database queue
  console.log('[HumanReview] Queued for review:', {
    feature: payload.feature,
    score: payload.autoScore?.score,
    reason: payload.autoScore?.reasoning,
  });
}

module.exports = { evaluateAsync, evaluateFaithfulness };

Python Implementation

# evaluator.py
import os
import json
import asyncio
import random
from openai import AsyncOpenAI
from metrics import record_faithfulness_score  # from Part 3

judge_client = AsyncOpenAI(api_key=os.getenv("JUDGE_API_KEY"))

FAITHFULNESS_PROMPT = """
You are an expert evaluator assessing whether an AI assistant's response is faithful
to the provided context documents.

## Context Documents
{context}

## User Query
{query}

## AI Response
{response}

Score on a 1-5 scale where 1=completely unfaithful, 5=perfectly faithful.
Respond ONLY with valid JSON: {{"score": <1-5>, "pass": <true|false>, "reasoning": "<brief explanation>", "unsupported_claims": []}}
"""

RELEVANCE_PROMPT = """
You are an expert evaluator assessing whether an AI response directly and completely
answers the user's query.

## User Query
{query}

## AI Response
{response}

Score on a 1-5 scale where 1=completely irrelevant, 5=perfectly relevant and complete.
Respond ONLY with valid JSON: {{"score": <1-5>, "pass": <true|false>, "reasoning": "<brief explanation>"}}
"""

async def evaluate_faithfulness(query: str, context: str, response: str,
                                  model: str, feature: str) -> dict | None:
    prompt = FAITHFULNESS_PROMPT.format(
        context=context[:3000],
        query=query,
        response=response,
    )

    try:
        evaluation = await judge_client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": prompt}],
        )

        result = json.loads(evaluation.choices[0].message.content)
        record_faithfulness_score(model, feature, result["score"] / 5)

        return {
            "score": result["score"],
            "pass": result["pass"],
            "reasoning": result.get("reasoning", ""),
            "unsupported_claims": result.get("unsupported_claims", []),
            "needs_human_review": result["score"] <= 2,
        }

    except Exception as e:
        print(f"Evaluation failed: {e}")
        return None


async def evaluate_relevance(query: str, response: str,
                              model: str, feature: str) -> dict | None:
    prompt = RELEVANCE_PROMPT.format(query=query, response=response)

    try:
        evaluation = await judge_client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": prompt}],
        )
        return json.loads(evaluation.choices[0].message.content)
    except Exception as e:
        print(f"Relevance eval failed: {e}")
        return None


def should_evaluate(sample_rate: float = 0.05) -> bool:
    return random.random() < sample_rate


async def evaluate_async(payload: dict):
    """Fire-and-forget evaluation -- call after response is sent to user."""
    if not should_evaluate():
        return

    faithfulness = await evaluate_faithfulness(
        query=payload["query"],
        context=payload["context"],
        response=payload["response"],
        model=payload["model"],
        feature=payload["feature"],
    )

    relevance = await evaluate_relevance(
        query=payload["query"],
        response=payload["response"],
        model=payload["model"],
        feature=payload["feature"],
    )

    if faithfulness and faithfulness["needs_human_review"]:
        await enqueue_for_human_review({**payload, "auto_score": faithfulness})


async def enqueue_for_human_review(payload: dict):
    # Implement with SQS, Celery, or a database queue
    print(f"[HumanReview] Queued: feature={payload['feature']} score={payload['auto_score']['score']}")

C# Implementation

// Evaluator.cs
using System.Text.Json;
using System.Text.Json.Serialization;
using Azure.AI.OpenAI;
using Microsoft.Extensions.Configuration;
using OpenAI.Chat;

public class LlmEvaluator
{
    private readonly AzureOpenAIClient _judgeClient;
    private readonly Random _random = new();

    private const string FaithfulnessPromptTemplate = """
        You are an expert evaluator assessing whether an AI assistant's response
        is faithful to the provided context documents.

        ## Context Documents
        {context}

        ## User Query
        {query}

        ## AI Response
        {response}

        Score 1-5 where 1=unfaithful, 5=perfectly faithful.
        Respond ONLY with valid JSON: {"score": <1-5>, "pass": <true|false>, "reasoning": "<brief explanation>", "unsupported_claims": []}
        """;

    public LlmEvaluator(IConfiguration config)
    {
        _judgeClient = new AzureOpenAIClient(
            new Uri(config["AzureOpenAI:Endpoint"]!),
            new System.ClientModel.ApiKeyCredential(config["AzureOpenAI:ApiKey"]!)
        );
    }

    public async Task<EvaluationResult?> EvaluateFaithfulnessAsync(
        string query, string context, string response, string model, string feature)
    {
        var prompt = FaithfulnessPromptTemplate
            .Replace("{context}", context.Length > 3000 ? context[..3000] : context)
            .Replace("{query}", query)
            .Replace("{response}", response);

        try
        {
            var chatClient = _judgeClient.GetChatClient("gpt-4o");
            var completion = await chatClient.CompleteChatAsync(
                [new UserChatMessage(prompt)],
                new ChatCompletionOptions { Temperature = 0 }
            );

            var json = completion.Value.Content[0].Text;
            var result = JsonSerializer.Deserialize<EvaluationResult>(json)!;

            // Record to Prometheus (from Part 3)
            LlmMetrics.FaithfulnessHistogram
                .WithLabels(model, feature)
                .Observe(result.Score / 5.0);

            return result;
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Evaluation failed: {ex.Message}");
            return null;
        }
    }

    public bool ShouldEvaluate(double sampleRate = 0.05)
        => _random.NextDouble() < sampleRate;

    // Fire-and-forget -- never awaited on the response path
    public void EvaluateFireAndForget(EvaluationPayload payload)
    {
        if (!ShouldEvaluate()) return;

        _ = Task.Run(async () =>
        {
            var result = await EvaluateFaithfulnessAsync(
                payload.Query, payload.Context,
                payload.Response, payload.Model, payload.Feature
            );

            if (result is { Score: <= 2 })
                await EnqueueForHumanReviewAsync(payload, result);
        });
    }

    private Task EnqueueForHumanReviewAsync(EvaluationPayload payload, EvaluationResult result)
    {
        Console.WriteLine($"[HumanReview] Queued: feature={payload.Feature} score={result.Score}");
        // Implement with Azure Service Bus, Redis Streams, or EF Core queue table
        return Task.CompletedTask;
    }
}

public record EvaluationResult(
    [property: JsonPropertyName("score")] int Score,
    [property: JsonPropertyName("pass")] bool Pass,
    [property: JsonPropertyName("reasoning")] string Reasoning,
    [property: JsonPropertyName("unsupported_claims")] List<string> UnsupportedClaims
);

public record EvaluationPayload(
    string Query, string Context, string Response, string Model, string Feature
);

The Human Feedback Loop

Automation handles the volume. Human reviewers handle what automation gets wrong -- edge cases, domain-specific nuance, and novel failure modes the judge prompt was not calibrated for. The human feedback loop has three distinct roles in the system.

The first role is catching automation errors. Low-scoring responses routed to human review often reveal judge miscalibration -- cases where the automated score is wrong in a systematic way. These cases become calibration examples that improve the judge prompt.

The second role is building a golden dataset. Human labels on sampled production traffic are the most valuable data your team can collect. A golden set of 200 to 500 human-labeled examples lets you measure judge calibration, run regression testing on prompt changes, and detect quality degradation over time against a fixed reference.
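Once a golden set exists, measuring judge calibration is a direct comparison of judge labels against human labels. A minimal sketch using raw agreement plus Cohen's kappa, which corrects for chance agreement on binary pass/fail labels (the label arrays here are hypothetical):

```python
def judge_calibration(judge_labels, human_labels):
    """Raw agreement and Cohen's kappa for binary pass/fail labels (1/0)."""
    assert len(judge_labels) == len(human_labels)
    n = len(human_labels)
    agreement = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Chance agreement: both say pass, plus both say fail, by base rates
    p_judge = sum(judge_labels) / n
    p_human = sum(human_labels) / n
    p_chance = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    kappa = (agreement - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
    return agreement, kappa

# Hypothetical labels from a small golden set: 1 = pass, 0 = fail
judge = [1, 1, 0, 1, 0, 1, 1, 0]
human = [1, 1, 0, 0, 0, 1, 1, 1]
agreement, kappa = judge_calibration(judge, human)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Tracking kappa rather than raw agreement matters when your pass rate is high: a judge that blindly passes everything can score 90% agreement on a 90%-pass dataset while its kappa sits near zero.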

The third role is implicit feedback integration. Not every user will click a thumbs down button, but users do signal quality through their behavior. A user who immediately rephrases their query is signaling a bad response. A user who copies the response and moves on is signaling a good one. Capture these implicit signals alongside explicit ratings in your human feedback pipeline.
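A cheap rephrase heuristic illustrates the implicit-signal idea: if the same user submits a similar query shortly after the last one, count it as a negative signal for the previous response. The token-set Jaccard similarity, the 30-second window, and the 0.5 threshold below are illustrative assumptions, not recommendations:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two queries."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def is_rephrase(prev_query, prev_ts, query, ts,
                window_s=30, min_similarity=0.5):
    """Implicit negative signal: a similar query soon after the previous
    one suggests the earlier response did not satisfy the user."""
    return (ts - prev_ts) <= window_s and jaccard(prev_query, query) >= min_similarity

print(is_rephrase("what is the refund window", 0.0,
                  "what is the refund window for orders", 12.0))  # True
print(is_rephrase("what is the refund window", 0.0,
                  "how do I reset my password", 12.0))            # False
```

In production you would likely want an embedding-based similarity rather than token overlap, but even this crude version surfaces sessions worth routing into the human review queue.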

Recommended Evaluation Dimensions by Application Type

| Application Type | Primary Dimensions | Human Review Trigger |
| --- | --- | --- |
| RAG assistant | Faithfulness, context relevance, completeness | Faithfulness score <= 2 |
| Customer support bot | Relevance, tone adherence, policy compliance | Any content_filter finish reason |
| Code generation | Correctness (run tests), security, style | Static analysis warnings |
| Summarization | Factual coverage, brevity, no hallucination | Coverage score < 0.6 |
| Agentic workflows | Goal completion, tool selection quality, loop detection | More than 5 tool calls on a single task |

Connecting to Langfuse for Production Evaluation

Langfuse supports online evaluation rules natively. You define a judge model and scoring criteria, and it automatically scores a percentage of live traces without any additional code in your application. The snippet below sketches the shape of such a rule via the public API; the exact endpoint and field names evolve between Langfuse versions, so check the current API reference before relying on them:

# Create an online evaluation rule via Langfuse API
# LANGFUSE_AUTH is a base64-encoded "public_key:secret_key" pair
import requests

rule = {
    "name": "faithfulness-online",
    "type": "llm",
    "samplingRate": 0.05,
    "scoreName": "faithfulness",
    "model": "gpt-4o",
    "prompt": """
    Given this context: {{context}}
    And this response: {{output}}
    Rate faithfulness 1-5. Return JSON: {"score": <1-5>, "reasoning": "<brief explanation>"}
    """
}

response = requests.post(
    "https://cloud.langfuse.com/api/public/score-configs",
    headers={"Authorization": f"Basic {LANGFUSE_AUTH}"},
    json=rule,
)
print(response.json())

What Comes Next

You now have a continuous evaluation loop running on live traffic -- automated scoring at scale with human review routing for low-confidence cases. In Part 5, we shift from measuring quality to controlling the inputs that drive it: prompt management and versioning, treating your prompts as production code with the same rigor you give application code.

Key Takeaways

  • BLEU and ROUGE measure surface overlap, not meaning -- LLM-as-judge is the right tool for open-ended quality evaluation
  • A well-calibrated LLM judge achieves roughly 80% agreement with human evaluators at 500x to 5000x lower cost
  • Always run evaluation asynchronously and never block the user response path
  • Use a judge from a different model family than the model being evaluated to reduce self-serving bias
  • Use narrow scoring scales (1-5 or binary) with calibration examples -- wide scales produce compressed, useless distributions
  • Route low-scoring responses to human review and use human labels to build a golden dataset that improves your judges over time
  • Implicit user signals (query rephrasing, copy behavior, session abandonment) are valuable feedback even without explicit ratings
