Building a Complete LLMOps Stack: From Zero to Production-Grade Observability

Over the past seven posts we built each layer of LLM observability separately: distributed tracing with OpenTelemetry, the metrics that actually matter, LLM-as-judge evaluation pipelines, prompt versioning with quality gates, RAG pipeline instrumentation, and cost governance with per-tenant budget enforcement. Each layer is useful on its own. Together they form a coherent operational system — the kind that lets a team confidently say they know what their LLM application is doing in production, why it sometimes goes wrong, and how to fix it faster than the next incident.

This final post does three things. It assembles the complete reference architecture showing how every layer connects. It provides a phased implementation checklist for teams starting from zero, ordered by return on investment. And it gives you the Docker Compose configuration to stand up the full open-source observability stack locally in under ten minutes.

The Complete LLMOps Stack: Reference Architecture

flowchart TD
    subgraph APP["Application Layer"]
        A1[LLM Chat Handler]
        A2[RAG Pipeline]
        A3[Agentic Workflows]
    end

    subgraph PROMPT["Prompt Management Layer\n(Part 5)"]
        P1[Langfuse Prompt Registry\nVersioned + Immutable]
        P2[Canary Deployment\nLabel-based routing]
        P3[CI/CD Eval Gate\nGolden dataset tests]
    end

    subgraph INST["Instrumentation Layer\n(Parts 2-3)"]
        I1[OpenTelemetry SDK\nNode.js / Python / C#]
        I2[Span Attributes\nGenAI semantic conventions]
        I3[Prometheus Metrics\nLatency, tokens, cost, quality]
    end

    subgraph EVAL["Evaluation Layer\n(Part 4)"]
        E1[LLM-as-Judge\nFaithfulness, Relevance]
        E2[Human Review Queue\nLow-score routing]
        E3[Golden Dataset\nCalibration + regression]
    end

    subgraph RAG["RAG Observability\n(Part 6)"]
        R1[Per-stage Spans\nEmbed, Search, Rank, Assemble]
        R2[Quality Signals\nScore spread, precision, recall]
        R3[Embedding Version Guard\nMismatch detection]
    end

    subgraph COST["Cost Governance\n(Part 7)"]
        C1[Cost Middleware\nPer-feature, per-tenant]
        C2[Redis Budget Enforcement\nSoft + hard limits]
        C3[Anomaly Detection\n2.5x baseline alert]
    end

    subgraph OBS["Observability Backend"]
        O1[OpenTelemetry Collector\nOTLP receiver]
        O2[Langfuse\nTraces + evals + prompts]
        O3[Prometheus\nMetrics store]
        O4[Grafana\nDashboards + alerts]
    end

    APP --> PROMPT
    APP --> INST
    APP --> COST
    INST --> RAG
    INST --> EVAL
    INST --> O1
    COST --> O3
    EVAL --> O2
    RAG --> O1
    O1 --> O2
    O1 --> O3
    O3 --> O4
    O2 --> O4

    PROMPT --> P3
    P3 --> E1

    style APP fill:#0d1b2e,color:#ffffff
    style OBS fill:#1a2d1a,color:#ffffff
    style COST fill:#2d1a0d,color:#ffffff

Tool Selection Reference

There is no single correct LLMOps stack. The tools below represent the open-source default choices used throughout this series. Enterprise alternatives exist at each layer for teams with stricter data residency, SSO, or compliance requirements.

| Layer | Open-source default | Enterprise alternative | Purpose |
|---|---|---|---|
| Distributed tracing | OpenTelemetry SDK + Jaeger | Datadog, Dynatrace | Span-level visibility across every request |
| LLM observability | Langfuse (self-hosted) | Arize AI, LangSmith | Trace storage, prompt management, eval scoring |
| Metrics collection | Prometheus | Azure Monitor, Datadog | Time-series metrics for latency, cost, quality |
| Dashboards + alerts | Grafana | Grafana Cloud, Datadog | Visualization and threshold alerts |
| LLM gateway | LiteLLM | Azure APIM, AWS Bedrock | Model routing, rate limiting, cost tagging |
| Prompt registry | Langfuse prompts | PromptLayer, LangSmith Hub | Versioned prompt storage with label-based routing |
| Evaluation | Custom judge + Langfuse | Braintrust, Maxim AI | Automated quality scoring on live traffic |
| Vector store | Qdrant | Pinecone, Azure AI Search | Semantic retrieval for RAG pipelines |
| Cache | Redis | Azure Cache for Redis | Budget counters, prompt cache, semantic cache |
| Queue (human review) | Redis Streams | AWS SQS, Azure Service Bus | Async routing of low-score responses to reviewers |

Local Stack: Docker Compose

The following configuration starts the full observability backend locally. It runs Langfuse for traces and prompt management, Prometheus for metrics, Grafana for dashboards, Jaeger for trace visualization, and Redis for budget enforcement and caching.

# docker-compose.yml
# Run: docker compose up -d
# Grafana:  http://localhost:3000  (admin / admin)
# Langfuse: http://localhost:3001
# Jaeger:   http://localhost:16686
# Prometheus: http://localhost:9090

services:

  # ── Langfuse: LLM traces, prompt management, eval scores ──────────────────
  langfuse-db:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: langfuse
      POSTGRES_DB: langfuse
    volumes:
      - langfuse_pg:/var/lib/postgresql/data

  langfuse:
    # Pin a specific version in production. Note: Langfuse v3+ also requires
    # ClickHouse, Redis, and object-storage sidecars — see the Langfuse
    # self-hosting docs for the full compose file.
    image: langfuse/langfuse:latest
    depends_on: [langfuse-db]
    ports: ["3001:3000"]
    environment:
      DATABASE_URL: postgresql://langfuse:langfuse@langfuse-db:5432/langfuse
      NEXTAUTH_URL: http://localhost:3001
      NEXTAUTH_SECRET: change-me-in-production
      SALT: change-me-in-production
      TELEMETRY_ENABLED: "false"

  # ── Prometheus: metrics collection ────────────────────────────────────────
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d

  # ── Grafana: dashboards and alerts ────────────────────────────────────────
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    depends_on: [prometheus]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning

  # ── Jaeger: distributed trace visualization ───────────────────────────────
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
    environment:
      COLLECTOR_OTLP_ENABLED: "true"

  # ── OpenTelemetry Collector: fan-out to Langfuse + Jaeger ─────────────────
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4319:4317"     # OTLP gRPC (app → collector)
      - "4320:4318"     # OTLP HTTP (app → collector)
    volumes:
      - ./otel-collector.yml:/etc/otelcol-contrib/config.yaml
    depends_on: [jaeger, langfuse]

  # ── Redis: budget enforcement, semantic cache, review queue ───────────────
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    volumes:
      - redis_data:/data

volumes:
  langfuse_pg:
  prometheus_data:
  grafana_data:
  redis_data:

# otel-collector.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  # Forward traces to Jaeger for visualization
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Forward traces to the self-hosted Langfuse via OTLP/HTTP.
  # Langfuse accepts OTLP over HTTP only; for Langfuse Cloud, point the
  # endpoint at https://cloud.langfuse.com/api/public/otel instead.
  otlphttp/langfuse:
    endpoint: http://langfuse:3000/api/public/otel
    headers:
      Authorization: "Basic ${LANGFUSE_AUTH_HEADER}"

  # Expose Prometheus metrics scrape endpoint
  prometheus:
    endpoint: "0.0.0.0:8889"

processors:
  batch:
    timeout: 5s
    send_batch_size: 512

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlphttp/langfuse]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8889"]

  - job_name: "llm-application"
    static_configs:
      - targets: ["host.docker.internal:8080"]  # your app's /metrics endpoint

Phased Implementation Checklist

Implementing every layer at once is neither practical nor necessary. The phases below are ordered by return on investment — each phase gives you meaningful operational visibility before the next one adds depth.

Phase 1: Foundation (Week 1-2) — Stop Flying Blind

  • Install OpenTelemetry SDK in your application (Node.js, Python, or C#)
  • Instrument every LLM call with GenAI semantic convention attributes: model, input tokens, output tokens, finish reason, TTFT
  • Start the Docker Compose stack above locally
  • Verify traces appear in Jaeger and Langfuse
  • Add basic Prometheus metrics: request count, latency histogram, token usage counters
  • Build a minimal Grafana dashboard: request rate, p95 latency, error rate, token usage by model
  • Set one alert: p95 latency above 5 seconds for 10 minutes

Outcome: You can answer “is the system healthy right now?” and “how much did we spend on tokens today?”
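
To make the instrumentation bullet concrete, here is a minimal Python sketch of the attributes set on each LLM-call span. The `gen_ai.*` attribute names follow the OpenTelemetry GenAI semantic conventions; the TTFT attribute name is illustrative, since TTFT has no standardized span attribute yet. A plain dict stands in for `span.set_attribute` calls to keep the example self-contained.

```python
# Sketch: build the GenAI semantic-convention attributes for one LLM call.
# Real instrumentation would set these on the active OpenTelemetry span.

def genai_span_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reason: str,
    ttft_ms: float,
) -> dict:
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": [finish_reason],
        "llm.time_to_first_token_ms": ttft_ms,  # custom attribute, not semconv
    }

attrs = genai_span_attributes("gpt-4o-mini", 812, 304, "stop", 420.0)
```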

Phase 2: Cost Governance (Week 2-3) — Control the Bill

  • Add cost middleware from Part 7: tag every API call with feature and tenant labels
  • Implement Redis-backed rolling spend counters with soft and hard daily budget limits
  • Add cost panels to Grafana: hourly cost by feature, daily cost by tenant
  • Set budget alerts: 80% of daily hard limit reached
  • Run spend anomaly detection as a background job every 5 minutes
  • Identify your top 3 cost drivers by feature — these are your first optimization targets

Outcome: No more end-of-month billing surprises. Runaway agents get caught within one anomaly detection cycle.
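
The budget check from Part 7 reduces to a counter plus two thresholds. The sketch below uses an in-memory dict where production code would use Redis `INCRBYFLOAT` on a per-tenant daily key; the limit values, key shapes, and function names here are illustrative, while the 2.5x anomaly multiplier matches the threshold used in this series.

```python
# Sketch of the per-tenant budget check; a plain dict stands in for Redis
# counters (production: INCRBYFLOAT on a key like spend:{tenant}:{date}).

DAILY_SOFT_LIMIT = 40.0   # USD: log a warning, keep serving
DAILY_HARD_LIMIT = 50.0   # USD: reject further LLM calls

spend: dict[str, float] = {}  # stand-in for Redis counters

def record_and_check(tenant: str, cost_usd: float) -> str:
    """Record one call's cost, return 'ok', 'soft_limit', or 'hard_limit'."""
    total = spend.get(tenant, 0.0) + cost_usd
    spend[tenant] = total
    if total >= DAILY_HARD_LIMIT:
        return "hard_limit"
    if total >= DAILY_SOFT_LIMIT:
        return "soft_limit"
    return "ok"

def is_spend_anomaly(current_hourly: float, baseline_hourly: float) -> bool:
    """Flag spend above 2.5x the trailing baseline (Part 7's threshold)."""
    return baseline_hourly > 0 and current_hourly > 2.5 * baseline_hourly
```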

Phase 3: Quality Signals (Week 3-5) — Know When It Goes Wrong

  • Build a faithfulness judge prompt calibrated to your domain (Part 4)
  • Implement async evaluation sampling at 5% of production traffic
  • Route low-scoring responses (score 1-2) to a human review queue
  • Add quality metrics to Prometheus: faithfulness score histogram, relevance score histogram
  • Add quality panels to Grafana alongside latency and cost panels
  • Collect 50 to 100 human-labeled examples in your first two weeks — this is your initial golden dataset
  • Set a quality alert: average faithfulness below 0.70 over a 1-hour window

Outcome: You know the quality distribution of your production responses and have a baseline to detect regressions against.
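
The sampling and routing bullets can be sketched in a few lines. Hashing the trace ID gives a deterministic ~5% sample (the same trace never flips between sampled and unsampled on retries), and a plain list stands in for the Redis Streams review queue; all names here are illustrative.

```python
# Sketch: deterministic 5% eval sampling plus low-score routing.
import hashlib

SAMPLE_PERCENT = 5
review_queue: list[dict] = []  # stand-in for a Redis Stream

def should_evaluate(trace_id: str) -> bool:
    """Hash-based sampling: the same trace always gets the same decision."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < SAMPLE_PERCENT

def route_by_score(trace_id: str, faithfulness: int) -> None:
    """Scores on a 1-5 scale; 1 and 2 go to the human review queue."""
    if faithfulness <= 2:
        review_queue.append({"trace_id": trace_id, "score": faithfulness})
```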

Phase 4: Prompt Governance (Week 4-6) — Safe Iteration

  • Move all prompts out of source code into the Langfuse prompt registry
  • Tag every trace span with the prompt version that produced it
  • Add a GitHub Actions evaluation gate: block PRs that modify prompts if golden dataset scores drop below baseline
  • Implement canary deployment at 5% traffic for any non-patch prompt change
  • Monitor quality and cost metrics separately for canary vs control traffic for 48 hours before full rollout
  • Document the rollback procedure: label reassignment with 5-minute cache TTL expiry

Outcome: Prompt changes are tracked, tested, and reversible. The “it worked yesterday” problem has an audit trail.
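
The canary rule above is mechanical enough to encode directly: compare the semantic versions, and only a patch-level bump skips the canary. A minimal sketch, assuming the three-part `major.minor.patch` format from Part 5; the function name and return shape are illustrative.

```python
# Sketch: decide direct rollout vs 5% canary from a semver bump.

def rollout_plan(old_version: str, new_version: str) -> dict:
    old = tuple(int(p) for p in old_version.split("."))
    new = tuple(int(p) for p in new_version.split("."))
    is_patch = new[:2] == old[:2]  # major and minor unchanged
    return {
        "strategy": "direct" if is_patch else "canary",
        "canary_traffic_pct": 0 if is_patch else 5,
    }
```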

Phase 5: RAG Observability (Week 5-8) — Fix the Retrieval Layer

  • Instrument every RAG stage as a child span: embed query, vector search, rerank, assemble context, generate
  • Add retrieval quality attributes: score spread, docs after filter, context token ratio, truncation flag
  • Implement embedding model version guard — alert on version mismatch between index and query embeddings
  • Add RAG-specific Grafana panels: context precision trend, zero-doc retrieval rate, score spread distribution
  • Run a chunking strategy audit: test fixed vs semantic chunking on your golden dataset and measure faithfulness delta
  • Set alerts: context precision below 0.6 sustained for 30 minutes; zero-doc retrieval rate above 1%

Outcome: When a RAG response is wrong you can identify which pipeline stage failed within one trace inspection rather than hours of debugging.
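
As a rough sketch of the quality-signal bullets, the function below derives the retrieval attributes from the similarity scores and context size. The `rag.*` attribute names follow this series' own convention rather than any standard, and the embedding version guard is a plain string comparison.

```python
# Sketch: retrieval quality attributes for the vector-search span, plus
# the embedding version guard from Part 6.

def retrieval_quality_attributes(
    scores: list[float],
    context_tokens: int,
    budget_tokens: int,
) -> dict:
    spread = (max(scores) - min(scores)) if scores else 0.0
    return {
        "rag.search.doc_count": len(scores),
        "rag.search.top_score": max(scores, default=0.0),
        "rag.search.score_spread": round(spread, 4),
        "rag.context.token_ratio": round(context_tokens / budget_tokens, 4),
        "rag.context.truncated": context_tokens > budget_tokens,
    }

def embedding_version_mismatch(index_model: str, query_model: str) -> bool:
    """Alert when the query embedding model differs from the index's."""
    return index_model != query_model

attrs = retrieval_quality_attributes([0.91, 0.82, 0.44], 3100, 4000)
```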

Phase 6: Continuous Improvement (Ongoing) — The Flywheel

  • Weekly review of human-labeled responses to update the golden dataset with real production failures
  • Monthly judge calibration check: measure judge agreement with human labels, retune prompts if below 75%
  • Quarterly model cost benchmarking: test newer cheaper models on your golden dataset to identify downgrade candidates
  • Track cost-per-business-outcome metrics (cost per resolved ticket, cost per completed task) and report to product teams
  • Build a feedback flywheel: production failures become test cases, test cases improve the judge, the judge catches the next failure earlier

Outcome: The system improves automatically. Quality trends up, cost trends down, and incident response time compresses each quarter.
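
The monthly calibration check is a single ratio. The sketch below counts a judge score as agreeing when it lands within one point of the human label on the 1-5 scale (that tolerance is an assumption, not something the series fixes) and flags the judge for retuning below the 75% threshold.

```python
# Sketch: judge-vs-human agreement rate for the monthly calibration check.
# "Within one point" tolerance is an illustrative choice.

def judge_agreement(judge_scores: list[int], human_scores: list[int]) -> float:
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and equal length")
    matches = sum(
        1 for j, h in zip(judge_scores, human_scores) if abs(j - h) <= 1
    )
    return matches / len(judge_scores)

def needs_retuning(agreement: float, threshold: float = 0.75) -> bool:
    return agreement < threshold
```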

The Five Questions Your Stack Should Answer

A production LLMOps stack is not a collection of tools. It is the ability to answer operational questions fast enough to act on them. Here is the final test for whether your stack is actually production-grade:

| Question | Where you find the answer | Acceptable response time |
|---|---|---|
| Is the system healthy right now? | Grafana: latency, error rate, request rate panels | Under 30 seconds |
| Which responses are low quality today? | Langfuse: filter traces by faithfulness score below 0.6 | Under 2 minutes |
| What changed that caused this regression? | Langfuse: correlate quality drop timestamp with prompt version history | Under 5 minutes |
| Which feature is driving this cost spike? | Grafana: cost by feature panel, anomaly alert context | Under 1 minute |
| Which RAG stage failed on this specific bad response? | Jaeger or Langfuse: open the trace, inspect child span attributes | Under 3 minutes |

What the Series Built

Looking back across all eight parts, the series built one coherent system from the ground up:

  • Part 1 — Established why LLMOps is a distinct discipline from MLOps, with a different failure model, a different cost structure, and five observability pillars that traditional monitoring does not cover
  • Part 2 — Built distributed tracing with OpenTelemetry across Node.js, Python, and C#, using GenAI semantic conventions so your traces are vendor-portable and dashboard-ready from day one
  • Part 3 — Defined the metrics that actually matter: TTFT vs total latency, finish reason distributions, token cost tracking, output drift signals, and Prometheus implementations with Grafana PromQL and alert runbooks
  • Part 4 — Built the LLM-as-judge evaluation pipeline with async sampling, human review routing, calibrated judge prompts, and a golden dataset feedback loop that makes the system self-improving
  • Part 5 — Moved prompts out of source code into a versioned registry with semantic versioning, CI/CD evaluation gates, canary deployment, and no-redeploy rollback via label reassignment
  • Part 6 — Instrumented every RAG stage as a child span with retrieval quality attributes, embedding version guards, and Grafana panels for context precision, score spread, and truncation rate
  • Part 7 — Built cost governance with per-feature and per-tenant attribution, Redis-backed budget enforcement, spend anomaly detection, and the three cost reduction levers: model routing, semantic caching, and prompt compression
  • Part 8 — Assembled the complete reference architecture, the tool selection matrix, a local Docker Compose stack, and a six-phase implementation checklist ordered by return on investment

Where to Go Next

The stack described in this series covers the core of production LLM observability. From here, three directions are worth exploring depending on your application profile.

If you are running agentic systems, the next frontier is multi-agent trace correlation — connecting spans across agent boundaries so you can trace a task from the orchestrator through every tool call and sub-agent back to the original user intent. OpenTelemetry context propagation handles this, but the tooling for visualizing multi-agent graphs is still maturing.

If you are operating in regulated industries, the priority is compliance instrumentation: immutable audit logs of every model input and output, data residency controls on trace storage, and RBAC policies that prevent unauthorized prompt changes from reaching production. Langfuse self-hosted and Opik both support this deployment pattern.

If you are optimizing for cost at scale, the next step after semantic caching is inference optimization: model quantization for self-hosted models, prompt caching on supported APIs (Anthropic and Bedrock both support prefix caching), and distillation of frontier model outputs into smaller task-specific models for your highest-volume features.

Key Takeaways

  • Start with tracing and cost attribution — these two layers give you the most operational insight per hour of implementation effort
  • Add quality signals before you add more features — knowing the quality distribution of your production responses is the prerequisite for every improvement decision that follows
  • The five questions in the table above are your acceptance criteria — your stack is production-grade when you can answer all five in the time windows shown
  • Every layer described in this series has a working open-source implementation that you can self-host — vendor lock-in is optional, not required
  • The feedback flywheel is the end goal: production failures become test cases, test cases improve evaluation, evaluation catches the next failure earlier, and the system compounds in quality over time
