Building a Complete LLMOps Stack: From Zero to Production-Grade Observability

Over the past seven posts we built each layer of LLM observability separately: distributed tracing with OpenTelemetry, the metrics that actually matter, LLM-as-judge evaluation pipelines, prompt versioning with quality gates, RAG pipeline instrumentation, and cost governance with per-tenant budget enforcement. Each layer is useful on its own. Together they form a coherent operational system — the kind that lets a team confidently say they know what their LLM application is doing in production, why it sometimes goes wrong, and how to fix it faster than the next incident.

This final post does three things. It assembles the complete reference architecture showing how every layer connects. It provides a phased implementation checklist for teams starting from zero, ordered by return on investment. And it gives you the Docker Compose configuration to stand up the full open-source observability stack locally in under ten minutes.

The Complete LLMOps Stack: Reference Architecture

flowchart TD
    subgraph APP["Application Layer"]
        A1[LLM Chat Handler]
        A2[RAG Pipeline]
        A3[Agentic Workflows]
    end

    subgraph PROMPT["Prompt Management Layer\n(Part 5)"]
        P1[Langfuse Prompt Registry\nVersioned + Immutable]
        P2[Canary Deployment\nLabel-based routing]
        P3[CI/CD Eval Gate\nGolden dataset tests]
    end

    subgraph INST["Instrumentation Layer\n(Parts 2-3)"]
        I1[OpenTelemetry SDK\nNode.js / Python / C#]
        I2[Span Attributes\nGenAI semantic conventions]
        I3[Prometheus Metrics\nLatency, tokens, cost, quality]
    end

    subgraph EVAL["Evaluation Layer\n(Part 4)"]
        E1[LLM-as-Judge\nFaithfulness, Relevance]
        E2[Human Review Queue\nLow-score routing]
        E3[Golden Dataset\nCalibration + regression]
    end

    subgraph RAG["RAG Observability\n(Part 6)"]
        R1[Per-stage Spans\nEmbed, Search, Rank, Assemble]
        R2[Quality Signals\nScore spread, precision, recall]
        R3[Embedding Version Guard\nMismatch detection]
    end

    subgraph COST["Cost Governance\n(Part 7)"]
        C1[Cost Middleware\nPer-feature, per-tenant]
        C2[Redis Budget Enforcement\nSoft + hard limits]
        C3[Anomaly Detection\n2.5x baseline alert]
    end

    subgraph OBS["Observability Backend"]
        O1[OpenTelemetry Collector\nOTLP receiver]
        O2[Langfuse\nTraces + evals + prompts]
        O3[Prometheus\nMetrics store]
        O4[Grafana\nDashboards + alerts]
    end

    APP --> PROMPT
    APP --> INST
    APP --> COST
    INST --> RAG
    INST --> EVAL
    INST --> O1
    COST --> O3
    EVAL --> O2
    RAG --> O1
    O1 --> O2
    O1 --> O3
    O3 --> O4
    O2 --> O4

    PROMPT --> P3
    P3 --> E1

    style APP fill:#0d1b2e,color:#ffffff
    style OBS fill:#1a2d1a,color:#ffffff
    style COST fill:#2d1a0d,color:#ffffff

Tool Selection Reference

There is no single correct LLMOps stack. The tools below represent the open-source default choices used throughout this series. Enterprise alternatives exist at each layer for teams with stricter data residency, SSO, or compliance requirements.

| Layer | Open-source default | Enterprise alternative | Purpose |
|---|---|---|---|
| Distributed tracing | OpenTelemetry SDK + Jaeger | Datadog, Dynatrace | Span-level visibility across every request |
| LLM observability | Langfuse (self-hosted) | Arize AI, LangSmith | Trace storage, prompt management, eval scoring |
| Metrics collection | Prometheus | Azure Monitor, Datadog | Time-series metrics for latency, cost, quality |
| Dashboards + alerts | Grafana | Grafana Cloud, Datadog | Visualization and threshold alerts |
| LLM gateway | LiteLLM | Azure APIM, AWS Bedrock | Model routing, rate limiting, cost tagging |
| Prompt registry | Langfuse prompts | PromptLayer, LangSmith Hub | Versioned prompt storage with label-based routing |
| Evaluation | Custom judge + Langfuse | Braintrust, Maxim AI | Automated quality scoring on live traffic |
| Vector store | Qdrant | Pinecone, Azure AI Search | Semantic retrieval for RAG pipelines |
| Cache | Redis | Azure Cache for Redis | Budget counters, prompt cache, semantic cache |
| Queue (human review) | Redis Streams | AWS SQS, Azure Service Bus | Async routing of low-score responses to reviewers |

Local Stack: Docker Compose

The following configuration starts the full observability backend locally. It runs Langfuse for traces and prompt management, Prometheus for metrics, Grafana for dashboards, Jaeger for trace visualization, and Redis for budget enforcement and caching.

# docker-compose.yml
# Run: docker compose up -d
# Grafana:  http://localhost:3000  (admin / admin)
# Langfuse: http://localhost:3001
# Jaeger:   http://localhost:16686
# Prometheus: http://localhost:9090

services:

  # ── Langfuse: LLM traces, prompt management, eval scores ──────────────────
  langfuse-db:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: langfuse
      POSTGRES_DB: langfuse
    volumes:
      - langfuse_pg:/var/lib/postgresql/data

  langfuse:
    # Pin a specific version in production. Note: Langfuse v3+ also requires
    # ClickHouse, Redis, and object-storage sidecars — see the Langfuse
    # self-hosting docs for the full compose file.
    image: langfuse/langfuse:latest
    depends_on: [langfuse-db]
    ports: ["3001:3000"]
    environment:
      DATABASE_URL: postgresql://langfuse:langfuse@langfuse-db:5432/langfuse
      NEXTAUTH_URL: http://localhost:3001
      NEXTAUTH_SECRET: change-me-in-production
      SALT: change-me-in-production
      TELEMETRY_ENABLED: "false"

  # ── Prometheus: metrics collection ────────────────────────────────────────
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d

  # ── Grafana: dashboards and alerts ────────────────────────────────────────
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    depends_on: [prometheus]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning

  # ── Jaeger: distributed trace visualization ───────────────────────────────
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
    environment:
      COLLECTOR_OTLP_ENABLED: "true"

  # ── OpenTelemetry Collector: fan-out to Langfuse + Jaeger ─────────────────
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4319:4317"     # OTLP gRPC (app → collector)
      - "4320:4318"     # OTLP HTTP (app → collector)
    volumes:
      - ./otel-collector.yml:/etc/otelcol-contrib/config.yaml
    depends_on: [jaeger, langfuse]

  # ── Redis: budget enforcement, semantic cache, review queue ───────────────
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    volumes:
      - redis_data:/data

volumes:
  langfuse_pg:
  prometheus_data:
  grafana_data:
  redis_data:

# otel-collector.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  # Forward traces to Jaeger for visualization
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Forward traces to the self-hosted Langfuse via OTLP/HTTP.
  # Langfuse accepts OTLP over HTTP only; for Langfuse Cloud, point the
  # endpoint at https://cloud.langfuse.com/api/public/otel instead.
  otlphttp/langfuse:
    endpoint: http://langfuse:3000/api/public/otel
    headers:
      Authorization: "Basic ${LANGFUSE_AUTH_HEADER}"

  # Expose Prometheus metrics scrape endpoint
  prometheus:
    endpoint: "0.0.0.0:8889"

processors:
  batch:
    timeout: 5s
    send_batch_size: 512

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlphttp/langfuse]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8889"]

  - job_name: "llm-application"
    static_configs:
      - targets: ["host.docker.internal:8080"]  # your app's /metrics endpoint

Phased Implementation Checklist

Implementing every layer at once is neither practical nor necessary. The phases below are ordered by return on investment — each phase gives you meaningful operational visibility before the next one adds depth.

Phase 1: Foundation (Week 1-2) — Stop Flying Blind

  • Install OpenTelemetry SDK in your application (Node.js, Python, or C#)
  • Instrument every LLM call with GenAI semantic convention attributes: model, input tokens, output tokens, finish reason, TTFT
  • Start the Docker Compose stack above locally
  • Verify traces appear in Jaeger and Langfuse
  • Add basic Prometheus metrics: request count, latency histogram, token usage counters
  • Build a minimal Grafana dashboard: request rate, p95 latency, error rate, token usage by model
  • Set one alert: p95 latency above 5 seconds for 10 minutes

Outcome: You can answer “is the system healthy right now?” and “how much did we spend on tokens today?”
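
To make the instrumentation bullet concrete, here is a minimal Python sketch of the attributes set on each LLM-call span. The `gen_ai.*` attribute names follow the OpenTelemetry GenAI semantic conventions; the TTFT attribute name is illustrative, since TTFT has no standardized span attribute yet. A plain dict stands in for `span.set_attribute` calls to keep the example self-contained.

```python
# Sketch: build the GenAI semantic-convention attributes for one LLM call.
# Real instrumentation would set these on the active OpenTelemetry span.

def genai_span_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reason: str,
    ttft_ms: float,
) -> dict:
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": [finish_reason],
        "llm.time_to_first_token_ms": ttft_ms,  # custom attribute, not semconv
    }

attrs = genai_span_attributes("gpt-4o-mini", 812, 304, "stop", 420.0)
```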

Phase 2: Cost Governance (Week 2-3) — Control the Bill

  • Add cost middleware from Part 7: tag every API call with feature and tenant labels
  • Implement Redis-backed rolling spend counters with soft and hard daily budget limits
  • Add cost panels to Grafana: hourly cost by feature, daily cost by tenant
  • Set budget alerts: 80% of daily hard limit reached
  • Run spend anomaly detection as a background job every 5 minutes
  • Identify your top 3 cost drivers by feature — these are your first optimization targets

Outcome: No more end-of-month billing surprises. Runaway agents get caught within one anomaly detection cycle.
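
The budget check from Part 7 reduces to a counter plus two thresholds. The sketch below uses an in-memory dict where production code would use Redis `INCRBYFLOAT` on a per-tenant daily key; the limit values, key shapes, and function names here are illustrative, while the 2.5x anomaly multiplier matches the threshold used in this series.

```python
# Sketch of the per-tenant budget check; a plain dict stands in for Redis
# counters (production: INCRBYFLOAT on a key like spend:{tenant}:{date}).

DAILY_SOFT_LIMIT = 40.0   # USD: log a warning, keep serving
DAILY_HARD_LIMIT = 50.0   # USD: reject further LLM calls

spend: dict[str, float] = {}  # stand-in for Redis counters

def record_and_check(tenant: str, cost_usd: float) -> str:
    """Record one call's cost, return 'ok', 'soft_limit', or 'hard_limit'."""
    total = spend.get(tenant, 0.0) + cost_usd
    spend[tenant] = total
    if total >= DAILY_HARD_LIMIT:
        return "hard_limit"
    if total >= DAILY_SOFT_LIMIT:
        return "soft_limit"
    return "ok"

def is_spend_anomaly(current_hourly: float, baseline_hourly: float) -> bool:
    """Flag spend above 2.5x the trailing baseline (Part 7's threshold)."""
    return baseline_hourly > 0 and current_hourly > 2.5 * baseline_hourly
```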

Phase 3: Quality Signals (Week 3-5) — Know When It Goes Wrong

  • Build a faithfulness judge prompt calibrated to your domain (Part 4)
  • Implement async evaluation sampling at 5% of production traffic
  • Route low-scoring responses (score 1-2) to a human review queue
  • Add quality metrics to Prometheus: faithfulness score histogram, relevance score histogram
  • Add quality panels to Grafana alongside latency and cost panels
  • Collect 50 to 100 human-labeled examples in your first two weeks — this is your initial golden dataset
  • Set a quality alert: average faithfulness below 0.70 over a 1-hour window

Outcome: You know the quality distribution of your production responses and have a baseline to detect regressions against.
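
The sampling and routing bullets can be sketched in a few lines. Hashing the trace ID gives a deterministic ~5% sample (the same trace never flips between sampled and unsampled on retries), and a plain list stands in for the Redis Streams review queue; all names here are illustrative.

```python
# Sketch: deterministic 5% eval sampling plus low-score routing.
import hashlib

SAMPLE_PERCENT = 5
review_queue: list[dict] = []  # stand-in for a Redis Stream

def should_evaluate(trace_id: str) -> bool:
    """Hash-based sampling: the same trace always gets the same decision."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < SAMPLE_PERCENT

def route_by_score(trace_id: str, faithfulness: int) -> None:
    """Scores on a 1-5 scale; 1 and 2 go to the human review queue."""
    if faithfulness <= 2:
        review_queue.append({"trace_id": trace_id, "score": faithfulness})
```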

Phase 4: Prompt Governance (Week 4-6) — Safe Iteration

  • Move all prompts out of source code into the Langfuse prompt registry
  • Tag every trace span with the prompt version that produced it
  • Add a GitHub Actions evaluation gate: block PRs that modify prompts if golden dataset scores drop below baseline
  • Implement canary deployment at 5% traffic for any non-patch prompt change
  • Monitor quality and cost metrics separately for canary vs control traffic for 48 hours before full rollout
  • Document the rollback procedure: label reassignment with 5-minute cache TTL expiry

Outcome: Prompt changes are tracked, tested, and reversible. The “it worked yesterday” problem has an audit trail.
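
The canary rule above is mechanical enough to encode directly: compare the semantic versions, and only a patch-level bump skips the canary. A minimal sketch, assuming the three-part `major.minor.patch` format from Part 5; the function name and return shape are illustrative.

```python
# Sketch: decide direct rollout vs 5% canary from a semver bump.

def rollout_plan(old_version: str, new_version: str) -> dict:
    old = tuple(int(p) for p in old_version.split("."))
    new = tuple(int(p) for p in new_version.split("."))
    is_patch = new[:2] == old[:2]  # major and minor unchanged
    return {
        "strategy": "direct" if is_patch else "canary",
        "canary_traffic_pct": 0 if is_patch else 5,
    }
```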

Phase 5: RAG Observability (Week 5-8) — Fix the Retrieval Layer

  • Instrument every RAG stage as a child span: embed query, vector search, rerank, assemble context, generate
  • Add retrieval quality attributes: score spread, docs after filter, context token ratio, truncation flag
  • Implement embedding model version guard — alert on version mismatch between index and query embeddings
  • Add RAG-specific Grafana panels: context precision trend, zero-doc retrieval rate, score spread distribution
  • Run a chunking strategy audit: test fixed vs semantic chunking on your golden dataset and measure faithfulness delta
  • Set alerts: context precision below 0.6 sustained for 30 minutes; zero-doc retrieval rate above 1%

Outcome: When a RAG response is wrong you can identify which pipeline stage failed within one trace inspection rather than hours of debugging.
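
As a rough sketch of the quality-signal bullets, the function below derives the retrieval attributes from the similarity scores and context size. The `rag.*` attribute names follow this series' own convention rather than any standard, and the embedding version guard is a plain string comparison.

```python
# Sketch: retrieval quality attributes for the vector-search span, plus
# the embedding version guard from Part 6.

def retrieval_quality_attributes(
    scores: list[float],
    context_tokens: int,
    budget_tokens: int,
) -> dict:
    spread = (max(scores) - min(scores)) if scores else 0.0
    return {
        "rag.search.doc_count": len(scores),
        "rag.search.top_score": max(scores, default=0.0),
        "rag.search.score_spread": round(spread, 4),
        "rag.context.token_ratio": round(context_tokens / budget_tokens, 4),
        "rag.context.truncated": context_tokens > budget_tokens,
    }

def embedding_version_mismatch(index_model: str, query_model: str) -> bool:
    """Alert when the query embedding model differs from the index's."""
    return index_model != query_model

attrs = retrieval_quality_attributes([0.91, 0.82, 0.44], 3100, 4000)
```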

Phase 6: Continuous Improvement (Ongoing) — The Flywheel

  • Weekly review of human-labeled responses to update the golden dataset with real production failures
  • Monthly judge calibration check: measure judge agreement with human labels, retune prompts if below 75%
  • Quarterly model cost benchmarking: test newer cheaper models on your golden dataset to identify downgrade candidates
  • Track cost-per-business-outcome metrics (cost per resolved ticket, cost per completed task) and report to product teams
  • Build a feedback flywheel: production failures become test cases, test cases improve the judge, the judge catches the next failure earlier

Outcome: The system improves automatically. Quality trends up, cost trends down, and incident response time compresses each quarter.
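
The monthly calibration check is a single ratio. The sketch below counts a judge score as agreeing when it lands within one point of the human label on the 1-5 scale (that tolerance is an assumption, not something the series fixes) and flags the judge for retuning below the 75% threshold.

```python
# Sketch: judge-vs-human agreement rate for the monthly calibration check.
# "Within one point" tolerance is an illustrative choice.

def judge_agreement(judge_scores: list[int], human_scores: list[int]) -> float:
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and equal length")
    matches = sum(
        1 for j, h in zip(judge_scores, human_scores) if abs(j - h) <= 1
    )
    return matches / len(judge_scores)

def needs_retuning(agreement: float, threshold: float = 0.75) -> bool:
    return agreement < threshold
```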

The Five Questions Your Stack Should Answer

A production LLMOps stack is not a collection of tools. It is the ability to answer operational questions fast enough to act on them. Here is the final test for whether your stack is actually production-grade:

| Question | Where you find the answer | Acceptable response time |
|---|---|---|
| Is the system healthy right now? | Grafana: latency, error rate, request rate panels | Under 30 seconds |
| Which responses are low quality today? | Langfuse: filter traces by faithfulness score below 0.6 | Under 2 minutes |
| What changed that caused this regression? | Langfuse: correlate quality drop timestamp with prompt version history | Under 5 minutes |
| Which feature is driving this cost spike? | Grafana: cost by feature panel, anomaly alert context | Under 1 minute |
| Which RAG stage failed on this specific bad response? | Jaeger or Langfuse: open the trace, inspect child span attributes | Under 3 minutes |

What the Series Built

Looking back across all eight parts, the series built one coherent system from the ground up:

  • Part 1 — Established why LLMOps is a distinct discipline from MLOps, with a different failure model, a different cost structure, and five observability pillars that traditional monitoring does not cover
  • Part 2 — Built distributed tracing with OpenTelemetry across Node.js, Python, and C#, using GenAI semantic conventions so your traces are vendor-portable and dashboard-ready from day one
  • Part 3 — Defined the metrics that actually matter: TTFT vs total latency, finish reason distributions, token cost tracking, output drift signals, and Prometheus implementations with Grafana PromQL and alert runbooks
  • Part 4 — Built the LLM-as-judge evaluation pipeline with async sampling, human review routing, calibrated judge prompts, and a golden dataset feedback loop that makes the system self-improving
  • Part 5 — Moved prompts out of source code into a versioned registry with semantic versioning, CI/CD evaluation gates, canary deployment, and no-redeploy rollback via label reassignment
  • Part 6 — Instrumented every RAG stage as a child span with retrieval quality attributes, embedding version guards, and Grafana panels for context precision, score spread, and truncation rate
  • Part 7 — Built cost governance with per-feature and per-tenant attribution, Redis-backed budget enforcement, spend anomaly detection, and the three cost reduction levers: model routing, semantic caching, and prompt compression
  • Part 8 — Assembled the complete reference architecture, the tool selection matrix, a local Docker Compose stack, and a six-phase implementation checklist ordered by return on investment

Where to Go Next

The stack described in this series covers the core of production LLM observability. From here, three directions are worth exploring depending on your application profile.

If you are running agentic systems, the next frontier is multi-agent trace correlation — connecting spans across agent boundaries so you can trace a task from the orchestrator through every tool call and sub-agent back to the original user intent. OpenTelemetry context propagation handles this, but the tooling for visualizing multi-agent graphs is still maturing.

If you are operating in regulated industries, the priority is compliance instrumentation: immutable audit logs of every model input and output, data residency controls on trace storage, and RBAC policies that prevent unauthorized prompt changes from reaching production. Langfuse self-hosted and Opik both support this deployment pattern.

If you are optimizing for cost at scale, the next step after semantic caching is inference optimization: model quantization for self-hosted models, prompt caching on supported APIs (Anthropic and Bedrock both support prefix caching), and distillation of frontier model outputs into smaller task-specific models for your highest-volume features.

Key Takeaways

  • Start with tracing and cost attribution — these two layers give you the most operational insight per hour of implementation effort
  • Add quality signals before you add more features — knowing the quality distribution of your production responses is the prerequisite for every improvement decision that follows
  • The five questions in the table above are your acceptance criteria — your stack is production-grade when you can answer all five in the time windows shown
  • Every layer described in this series has a working open-source implementation that you can self-host — vendor lock-in is optional, not required
  • The feedback flywheel is the end goal: production failures become test cases, test cases improve evaluation, evaluation catches the next failure earlier, and the system compounds in quality over time
