Over the past seven posts we built each layer of LLM observability separately: distributed tracing with OpenTelemetry, the metrics that actually matter, LLM-as-judge evaluation pipelines, prompt versioning with quality gates, RAG pipeline instrumentation, and cost governance with per-tenant budget enforcement. Each layer is useful on its own. Together they form a coherent operational system — the kind that lets a team confidently say they know what their LLM application is doing in production, why it sometimes goes wrong, and how to fix it faster with every incident.
This final post does three things. It assembles the complete reference architecture showing how every layer connects. It provides a phased implementation checklist for teams starting from zero, ordered by the highest return on investment first. And it gives you the Docker Compose configuration to stand up the full open-source observability stack locally in under ten minutes.
## The Complete LLMOps Stack: Reference Architecture
```mermaid
flowchart TD
    subgraph APP["Application Layer"]
        A1["LLM Chat Handler"]
        A2["RAG Pipeline"]
        A3["Agentic Workflows"]
    end
    subgraph PROMPT["Prompt Management Layer<br/>(Part 5)"]
        P1["Langfuse Prompt Registry<br/>Versioned + Immutable"]
        P2["Canary Deployment<br/>Label-based routing"]
        P3["CI/CD Eval Gate<br/>Golden dataset tests"]
    end
    subgraph INST["Instrumentation Layer<br/>(Parts 2-3)"]
        I1["OpenTelemetry SDK<br/>Node.js / Python / C#"]
        I2["Span Attributes<br/>GenAI semantic conventions"]
        I3["Prometheus Metrics<br/>Latency, tokens, cost, quality"]
    end
    subgraph EVAL["Evaluation Layer<br/>(Part 4)"]
        E1["LLM-as-Judge<br/>Faithfulness, Relevance"]
        E2["Human Review Queue<br/>Low-score routing"]
        E3["Golden Dataset<br/>Calibration + regression"]
    end
    subgraph RAG["RAG Observability<br/>(Part 6)"]
        R1["Per-stage Spans<br/>Embed, Search, Rank, Assemble"]
        R2["Quality Signals<br/>Score spread, precision, recall"]
        R3["Embedding Version Guard<br/>Mismatch detection"]
    end
    subgraph COST["Cost Governance<br/>(Part 7)"]
        C1["Cost Middleware<br/>Per-feature, per-tenant"]
        C2["Redis Budget Enforcement<br/>Soft + hard limits"]
        C3["Anomaly Detection<br/>2.5x baseline alert"]
    end
    subgraph OBS["Observability Backend"]
        O1["OpenTelemetry Collector<br/>OTLP receiver"]
        O2["Langfuse<br/>Traces + evals + prompts"]
        O3["Prometheus<br/>Metrics store"]
        O4["Grafana<br/>Dashboards + alerts"]
    end
    APP --> PROMPT
    APP --> INST
    APP --> COST
    INST --> RAG
    INST --> EVAL
    INST --> O1
    COST --> O3
    EVAL --> O2
    RAG --> O1
    O1 --> O2
    O1 --> O3
    O3 --> O4
    O2 --> O4
    PROMPT --> P3
    P3 --> E1
    style APP fill:#0d1b2e,color:#ffffff
    style OBS fill:#1a2d1a,color:#ffffff
    style COST fill:#2d1a0d,color:#ffffff
```
## Tool Selection Reference
There is no single correct LLMOps stack. The tools below represent the open-source default choices used throughout this series. Enterprise alternatives exist at each layer for teams with stricter data residency, SSO, or compliance requirements.
| Layer | Open-source default | Enterprise alternative | Purpose |
|---|---|---|---|
| Distributed tracing | OpenTelemetry SDK + Jaeger | Datadog, Dynatrace | Span-level visibility across every request |
| LLM observability | Langfuse (self-hosted) | Arize AI, LangSmith | Trace storage, prompt management, eval scoring |
| Metrics collection | Prometheus | Azure Monitor, Datadog | Time-series metrics for latency, cost, quality |
| Dashboards + alerts | Grafana | Grafana Cloud, Datadog | Visualization and threshold alerts |
| LLM gateway | LiteLLM | Azure APIM, AWS Bedrock | Model routing, rate limiting, cost tagging |
| Prompt registry | Langfuse prompts | PromptLayer, LangSmith Hub | Versioned prompt storage with label-based routing |
| Evaluation | Custom judge + Langfuse | Braintrust, Maxim AI | Automated quality scoring on live traffic |
| Vector store | Qdrant | Pinecone, Azure AI Search | Semantic retrieval for RAG pipelines |
| Cache | Redis | Azure Cache for Redis | Budget counters, prompt cache, semantic cache |
| Queue (human review) | Redis Streams | AWS SQS, Azure Service Bus | Async routing of low-score responses to reviewers |
## Local Stack: Docker Compose
The following configuration starts the full observability backend locally. It runs Langfuse for traces and prompt management, Prometheus for metrics, Grafana for dashboards, Jaeger for trace visualization, an OpenTelemetry Collector to fan traces out to both backends, and Redis for budget enforcement and caching.
```yaml
# docker-compose.yml
# Run: docker compose up -d
# Grafana:    http://localhost:3000 (admin / admin)
# Langfuse:   http://localhost:3001
# Jaeger:     http://localhost:16686
# Prometheus: http://localhost:9090
services:
  # ── Langfuse: LLM traces, prompt management, eval scores ──────────────────
  langfuse-db:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: langfuse
      POSTGRES_DB: langfuse
    volumes:
      - langfuse_pg:/var/lib/postgresql/data

  langfuse:
    # Note: Langfuse v3 images additionally require ClickHouse, Redis, and
    # blob storage (see the Langfuse self-hosting docs); pin
    # langfuse/langfuse:2 if you want a Postgres-only setup.
    image: langfuse/langfuse:latest
    depends_on: [langfuse-db]
    ports: ["3001:3000"]
    environment:
      DATABASE_URL: postgresql://langfuse:langfuse@langfuse-db:5432/langfuse
      NEXTAUTH_URL: http://localhost:3001
      NEXTAUTH_SECRET: change-me-in-production
      SALT: change-me-in-production
      TELEMETRY_ENABLED: "false"

  # ── Prometheus: metrics collection ────────────────────────────────────────
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d

  # ── Grafana: dashboards and alerts ────────────────────────────────────────
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    depends_on: [prometheus]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning

  # ── Jaeger: distributed trace visualization ───────────────────────────────
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
    environment:
      COLLECTOR_OTLP_ENABLED: "true"

  # ── OpenTelemetry Collector: fan-out to Langfuse + Jaeger ─────────────────
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4319:4317"   # OTLP gRPC (app → collector)
      - "4320:4318"   # OTLP HTTP (app → collector)
    volumes:
      - ./otel-collector.yml:/etc/otelcol-contrib/config.yaml
    depends_on: [jaeger, langfuse]

  # ── Redis: budget enforcement, semantic cache, review queue ───────────────
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    volumes:
      - redis_data:/data

volumes:
  langfuse_pg:
  prometheus_data:
  grafana_data:
  redis_data:
```
```yaml
# otel-collector.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  # Forward traces to Jaeger for visualization
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # Forward traces to the local Langfuse container via OTLP over HTTP
  # (requires Langfuse v3; for Langfuse Cloud use
  # https://cloud.langfuse.com/api/public/otel instead)
  otlphttp/langfuse:
    endpoint: http://langfuse:3000/api/public/otel
    headers:
      Authorization: "Basic ${LANGFUSE_AUTH_HEADER}"
  # Expose a Prometheus metrics scrape endpoint
  prometheus:
    endpoint: "0.0.0.0:8889"

processors:
  batch:
    timeout: 5s
    send_batch_size: 512

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlphttp/langfuse]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8889"]
  - job_name: "llm-application"
    static_configs:
      - targets: ["host.docker.internal:8080"]   # your app's /metrics endpoint
```
## Phased Implementation Checklist
Implementing every layer at once is neither practical nor necessary. The phases below are ordered by return on investment — each phase gives you meaningful operational visibility before the next one adds depth.
### Phase 1: Foundation (Weeks 1-2) — Stop Flying Blind
- Install OpenTelemetry SDK in your application (Node.js, Python, or C#)
- Instrument every LLM call with GenAI semantic convention attributes: model, input tokens, output tokens, finish reason, TTFT
- Start the Docker Compose stack above locally
- Verify traces appear in Jaeger and Langfuse
- Add basic Prometheus metrics: request count, latency histogram, token usage counters
- Build a minimal Grafana dashboard: request rate, p95 latency, error rate, token usage by model
- Set one alert: p95 latency above 5 seconds for 10 minutes
**Outcome:** You can answer “is the system healthy right now?” and “how much did we spend on tokens today?”
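The attribute names in the Phase 1 checklist come from the OpenTelemetry GenAI semantic conventions. As a stdlib-only sketch, here is the attribute map your instrumentation should attach to each LLM span; in a real service these would be set on the active span via the OTel SDK, and the token counts and TTFT value below are illustrative:

```python
def genai_span_attributes(model: str, input_tokens: int, output_tokens: int,
                          finish_reason: str, ttft_ms: float) -> dict:
    """Build the span attribute map for a single LLM call."""
    return {
        # Stable GenAI semantic convention attribute names
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": [finish_reason],
        # TTFT has no stable convention attribute yet; a custom
        # namespaced key is one common workaround.
        "llm.time_to_first_token_ms": ttft_ms,
    }

attrs = genai_span_attributes("gpt-4o-mini", 812, 164, "stop", 420.0)
```

Emitting these exact names (rather than ad-hoc keys) is what keeps traces vendor-portable across Jaeger, Langfuse, and commercial backends.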
### Phase 2: Cost Governance (Weeks 2-3) — Control the Bill
- Add cost middleware from Part 7: tag every API call with feature and tenant labels
- Implement Redis-backed rolling spend counters with soft and hard daily budget limits
- Add cost panels to Grafana: hourly cost by feature, daily cost by tenant
- Set budget alerts: 80% of daily hard limit reached
- Run spend anomaly detection as a background job every 5 minutes
- Identify your top 3 cost drivers by feature — these are your first optimization targets
**Outcome:** No more end-of-month billing surprises. Runaway agents get caught within one anomaly detection cycle.
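The soft/hard limit logic from the Phase 2 checklist fits in a few lines. This sketch swaps Redis for an in-memory dict so it runs standalone; in production the counter would be an `INCRBYFLOAT` on a key like `spend:{tenant}:{YYYYMMDD}` with an expiry, and the dollar limits shown are examples:

```python
from datetime import datetime, timezone

SOFT_LIMIT_USD = 40.0   # warn and alert, keep serving
HARD_LIMIT_USD = 50.0   # reject further LLM calls

_counters: dict[str, float] = {}   # stand-in for Redis

def record_and_check(tenant: str, cost_usd: float) -> str:
    """Add this call's cost to the tenant's daily counter, return a verdict."""
    key = f"spend:{tenant}:{datetime.now(timezone.utc):%Y%m%d}"
    _counters[key] = _counters.get(key, 0.0) + cost_usd   # INCRBYFLOAT in Redis
    total = _counters[key]
    if total >= HARD_LIMIT_USD:
        return "reject"
    if total >= SOFT_LIMIT_USD:
        return "warn"
    return "ok"
```

The middleware calls this after computing each request's cost; a `"reject"` verdict short-circuits the LLM call, while `"warn"` only fires an alert.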
### Phase 3: Quality Signals (Weeks 3-5) — Know When It Goes Wrong
- Build a faithfulness judge prompt calibrated to your domain (Part 4)
- Implement async evaluation sampling at 5% of production traffic
- Route low-scoring responses (score 1-2) to a human review queue
- Add quality metrics to Prometheus: faithfulness score histogram, relevance score histogram
- Add quality panels to Grafana alongside latency and cost panels
- Collect 50 to 100 human-labeled examples in your first two weeks — this is your initial golden dataset
- Set a quality alert: average faithfulness below 0.70 over a 1-hour window
**Outcome:** You know the quality distribution of your production responses and have a baseline to detect regressions against.
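Two details of Phase 3 are easy to get wrong: the 5% sample should be deterministic per trace (so retries don't flip in and out of the sample), and the review-queue rule should key off the judge's score band. A stdlib sketch, assuming a 1-5 judge scale as in Part 4:

```python
import hashlib

SAMPLE_RATE = 0.05   # evaluate 5% of production traffic

def sampled_for_eval(trace_id: str) -> bool:
    """Deterministic hash-based sampling: same trace id, same decision."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < SAMPLE_RATE * 10_000

def should_review(faithfulness_score: int) -> bool:
    """Scores of 1-2 on the 1-5 judge scale route to the human review queue."""
    return faithfulness_score <= 2
```

Hashing the trace id also means the sample is reproducible offline: you can re-run the judge over exactly the set of traces that were sampled in production.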
### Phase 4: Prompt Governance (Weeks 4-6) — Safe Iteration
- Move all prompts out of source code into the Langfuse prompt registry
- Tag every trace span with the prompt version that produced it
- Add a GitHub Actions evaluation gate: block PRs that modify prompts if golden dataset scores drop below baseline
- Implement canary deployment at 5% traffic for any non-patch prompt change
- Monitor quality and cost metrics separately for canary vs control traffic for 48 hours before full rollout
- Document the rollback procedure: label reassignment with 5-minute cache TTL expiry
**Outcome:** Prompt changes are tracked, tested, and reversible. The “it worked yesterday” problem has an audit trail.
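The mechanics behind label-based routing and no-redeploy rollback are simple enough to sketch with a dict standing in for the Langfuse registry (Langfuse itself exposes this via its `get_prompt(name, label=...)` SDK call). The prompt name, versions, and texts below are hypothetical:

```python
# Versioned, immutable prompt bodies keyed by (name, version)
registry = {
    ("summarize", "v3"): "You are a concise summarizer ...",
    ("summarize", "v4"): "You are a concise, citation-grounded summarizer ...",
}
# Mutable labels pointing at versions; routing reads these
labels = {
    ("summarize", "production"): "v4",
    ("summarize", "canary"): "v4",
}

def get_prompt(name: str, label: str = "production") -> tuple[str, str]:
    """Resolve a label to (version, text); tag the trace span with `version`."""
    version = labels[(name, label)]
    return version, registry[(name, version)]

def rollback(name: str, to_version: str) -> None:
    """No-redeploy rollback: reassign the production label.

    Takes effect for all instances once the prompt cache TTL expires."""
    labels[(name, "production")] = to_version
```

Because prompt bodies are immutable and only labels move, every trace tagged with a version is reproducible forever, and rollback never touches application code.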
### Phase 5: RAG Observability (Weeks 5-8) — Fix the Retrieval Layer
- Instrument every RAG stage as a child span: embed query, vector search, rerank, assemble context, generate
- Add retrieval quality attributes: score spread, docs after filter, context token ratio, truncation flag
- Implement embedding model version guard — alert on version mismatch between index and query embeddings
- Add RAG-specific Grafana panels: context precision trend, zero-doc retrieval rate, score spread distribution
- Run a chunking strategy audit: test fixed vs semantic chunking on your golden dataset and measure faithfulness delta
- Set alerts: context precision below 0.6 sustained for 30 minutes; zero-doc retrieval rate above 1%
**Outcome:** When a RAG response is wrong, you can identify which pipeline stage failed with one trace inspection rather than hours of debugging.
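The retrieval quality attributes from the Phase 5 checklist are cheap to compute at search time. A sketch of the attributes attached to the vector-search span, assuming similarity scores sorted descending and an illustrative 0.65 cutoff (attribute names here are custom, not a standard convention):

```python
def retrieval_attributes(scores: list[float], min_score: float = 0.65) -> dict:
    """Compute span attributes from descending-sorted similarity scores."""
    kept = [s for s in scores if s >= min_score]
    return {
        "retrieval.docs_returned": len(scores),
        "retrieval.docs_after_filter": len(kept),
        # Zero kept docs means the model will answer without grounding
        "retrieval.zero_docs": len(kept) == 0,
        # Low spread suggests the query failed to discriminate between chunks
        "retrieval.score_spread": round(kept[0] - kept[-1], 4) if kept else 0.0,
    }
```

These four numbers are what make the "zero-doc retrieval rate above 1%" and score spread panels in the checklist possible without re-running any retrieval offline.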
### Phase 6: Continuous Improvement (Ongoing) — The Flywheel
- Weekly review of human-labeled responses to update the golden dataset with real production failures
- Monthly judge calibration check: measure judge agreement with human labels, retune prompts if below 75%
- Quarterly model cost benchmarking: test newer cheaper models on your golden dataset to identify downgrade candidates
- Track cost-per-business-outcome metrics (cost per resolved ticket, cost per completed task) and report to product teams
- Build a feedback flywheel: production failures become test cases, test cases improve the judge, the judge catches the next failure earlier
**Outcome:** The system improves automatically. Quality trends up, cost trends down, and incident response time compresses each quarter.
## The Five Questions Your Stack Should Answer
A production LLMOps stack is not a collection of tools. It is the ability to answer operational questions fast enough to act on them. Here is the final test for whether your stack is actually production-grade:
| Question | Where you find the answer | Acceptable response time |
|---|---|---|
| Is the system healthy right now? | Grafana: latency, error rate, request rate panels | Under 30 seconds |
| Which responses are low quality today? | Langfuse: filter traces by faithfulness score below 0.6 | Under 2 minutes |
| What changed that caused this regression? | Langfuse: correlate quality drop timestamp with prompt version history | Under 5 minutes |
| Which feature is driving this cost spike? | Grafana: cost by feature panel, anomaly alert context | Under 1 minute |
| Which RAG stage failed on this specific bad response? | Jaeger or Langfuse: open the trace, inspect child span attributes | Under 3 minutes |
## What the Series Built
Looking back across all eight parts, the series built one coherent system from the ground up:
- Part 1 — Established why LLMOps is a distinct discipline from MLOps, with a different failure model, a different cost structure, and five observability pillars that traditional monitoring does not cover
- Part 2 — Built distributed tracing with OpenTelemetry across Node.js, Python, and C#, using GenAI semantic conventions so your traces are vendor-portable and dashboard-ready from day one
- Part 3 — Defined the metrics that actually matter: TTFT vs total latency, finish reason distributions, token cost tracking, output drift signals, and Prometheus implementations with Grafana PromQL and alert runbooks
- Part 4 — Built the LLM-as-judge evaluation pipeline with async sampling, human review routing, calibrated judge prompts, and a golden dataset feedback loop that makes the system self-improving
- Part 5 — Moved prompts out of source code into a versioned registry with semantic versioning, CI/CD evaluation gates, canary deployment, and no-redeploy rollback via label reassignment
- Part 6 — Instrumented every RAG stage as a child span with retrieval quality attributes, embedding version guards, and Grafana panels for context precision, score spread, and truncation rate
- Part 7 — Built cost governance with per-feature and per-tenant attribution, Redis-backed budget enforcement, spend anomaly detection, and the three cost reduction levers: model routing, semantic caching, and prompt compression
- Part 8 — Assembled the complete reference architecture, the tool selection matrix, a local Docker Compose stack, and a six-phase implementation checklist ordered by return on investment
## Where to Go Next
The stack described in this series covers the core of production LLM observability. From here, three directions are worth exploring depending on your application profile.
If you are running agentic systems, the next frontier is multi-agent trace correlation — connecting spans across agent boundaries so you can trace a task from the orchestrator through every tool call and sub-agent back to the original user intent. OpenTelemetry context propagation handles this, but the tooling for visualizing multi-agent graphs is still maturing.
If you are operating in regulated industries, the priority is compliance instrumentation: immutable audit logs of every model input and output, data residency controls on trace storage, and RBAC policies that prevent unauthorized prompt changes from reaching production. Langfuse self-hosted and Opik both support this deployment pattern.
If you are optimizing for cost at scale, the next step after semantic caching is inference optimization: model quantization for self-hosted models, prompt caching on supported APIs (Anthropic and Bedrock both support prefix caching), and distillation of frontier model outputs into smaller task-specific models for your highest-volume features.
## Key Takeaways
- Start with tracing and cost attribution — these two layers give you the most operational insight per hour of implementation effort
- Add quality signals before you add more features — knowing the quality distribution of your production responses is the prerequisite for every improvement decision that follows
- The five questions in the table above are your acceptance criteria — your stack is production-grade when you can answer all five in the time windows shown
- Every layer described in this series has a working open-source implementation that you can self-host — vendor lock-in is optional, not required
- The feedback flywheel is the end goal: production failures become test cases, test cases improve evaluation, evaluation catches the next failure earlier, and the system compounds in quality over time
## References
- Open Source LLMOps Stack – LiteLLM and Langfuse reference implementation (https://oss-llmops-stack.com/)
- Fractal Analytics – “LLMOps for Enterprise Generative AI: Architecture, Observability, and Scalable AI Operations” (https://fractal.ai/blog/enterprise-llmops-architecture)
- Comet – “LLMOps Guide: From Prototype to Production” (https://www.comet.com/site/blog/llmops/)
- TrueFoundry – “LLMOps Architecture: A Detailed Explanation” (https://www.truefoundry.com/blog/llmops-architecture)
- Redis – “LLMOps Guide 2026: Build Fast, Cost-Effective LLM Apps” (https://redis.io/blog/large-language-model-operations-guide/)
- OneReach – “LLMOps for AI Agents: Monitoring, Testing and Iteration in Production” (https://onereach.ai/blog/llmops-for-ai-agents-in-production/)
