Production observability transforms from an implementation detail into a business-critical capability as applications scale. Development environments can tolerate collecting every trace and logging every event, but production systems processing millions of requests daily require strategic data collection, intelligent sampling, proactive alerting, and cost-conscious telemetry management. OpenTelemetry with Azure Monitor provides the flexibility to implement sophisticated production monitoring patterns that balance comprehensive visibility with operational efficiency.
This final article in the series explores production-grade observability patterns, demonstrating how to implement intelligent sampling strategies that capture critical events while controlling costs, configure actionable alerts that reduce noise and accelerate incident response, optimize telemetry pipelines for performance and reliability, and build operational excellence through SRE practices and continuous improvement.
Sampling Strategies for Production
Sampling reduces telemetry volume by selectively capturing traces based on configurable rules. Production systems require sophisticated sampling that preserves critical signals while discarding routine operations.
graph TB
subgraph Request Processing
A[Incoming Request]
end
subgraph Head Sampling
B{Sample Decision}
C[Sample: Yes]
D[Sample: No]
end
subgraph Tail Sampling
E[Buffer Complete Trace]
F{Analyze Trace}
G[Error Detected]
H[High Latency]
I[Normal Request]
end
subgraph Sampling Outcomes
J[Always Keep]
K[Keep Based on Policy]
L[Discard]
end
A --> B
B -->|10% Probability| C
B -->|90% Probability| D
C --> E
D --> L
E --> F
F --> G
F --> H
F --> I
G --> J
H --> J
I --> K
K -->|5% Probability| J
K -->|95% Probability| L
subgraph Azure Monitor
M[Application Insights]
end
J --> M
style A fill:#68217a
style B fill:#0078d4
style F fill:#0078d4
style M fill:#00bcf2
Head Sampling in .NET
Head sampling makes the sampling decision at trace creation time. This approach is performant but cannot consider trace outcomes like errors or latency.
using Azure.Monitor.OpenTelemetry.AspNetCore;
using OpenTelemetry.Trace;
var builder = WebApplication.CreateBuilder(args);
// Environment-based sampling
var samplingRatio = builder.Environment.IsProduction() ? 0.1 : 1.0;
builder.Services.AddOpenTelemetry()
.WithTracing(tracing =>
{
tracing
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddSqlClientInstrumentation()
.SetSampler(new ParentBasedSampler(
new TraceIdRatioBasedSampler(samplingRatio)
));
})
.UseAzureMonitor();
var app = builder.Build();
app.Run();
Advanced Sampling with Custom Logic
using OpenTelemetry.Trace;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
public class BusinessAwareSampler : Sampler
{
private readonly TraceIdRatioBasedSampler _defaultSampler;
private readonly HashSet<string> _alwaysSamplePaths;
public BusinessAwareSampler(double samplingRatio)
{
_defaultSampler = new TraceIdRatioBasedSampler(samplingRatio);
// Always sample critical business operations
_alwaysSamplePaths = new HashSet<string>
{
"/api/payment",
"/api/checkout",
"/api/refund"
};
}
public override SamplingResult ShouldSample(in SamplingParameters samplingParameters)
{
// Note: at sampling time Activity.Current is the parent activity, if any
var activity = Activity.Current;
// Always sample when the parent activity has recorded an error
if (activity?.Status == ActivityStatusCode.Error)
{
return new SamplingResult(SamplingDecision.RecordAndSample);
}
// Always sample critical paths
var path = activity?.GetTagItem("http.target") as string;
if (path != null && _alwaysSamplePaths.Any(p => path.StartsWith(p)))
{
return new SamplingResult(SamplingDecision.RecordAndSample);
}
// Always sample when the parent activity already shows high latency
// (Duration remains zero until an activity has been stopped)
if (activity?.Duration.TotalMilliseconds > 1000)
{
return new SamplingResult(SamplingDecision.RecordAndSample);
}
// Use default sampling for everything else
return _defaultSampler.ShouldSample(samplingParameters);
}
}
// Register custom sampler
builder.Services.AddOpenTelemetry()
.WithTracing(tracing =>
{
tracing.SetSampler(new ParentBasedSampler(
new BusinessAwareSampler(0.05) // 5% default sampling
));
})
.UseAzureMonitor();
Sampling Configuration for Node.js
const { useAzureMonitor } = require("@azure/monitor-opentelemetry");
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { ParentBasedSampler, TraceIdRatioBasedSampler, SamplingDecision } = require("@opentelemetry/sdk-trace-base");
// Option 1: standalone OpenTelemetry SDK with an explicit sampler (10% of traces)
const sampler = new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(0.1)
});
const provider = new NodeTracerProvider({
sampler: sampler
});
provider.register();
// Option 2: let the Azure Monitor distro configure the provider and pass the ratio directly
useAzureMonitor({
samplingRatio: 0.1
});
// Custom sampler for business logic
class PriorityBasedSampler {
shouldSample(context, traceId, spanName, spanKind, attributes) {
// Always sample errors
if (attributes["error"]) {
return {
decision: SamplingDecision.RECORD_AND_SAMPLED
};
}
// Always sample premium customers
if (attributes["customer.tier"] === "premium") {
return {
decision: SamplingDecision.RECORD_AND_SAMPLED
};
}
// Sample 20% of payment operations
if (spanName.includes("payment")) {
return Math.random() < 0.2
? { decision: SamplingDecision.RECORD_AND_SAMPLED }
: { decision: SamplingDecision.NOT_RECORD };
}
// Default 5% sampling
return Math.random() < 0.05
? { decision: SamplingDecision.RECORD_AND_SAMPLED }
: { decision: SamplingDecision.NOT_RECORD };
}
// Part of the Sampler interface contract
toString() {
return "PriorityBasedSampler";
}
}
Python Sampling Configuration
import os
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
ParentBasedTraceIdRatio,
ALWAYS_ON,
ALWAYS_OFF
)
from azure.monitor.opentelemetry import configure_azure_monitor
# Production environment detection
is_production = os.environ.get("ENV") == "production"
# Configure sampling based on environment
if is_production:
# Sample 10% in production
sampler = ParentBasedTraceIdRatio(0.1)
else:
# Sample everything in development
sampler = ALWAYS_ON
# The sampler above applies when constructing a TracerProvider manually, e.g.:
# provider = TracerProvider(sampler=sampler)
# With the Azure Monitor distro, pass the ratio to configure_azure_monitor instead
# Apply sampling configuration
configure_azure_monitor(
connection_string=os.environ.get("APPLICATIONINSIGHTS_CONNECTION_STRING"),
sampling_ratio=0.1 if is_production else 1.0
)
Intelligent Alerting Strategies
Effective alerting balances coverage with actionability. Too many alerts create noise and fatigue, while too few miss critical issues. Azure Monitor supports metric-based alerts, log-based alerts, and composite conditions.
Error Rate Alerts
// KQL query for error rate alert
requests
| where timestamp > ago(5m)
| summarize
TotalRequests = count(),
FailedRequests = countif(success == false)
| extend ErrorRate = todouble(FailedRequests) / todouble(TotalRequests) * 100
| where ErrorRate > 5 // Alert when error rate exceeds 5%
Latency Percentile Alerts
// Alert on P95 latency degradation
requests
| where timestamp > ago(5m)
| where name == "POST /api/checkout"
| summarize P95Latency = percentile(duration, 95)
| where P95Latency > 2000 // Alert when P95 exceeds 2 seconds
Custom Metric Alerts
// Business metric alert
customMetrics
| where name == "orders.processed"
| where timestamp > ago(15m)
| summarize OrdersPerMinute = sum(value) / 15
| where OrdersPerMinute < 10 // Alert when order rate drops below threshold
Dependency Failure Alerts
// Database connection failure alert
dependencies
| where timestamp > ago(5m)
| where type == "SQL"
| summarize
TotalCalls = count(),
FailedCalls = countif(success == false)
by name
| extend FailureRate = todouble(FailedCalls) / todouble(TotalCalls) * 100
| where FailureRate > 10
Multi-Condition Composite Alert
// Composite condition: High error rate AND low throughput
let ErrorRate = requests
| where timestamp > ago(5m)
| summarize FailureRate = todouble(countif(success == false)) / todouble(count()) * 100
| extend JoinKey = 1;
let Throughput = requests
| where timestamp > ago(5m)
| summarize RequestsPerMinute = count() / 5
| extend JoinKey = 1;
ErrorRate
| join kind=inner (Throughput) on JoinKey
| where FailureRate > 5 and RequestsPerMinute < 50
Performance Optimization
OpenTelemetry instrumentation introduces overhead. Production systems require optimization to minimize performance impact while maintaining observability.
Batch Export Configuration
// .NET batch processor optimization
builder.Services.AddOpenTelemetry()
.WithTracing(tracing =>
{
tracing.AddProcessor(new BatchActivityExportProcessor(
new AzureMonitorTraceExporter(options), // options: AzureMonitorExporterOptions holding the connection string
maxQueueSize: 2048,
scheduledDelayMilliseconds: 5000,
exporterTimeoutMilliseconds: 30000,
maxExportBatchSize: 512
));
});
// Node.js batch configuration
const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");
// exporter: a previously constructed span exporter instance
const processor = new BatchSpanProcessor(exporter, {
maxQueueSize: 2048,
scheduledDelayMillis: 5000,
exportTimeoutMillis: 30000,
maxExportBatchSize: 512
});
Resource Attribute Optimization
// Minimal resource attributes for production
builder.Services.AddOpenTelemetry()
.ConfigureResource(resource =>
{
resource.AddService(
serviceName: "api-service",
serviceVersion: Environment.GetEnvironmentVariable("APP_VERSION"),
serviceInstanceId: Environment.MachineName
);
// Add only essential attributes
resource.AddAttributes(new Dictionary<string, object>
{
["deployment.environment"] = "production",
["cloud.region"] = "eastus",
["k8s.pod.name"] = Environment.GetEnvironmentVariable("HOSTNAME")
});
});
Filtering Low-Value Telemetry
// Filter health check requests
builder.Services.AddOpenTelemetry()
.WithTracing(tracing =>
{
tracing.AddAspNetCoreInstrumentation(options =>
{
options.Filter = httpContext =>
{
// Exclude health checks and monitoring probes
var path = httpContext.Request.Path.Value ?? "";
return !path.StartsWith("/health")
&& !path.StartsWith("/metrics")
&& !path.StartsWith("/ready")
&& !path.StartsWith("/alive");
};
});
});
Cost Optimization
Azure Monitor costs scale with data ingestion volume. Production deployments require strategic cost management through sampling, retention policies, and selective instrumentation.
Tiered Sampling Strategy
// Different sampling rates per environment and service tier
public class TieredSamplingStrategy
{
public static double GetSamplingRatio(string environment, string serviceTier)
{
return (environment, serviceTier) switch
{
("production", "critical") => 0.5, // 50% for critical services
("production", "standard") => 0.1, // 10% for standard services
("production", "background") => 0.01, // 1% for background jobs
("staging", _) => 0.5, // 50% in staging
("development", _) => 1.0, // 100% in development
_ => 0.1
};
}
}
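The class only computes a ratio; the brief sketch below shows one way to wire it into the head sampler at startup. The Service:Tier configuration key is an assumption for illustration, not part of any SDK.
// Resolve the sampling ratio for this deployment and apply it as the head sampler
var ratio = TieredSamplingStrategy.GetSamplingRatio(
    builder.Environment.EnvironmentName.ToLowerInvariant(),
    builder.Configuration["Service:Tier"] ?? "standard");
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing.SetSampler(
        new ParentBasedSampler(new TraceIdRatioBasedSampler(ratio))))
    .UseAzureMonitor();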
Daily Cap Configuration
Configure daily ingestion caps in Application Insights to prevent unexpected cost overruns. Set caps with appropriate margins and configure alerts when approaching limits.
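The cap itself is set on the Application Insights resource or its Log Analytics workspace, but a log query along the lines of this sketch, run against the workspace's Usage table and assuming an illustrative 10 GB daily cap, can drive a warning alert before the cap is reached:
// Billable ingestion over the last day as a share of an assumed 10 GB/day cap
// (Quantity in the Usage table is reported in MB)
Usage
| where TimeGenerated > ago(1d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024.0
| extend PercentOfCap = IngestedGB / 10.0 * 100
| where PercentOfCap > 80 // Warn once 80% of the cap is consumed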
Retention Policy Optimization
Configure data retention in Application Insights according to the value of the data:
- 30 days for standard telemetry
- 90 days for critical business metrics
- 365 days for compliance-required data
- Use continuous export to Azure Storage for cost-effective long-term archival
Operational Dashboards
Production operations require real-time visibility into system health. Build dashboards that surface key metrics, trends, and anomalies.
Service Health Dashboard Queries
// Request rate and error rate
requests
| where timestamp > ago(1h)
| summarize
RequestRate = count() / 60.0,
ErrorRate = todouble(countif(success == false)) / todouble(count()) * 100
by bin(timestamp, 1m)
| render timechart
// Latency distribution
requests
| where timestamp > ago(1h)
| summarize
P50 = percentile(duration, 50),
P95 = percentile(duration, 95),
P99 = percentile(duration, 99)
by bin(timestamp, 5m)
| render timechart
// Dependency health
dependencies
| where timestamp > ago(1h)
| summarize
CallCount = count(),
FailureCount = countif(success == false),
AvgDuration = avg(duration)
by name, type
| extend SuccessRate = 100.0 - (todouble(FailureCount) / todouble(CallCount) * 100)
| project name, type, CallCount, SuccessRate, AvgDuration
| order by FailureCount desc
Business Metrics Dashboard
// Orders processed trend
customMetrics
| where name == "orders.processed"
| where timestamp > ago(24h)
| summarize OrderCount = sum(value) by bin(timestamp, 1h)
| render timechart
// Revenue metrics
customMetrics
| where name == "order.value"
| where timestamp > ago(24h)
| summarize
TotalRevenue = sum(value),
AverageOrderValue = avg(value),
OrderCount = count()
by bin(timestamp, 1h)
| render timechart
Incident Response Integration
Connect Azure Monitor alerts to incident management systems for automated escalation and tracking.
Action Group Configuration
- Configure action groups with multiple notification channels (email, SMS, webhook)
- Integrate with PagerDuty, ServiceNow, or Jira for incident tracking
- Implement escalation policies for different severity levels
- Use Logic Apps or Azure Functions for custom incident workflows (a webhook sketch follows this list)
- Configure runbooks for automated remediation of common issues
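For the custom workflow item above, the following minimal sketch shows one way an ASP.NET Core endpoint could receive an action group webhook that uses the Azure Monitor common alert schema. IIncidentClient and LoggingIncidentClient are hypothetical placeholders for a real ticketing integration, not part of any Azure SDK.
using System.Text.Json;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddSingleton<IIncidentClient, LoggingIncidentClient>();
var app = builder.Build();

// Receives Azure Monitor action group webhooks (common alert schema)
app.MapPost("/alerts/azure-monitor", async (HttpRequest request, IIncidentClient incidents) =>
{
    using var payload = await JsonDocument.ParseAsync(request.Body);
    // data.essentials carries the rule name, severity, and fired/resolved state
    var essentials = payload.RootElement.GetProperty("data").GetProperty("essentials");
    var alertRule = essentials.GetProperty("alertRule").GetString() ?? "unknown rule";
    var severity = essentials.GetProperty("severity").GetString();
    var condition = essentials.GetProperty("monitorCondition").GetString();

    if (condition == "Fired")
    {
        await incidents.OpenIncidentAsync($"{severity}: {alertRule}");
    }
    else if (condition == "Resolved")
    {
        await incidents.ResolveIncidentAsync(alertRule);
    }
    return Results.Ok();
});

app.Run();

// Hypothetical abstraction over PagerDuty, ServiceNow, or Jira
public interface IIncidentClient
{
    Task OpenIncidentAsync(string title);
    Task ResolveIncidentAsync(string alertRule);
}

// Placeholder implementation that only logs; swap in a real ticketing integration
public class LoggingIncidentClient : IIncidentClient
{
    private readonly ILogger<LoggingIncidentClient> _logger;
    public LoggingIncidentClient(ILogger<LoggingIncidentClient> logger) => _logger = logger;

    public Task OpenIncidentAsync(string title)
    {
        _logger.LogWarning("Opening incident: {Title}", title);
        return Task.CompletedTask;
    }

    public Task ResolveIncidentAsync(string alertRule)
    {
        _logger.LogInformation("Resolving incident for rule {Rule}", alertRule);
        return Task.CompletedTask;
    }
}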
Continuous Improvement
Production observability requires ongoing refinement. Establish regular reviews of telemetry effectiveness, alert quality, and cost efficiency. Monitor alert fatigue through acknowledgment rates and time-to-resolution metrics. Adjust sampling strategies based on incident analysis and business requirements. Conduct quarterly reviews of instrumentation coverage and dashboard relevance.
Series Conclusion
This seven-part series covered comprehensive OpenTelemetry implementation with Azure Monitor, from foundational concepts through production-grade patterns. You now have the knowledge to instrument applications across .NET, Node.js, and Python, implement distributed tracing across microservices, create custom metrics for business intelligence, and optimize observability for production environments while managing costs.
The journey to effective observability continues beyond implementation. Modern systems evolve constantly, introducing new services, changing traffic patterns, and encountering novel failure modes. Your observability infrastructure must evolve alongside your applications, continuously adapting to new challenges while maintaining operational excellence.
