Production observability transforms from an implementation detail into a business-critical capability as applications scale. Development environments can tolerate collecting every trace and logging every event, but production systems processing millions of requests daily require strategic data collection, intelligent sampling, proactive alerting, and cost-conscious telemetry management. OpenTelemetry with Azure Monitor provides the flexibility to implement sophisticated production monitoring patterns that balance comprehensive visibility with operational efficiency.
This final article in the series explores production-grade observability patterns, demonstrating how to implement intelligent sampling strategies that capture critical events while controlling costs, configure actionable alerts that reduce noise and accelerate incident response, optimize telemetry pipelines for performance and reliability, and build operational excellence through SRE practices and continuous improvement.
Sampling Strategies for Production
Sampling reduces telemetry volume by selectively capturing traces based on configurable rules. Production systems require sophisticated sampling that preserves critical signals while discarding routine operations.
graph TB
subgraph Request Processing
A[Incoming Request]
end
subgraph Head Sampling
B{Sample Decision}
C[Sample: Yes]
D[Sample: No]
end
subgraph Tail Sampling
E[Buffer Complete Trace]
F{Analyze Trace}
G[Error Detected]
H[High Latency]
I[Normal Request]
end
subgraph Sampling Outcomes
J[Always Keep]
K[Keep Based on Policy]
L[Discard]
end
A --> B
B -->|10% Probability| C
B -->|90% Probability| D
C --> E
D --> L
E --> F
F --> G
F --> H
F --> I
G --> J
H --> J
I --> K
K -->|5% Probability| J
K -->|95% Probability| L
subgraph Azure Monitor
M[Application Insights]
end
J --> M
style A fill:#68217a
style B fill:#0078d4
style F fill:#0078d4
style M fill:#00bcf2
Head Sampling in .NET
Head sampling makes the sampling decision at trace creation time. This approach is performant but cannot consider trace outcomes like errors or latency.
using Azure.Monitor.OpenTelemetry.AspNetCore;
using OpenTelemetry.Trace;
var builder = WebApplication.CreateBuilder(args);
// Environment-based sampling
var samplingRatio = builder.Environment.IsProduction() ? 0.1 : 1.0;
builder.Services.AddOpenTelemetry()
.WithTracing(tracing =>
{
tracing
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddSqlClientInstrumentation()
.SetSampler(new ParentBasedSampler(
new TraceIdRatioBasedSampler(samplingRatio)
));
})
.UseAzureMonitor();
var app = builder.Build();
app.Run();
Advanced Sampling with Custom Logic
using OpenTelemetry.Trace;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
public class BusinessAwareSampler : Sampler
{
private readonly TraceIdRatioBasedSampler _defaultSampler;
private readonly HashSet<string> _alwaysSamplePaths;
public BusinessAwareSampler(double samplingRatio)
{
_defaultSampler = new TraceIdRatioBasedSampler(samplingRatio);
// Always sample critical business operations
_alwaysSamplePaths = new HashSet<string>
{
"/api/payment",
"/api/checkout",
"/api/refund"
};
}
public override SamplingResult ShouldSample(in SamplingParameters samplingParameters)
{
// Note: at sampling time Activity.Current is the parent activity, if any
var activity = Activity.Current;
// Always sample when the parent activity has recorded an error
if (activity?.Status == ActivityStatusCode.Error)
{
return new SamplingResult(SamplingDecision.RecordAndSample);
}
// Always sample critical paths
var path = activity?.GetTagItem("http.target") as string;
if (path != null && _alwaysSamplePaths.Any(p => path.StartsWith(p)))
{
return new SamplingResult(SamplingDecision.RecordAndSample);
}
// Always sample when the parent activity already shows high latency
// (Duration remains zero until an activity has been stopped)
if (activity?.Duration.TotalMilliseconds > 1000)
{
return new SamplingResult(SamplingDecision.RecordAndSample);
}
// Use default sampling for everything else
return _defaultSampler.ShouldSample(samplingParameters);
}
}
// Register custom sampler
builder.Services.AddOpenTelemetry()
.WithTracing(tracing =>
{
tracing.SetSampler(new ParentBasedSampler(
new BusinessAwareSampler(0.05) // 5% default sampling
));
})
.UseAzureMonitor();
Sampling Configuration for Node.js
const { useAzureMonitor } = require("@azure/monitor-opentelemetry");
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { ParentBasedSampler, TraceIdRatioBasedSampler, SamplingDecision } = require("@opentelemetry/sdk-trace-base");
// Option 1: standalone OpenTelemetry SDK with an explicit sampler (10% of traces)
const sampler = new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(0.1)
});
const provider = new NodeTracerProvider({
sampler: sampler
});
provider.register();
// Option 2: let the Azure Monitor distro configure the provider and pass the ratio directly
useAzureMonitor({
samplingRatio: 0.1
});
// Custom sampler for business logic
class PriorityBasedSampler {
shouldSample(context, traceId, spanName, spanKind, attributes) {
// Always sample errors
if (attributes["error"]) {
return {
decision: SamplingDecision.RECORD_AND_SAMPLED
};
}
// Always sample premium customers
if (attributes["customer.tier"] === "premium") {
return {
decision: SamplingDecision.RECORD_AND_SAMPLED
};
}
// Sample 20% of payment operations
if (spanName.includes("payment")) {
return Math.random() < 0.2
? { decision: SamplingDecision.RECORD_AND_SAMPLED }
: { decision: SamplingDecision.NOT_RECORD };
}
// Default 5% sampling
return Math.random() < 0.05
? { decision: SamplingDecision.RECORD_AND_SAMPLED }
: { decision: SamplingDecision.NOT_RECORD };
}
// Part of the Sampler interface contract
toString() {
return "PriorityBasedSampler";
}
}
Python Sampling Configuration
import os
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
ParentBasedTraceIdRatio,
ALWAYS_ON,
ALWAYS_OFF
)
from azure.monitor.opentelemetry import configure_azure_monitor
# Production environment detection
is_production = os.environ.get("ENV") == "production"
# Configure sampling based on environment
if is_production:
# Sample 10% in production
sampler = ParentBasedTraceIdRatio(0.1)
else:
# Sample everything in development
sampler = ALWAYS_ON
# The sampler above applies when constructing a TracerProvider manually, e.g.:
# provider = TracerProvider(sampler=sampler)
# With the Azure Monitor distro, pass the ratio to configure_azure_monitor instead
# Apply sampling configuration
configure_azure_monitor(
connection_string=os.environ.get("APPLICATIONINSIGHTS_CONNECTION_STRING"),
sampling_ratio=0.1 if is_production else 1.0
)
Intelligent Alerting Strategies
Effective alerting balances coverage with actionability. Too many alerts create noise and fatigue, while too few miss critical issues. Azure Monitor supports metric-based alerts, log-based alerts, and composite conditions.
Error Rate Alerts
// KQL query for error rate alert
requests
| where timestamp > ago(5m)
| summarize
TotalRequests = count(),
FailedRequests = countif(success == false)
| extend ErrorRate = todouble(FailedRequests) / todouble(TotalRequests) * 100
| where ErrorRate > 5 // Alert when error rate exceeds 5%
Latency Percentile Alerts
// Alert on P95 latency degradation
requests
| where timestamp > ago(5m)
| where name == "POST /api/checkout"
| summarize P95Latency = percentile(duration, 95)
| where P95Latency > 2000 // Alert when P95 exceeds 2 seconds
Custom Metric Alerts
// Business metric alert
customMetrics
| where name == "orders.processed"
| where timestamp > ago(15m)
| summarize OrdersPerMinute = sum(value) / 15
| where OrdersPerMinute < 10 // Alert when order rate drops below threshold
Dependency Failure Alerts
// Database connection failure alert
dependencies
| where timestamp > ago(5m)
| where type == "SQL"
| summarize
TotalCalls = count(),
FailedCalls = countif(success == false)
by name
| extend FailureRate = todouble(FailedCalls) / todouble(TotalCalls) * 100
| where FailureRate > 10
Multi-Condition Composite Alert
// Composite condition: High error rate AND low throughput
let ErrorRate = requests
| where timestamp > ago(5m)
| summarize FailureRate = todouble(countif(success == false)) / todouble(count()) * 100
| extend JoinKey = 1;
let Throughput = requests
| where timestamp > ago(5m)
| summarize RequestsPerMinute = count() / 5
| extend JoinKey = 1;
ErrorRate
| join kind=inner (Throughput) on JoinKey
| where FailureRate > 5 and RequestsPerMinute < 50
Performance Optimization
OpenTelemetry instrumentation introduces overhead. Production systems require optimization to minimize performance impact while maintaining observability.
Batch Export Configuration
// .NET batch processor optimization
builder.Services.AddOpenTelemetry()
.WithTracing(tracing =>
{
tracing.AddProcessor(new BatchActivityExportProcessor(
new AzureMonitorTraceExporter(options), // options: AzureMonitorExporterOptions holding the connection string
maxQueueSize: 2048,
scheduledDelayMilliseconds: 5000,
exporterTimeoutMilliseconds: 30000,
maxExportBatchSize: 512
));
});
// Node.js batch configuration
const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");
// exporter: a previously constructed span exporter instance
const processor = new BatchSpanProcessor(exporter, {
maxQueueSize: 2048,
scheduledDelayMillis: 5000,
exportTimeoutMillis: 30000,
maxExportBatchSize: 512
});
Resource Attribute Optimization
// Minimal resource attributes for production
builder.Services.AddOpenTelemetry()
.ConfigureResource(resource =>
{
resource.AddService(
serviceName: "api-service",
serviceVersion: Environment.GetEnvironmentVariable("APP_VERSION"),
serviceInstanceId: Environment.MachineName
);
// Add only essential attributes
resource.AddAttributes(new Dictionary<string, object>
{
["deployment.environment"] = "production",
["cloud.region"] = "eastus",
["k8s.pod.name"] = Environment.GetEnvironmentVariable("HOSTNAME")
});
});
Filtering Low-Value Telemetry
// Filter health check requests
builder.Services.AddOpenTelemetry()
.WithTracing(tracing =>
{
tracing.AddAspNetCoreInstrumentation(options =>
{
options.Filter = httpContext =>
{
// Exclude health checks and monitoring probes
var path = httpContext.Request.Path.Value ?? "";
return !path.StartsWith("/health")
&& !path.StartsWith("/metrics")
&& !path.StartsWith("/ready")
&& !path.StartsWith("/alive");
};
});
});
Cost Optimization
Azure Monitor costs scale with data ingestion volume. Production deployments require strategic cost management through sampling, retention policies, and selective instrumentation.
Tiered Sampling Strategy
// Different sampling rates per environment and service tier
public class TieredSamplingStrategy
{
public static double GetSamplingRatio(string environment, string serviceTier)
{
return (environment, serviceTier) switch
{
("production", "critical") => 0.5, // 50% for critical services
("production", "standard") => 0.1, // 10% for standard services
("production", "background") => 0.01, // 1% for background jobs
("staging", _) => 0.5, // 50% in staging
("development", _) => 1.0, // 100% in development
_ => 0.1
};
}
}
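The class only computes a ratio; the brief sketch below shows one way to wire it into the head sampler at startup. The Service:Tier configuration key is an assumption for illustration, not part of any SDK.
// Resolve the sampling ratio for this deployment and apply it as the head sampler
var ratio = TieredSamplingStrategy.GetSamplingRatio(
    builder.Environment.EnvironmentName.ToLowerInvariant(),
    builder.Configuration["Service:Tier"] ?? "standard");
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing.SetSampler(
        new ParentBasedSampler(new TraceIdRatioBasedSampler(ratio))))
    .UseAzureMonitor();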
Daily Cap Configuration
Configure daily ingestion caps in Application Insights to prevent unexpected cost overruns. Set caps with appropriate margins and configure alerts when approaching limits.
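The cap itself is set on the Application Insights resource or its Log Analytics workspace, but a log query along the lines of this sketch, run against the workspace's Usage table and assuming an illustrative 10 GB daily cap, can drive a warning alert before the cap is reached:
// Billable ingestion over the last day as a share of an assumed 10 GB/day cap
// (Quantity in the Usage table is reported in MB)
Usage
| where TimeGenerated > ago(1d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024.0
| extend PercentOfCap = IngestedGB / 10.0 * 100
| where PercentOfCap > 80 // Warn once 80% of the cap is consumed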
Retention Policy Optimization
Configure data retention in Application Insights according to the value of the data:
- 30 days for standard telemetry
- 90 days for critical business metrics
- 365 days for compliance-required data
- Use continuous export to Azure Storage for cost-effective long-term archival
Operational Dashboards
Production operations require real-time visibility into system health. Build dashboards that surface key metrics, trends, and anomalies.
Service Health Dashboard Queries
// Request rate and error rate
requests
| where timestamp > ago(1h)
| summarize
RequestRate = count() / 60.0,
ErrorRate = todouble(countif(success == false)) / todouble(count()) * 100
by bin(timestamp, 1m)
| render timechart
// Latency distribution
requests
| where timestamp > ago(1h)
| summarize
P50 = percentile(duration, 50),
P95 = percentile(duration, 95),
P99 = percentile(duration, 99)
by bin(timestamp, 5m)
| render timechart
// Dependency health
dependencies
| where timestamp > ago(1h)
| summarize
CallCount = count(),
FailureCount = countif(success == false),
AvgDuration = avg(duration)
by name, type
| extend SuccessRate = 100.0 - (todouble(FailureCount) / todouble(CallCount) * 100)
| project name, type, CallCount, SuccessRate, AvgDuration
| order by FailureCount desc
Business Metrics Dashboard
// Orders processed trend
customMetrics
| where name == "orders.processed"
| where timestamp > ago(24h)
| summarize OrderCount = sum(value) by bin(timestamp, 1h)
| render timechart
// Revenue metrics
customMetrics
| where name == "order.value"
| where timestamp > ago(24h)
| summarize
TotalRevenue = sum(value),
AverageOrderValue = avg(value),
OrderCount = count()
by bin(timestamp, 1h)
| render timechart
Incident Response Integration
Connect Azure Monitor alerts to incident management systems for automated escalation and tracking.
Action Group Configuration
- Configure action groups with multiple notification channels (email, SMS, webhook)
- Integrate with PagerDuty, ServiceNow, or Jira for incident tracking
- Implement escalation policies for different severity levels
- Use Logic Apps or Azure Functions for custom incident workflows (a webhook sketch follows this list)
- Configure runbooks for automated remediation of common issues
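For the custom workflow item above, the following minimal sketch shows one way an ASP.NET Core endpoint could receive an action group webhook that uses the Azure Monitor common alert schema. IIncidentClient and LoggingIncidentClient are hypothetical placeholders for a real ticketing integration, not part of any Azure SDK.
using System.Text.Json;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddSingleton<IIncidentClient, LoggingIncidentClient>();
var app = builder.Build();

// Receives Azure Monitor action group webhooks (common alert schema)
app.MapPost("/alerts/azure-monitor", async (HttpRequest request, IIncidentClient incidents) =>
{
    using var payload = await JsonDocument.ParseAsync(request.Body);
    // data.essentials carries the rule name, severity, and fired/resolved state
    var essentials = payload.RootElement.GetProperty("data").GetProperty("essentials");
    var alertRule = essentials.GetProperty("alertRule").GetString() ?? "unknown rule";
    var severity = essentials.GetProperty("severity").GetString();
    var condition = essentials.GetProperty("monitorCondition").GetString();

    if (condition == "Fired")
    {
        await incidents.OpenIncidentAsync($"{severity}: {alertRule}");
    }
    else if (condition == "Resolved")
    {
        await incidents.ResolveIncidentAsync(alertRule);
    }
    return Results.Ok();
});

app.Run();

// Hypothetical abstraction over PagerDuty, ServiceNow, or Jira
public interface IIncidentClient
{
    Task OpenIncidentAsync(string title);
    Task ResolveIncidentAsync(string alertRule);
}

// Placeholder implementation that only logs; swap in a real ticketing integration
public class LoggingIncidentClient : IIncidentClient
{
    private readonly ILogger<LoggingIncidentClient> _logger;
    public LoggingIncidentClient(ILogger<LoggingIncidentClient> logger) => _logger = logger;

    public Task OpenIncidentAsync(string title)
    {
        _logger.LogWarning("Opening incident: {Title}", title);
        return Task.CompletedTask;
    }

    public Task ResolveIncidentAsync(string alertRule)
    {
        _logger.LogInformation("Resolving incident for rule {Rule}", alertRule);
        return Task.CompletedTask;
    }
}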
Continuous Improvement
Production observability requires ongoing refinement. Establish regular reviews of telemetry effectiveness, alert quality, and cost efficiency. Monitor alert fatigue through acknowledgment rates and time-to-resolution metrics. Adjust sampling strategies based on incident analysis and business requirements. Conduct quarterly reviews of instrumentation coverage and dashboard relevance.
Series Conclusion
This seven-part series covered comprehensive OpenTelemetry implementation with Azure Monitor, from foundational concepts through production-grade patterns. You now have the knowledge to instrument applications across .NET, Node.js, and Python, implement distributed tracing across microservices, create custom metrics for business intelligence, and optimize observability for production environments while managing costs.
The journey to effective observability continues beyond implementation. Modern systems evolve constantly, introducing new services, changing traffic patterns, and encountering novel failure modes. Your observability infrastructure must evolve alongside your applications, continuously adapting to new challenges while maintaining operational excellence.
