A2A in Production: Observability, Governance and Scaling (Part 8 of 8)

Parts 1 through 7 built a complete, secured, multi-agent system using A2A and MCP. Everything works on a single machine with in-memory state. This final part takes it to production: distributed tracing across agent hops, structured logging that correlates MCP tool calls with the A2A task that triggered them, a Redis-backed task store for horizontal scaling, and a deployment pattern on Azure Container Apps.

These are not optional polish steps. Without observability you cannot debug failures that span three agents. Without persistent state you cannot scale beyond one instance. Without a deployment pattern you have a prototype, not a product.

The Production Readiness Gaps

Before adding anything, here is an honest list of what the current implementation is missing for production:

  • Observability: no distributed traces, no metrics, no structured logs that span agent hops
  • Persistent state: in-memory task store is lost on restart and cannot be shared across instances
  • Horizontal scaling: SSE connections are pinned to a single instance
  • Deployment: no container definitions, no health checks, no environment management
  • Graceful degradation: no circuit breakers, no timeout enforcement at the task level

Distributed Tracing with OpenTelemetry

The hardest debugging problem in multi-agent systems is tracing a failure across multiple agents. When a task fails in the Procurement Agent, was the problem caused by a bad input from the Orchestrator, a timeout in the ERP MCP server, or an internal logic error? Without a trace that spans all three, you are guessing.

OpenTelemetry is the standard solution. It propagates a trace context through every hop so that one trace in your observability platform shows the complete end-to-end picture.

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http @opentelemetry/sdk-trace-base \
  @opentelemetry/api
// src/instrumentation.js - load this before everything else

import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
} from "@opentelemetry/semantic-conventions";

const exporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://localhost:4318/v1/traces",
});

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]:    process.env.SERVICE_NAME    || "inventory-agent",
    [SEMRESATTRS_SERVICE_VERSION]: process.env.SERVICE_VERSION || "1.0.0",
  }),
  traceExporter: exporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-fs": { enabled: false },
    }),
  ],
});

sdk.start();
process.on("SIGTERM", () => sdk.shutdown());
// server.js - first import must be instrumentation
import "./src/instrumentation.js";
import express from "express";
// ... rest of imports

Propagating Trace Context Across A2A Hops

Auto-instrumentation handles HTTP spans automatically, but you still need to propagate the W3C trace context headers so spans in different agents attach to the same root trace. Update the A2A client to inject these headers on every outgoing request:

// src/a2aClient.js - trace-aware headers and spans

import { context, propagation, trace } from "@opentelemetry/api";

async _headers(agentUrl) {
  const token   = await this._tokenManager.getToken(this._scopeMap[agentUrl] || "");
  const headers = {
    "Content-Type":  "application/json",
    "Authorization": `Bearer ${token}`,
  };
  // Inject W3C traceparent + tracestate
  propagation.inject(context.active(), headers);
  return headers;
}

async sendTask(agentUrl, taskId, message, pushNotification = null) {
  const tracer = trace.getTracer("a2a-client");
  return tracer.startActiveSpan(
    `a2a tasks/send -> ${new URL(agentUrl).hostname}`,
    async (span) => {
      span.setAttributes({ "a2a.task_id": taskId, "a2a.agent_url": agentUrl });
      try {
        const params = { id: taskId, message };
        if (pushNotification) params.pushNotification = pushNotification;

        const response = await fetch(agentUrl, {
          method:  "POST",
          headers: await this._headers(agentUrl),
          body:    this._rpcBody("tasks/send", params),
        });

        const data = await response.json();
        if (data.error) throw new Error(`A2A error ${data.error.code}: ${data.error.message}`);
        span.setStatus({ code: 1 }); // 1 = SpanStatusCode.OK
        return data.result;
      } catch (err) {
        span.recordException(err);
        span.setStatus({ code: 2, message: err.message }); // 2 = SpanStatusCode.ERROR
        throw err;
      } finally {
        span.end();
      }
    }
  );
}

On the receiving agent, extract the incoming trace context in middleware so all child spans connect back to the originating trace:

// src/middleware/traceContext.js

import { context, propagation, trace } from "@opentelemetry/api";

export function extractTraceContext(req, res, next) {
  const parentContext = propagation.extract(context.active(), req.headers);

  context.with(parentContext, () => {
    const tracer = trace.getTracer("a2a-server");
    tracer.startActiveSpan(
      `a2a.${req.body?.method || "request"} [${req.body?.params?.id || ""}]`,
      (span) => {
        span.setAttributes({
          "a2a.method":     req.body?.method,
          "a2a.task_id":    req.body?.params?.id,
          "a2a.agent_name": process.env.SERVICE_NAME || "agent",
        });
        req.activeSpan = span;
        res.on("finish", () => {
          span.setStatus({ code: res.statusCode < 400 ? 1 : 2 }); // OK : ERROR
          span.end();
        });
        next();
      }
    );
  });
}
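For reference, the header that propagation.inject writes and propagation.extract reads is the W3C traceparent: four dash-separated hex fields. The OpenTelemetry SDK handles this for you, but a small illustrative parser (not part of the middleware above) makes the format concrete:

```javascript
// W3C traceparent: version (2 hex) - trace-id (32 hex) - span-id (16 hex) - flags (2 hex)
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null; // malformed headers are ignored, not errors
  const [, version, traceId, spanId, flags] = match;
  // The lowest flag bit marks whether the upstream hop sampled this trace
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

Given the trace from the example below, `parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")` yields the trace ID, span ID, and a sampled flag of true.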

With propagation in place, a single trace in Azure Monitor, Jaeger, or Tempo spans the entire workflow:

gantt
    title Distributed Trace - Supply Chain Workflow (trace ID: 4bf92f35...)
    dateFormat x
    axisFormat %Lms

    section Orchestrator
    Workflow root span           :0, 1800
    delegate inventory check     :10, 900
    delegate procurement         :950, 1790

    section Inventory Agent
    a2a tasks/send               :50, 850
    MCP get_low_stock_items      :200, 600

    section Procurement Agent
    a2a tasks/send               :1000, 1750
    MCP create_purchase_order    :1100, 1500
    MCP send_po_notification     :1520, 1720

Structured Logging with Trace Correlation

Traces show where time went; logs capture what each step actually did. A thin wrapper around pino stamps every entry with the active trace and span IDs, so logs and traces join on the same identifiers in your observability platform:

npm install pino
// src/logger.js

import pino from "pino";
import { trace } from "@opentelemetry/api";

const base = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: { level: (label) => ({ level: label }) },
  timestamp: pino.stdTimeFunctions.isoTime,
  base: {
    service: process.env.SERVICE_NAME    || "a2a-agent",
    version: process.env.SERVICE_VERSION || "1.0.0",
  },
});

function withTrace(data = {}) {
  const spanCtx = trace.getActiveSpan()?.spanContext();
  if (!spanCtx) return data;
  return { ...data, traceId: spanCtx.traceId, spanId: spanCtx.spanId };
}

export const logger = {
  info:  (msg, data) => base.info(withTrace(data),  msg),
  warn:  (msg, data) => base.warn(withTrace(data),  msg),
  error: (msg, data) => base.error(withTrace(data), msg),
  debug: (msg, data) => base.debug(withTrace(data), msg),
};

Every log line in production carries the trace and span IDs automatically:

{
  "level": "info",
  "time": "2026-03-13T07:12:44.231Z",
  "service": "inventory-agent",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "taskId": "task-abc-001",
  "skus": ["SKU-12345", "SKU-67890"],
  "msg": "Querying inventory database via MCP"
}

Redis-Backed Task Store

The in-memory task store from Part 3 works perfectly on one instance. The moment you run two instances behind a load balancer, tasks created on instance A are invisible to instance B. Replace it with Redis so any instance can read and write any task.

npm install ioredis
// src/taskStoreRedis.js

import Redis from "ioredis";

export const redis = new Redis({
  host:     process.env.REDIS_HOST     || "localhost",
  port:     parseInt(process.env.REDIS_PORT || "6379"),
  password: process.env.REDIS_PASSWORD || undefined,
  tls:      process.env.REDIS_TLS === "true" ? {} : undefined,
  keyPrefix: "a2a:task:",
  retryStrategy: (times) => Math.min(times * 100, 3000),
});

const TASK_TTL_SECONDS = 60 * 60 * 24; // 24 hours
const TERMINAL_STATES  = new Set(["completed", "failed", "canceled"]);

// In-memory SSE subscriber map - local to each instance
const sseSubscribers = new Map();

// Redis pub/sub channel for broadcasting state updates across instances
const publisher  = redis;
const subscriber = redis.duplicate();

await subscriber.subscribe("a2a:events");
subscriber.on("message", (channel, raw) => {
  const event = JSON.parse(raw);
  broadcastLocalSse(event.taskId, event);
});

export async function createTask(taskId, initialMessage, pushNotification = null) {
  const task = {
    id: taskId,
    status:          { state: "submitted", timestamp: new Date().toISOString() },
    messages:        [initialMessage],
    artifacts:       [],
    pushNotification,
    history:         [{ state: "submitted", timestamp: new Date().toISOString() }],
  };
  await redis.setex(taskId, TASK_TTL_SECONDS, JSON.stringify(task));
  sseSubscribers.set(taskId, new Set());
  return task;
}

export async function getTask(taskId) {
  const raw = await redis.get(taskId);
  return raw ? JSON.parse(raw) : null;
}

export async function updateTaskState(taskId, newState, message = null) {
  const task = await getTask(taskId);
  if (!task) throw new Error(`Task ${taskId} not found`);

  const timestamp = new Date().toISOString();
  task.status = { state: newState, timestamp };
  if (message) task.messages.push(message);
  task.history.push({ state: newState, timestamp });

  await redis.setex(taskId, TASK_TTL_SECONDS, JSON.stringify(task));

  const event = { type: "taskStatusUpdate", taskId, status: { ...task.status, message } };
  await publisher.publish("a2a:events", JSON.stringify(event));
}

export async function addArtifact(taskId, artifact, lastChunk = true) {
  const task = await getTask(taskId);
  if (!task) throw new Error(`Task ${taskId} not found`);

  task.artifacts.push(artifact);
  await redis.setex(taskId, TASK_TTL_SECONDS, JSON.stringify(task));

  const event = { type: "taskArtifactUpdate", taskId, artifact: { ...artifact, lastChunk } };
  await publisher.publish("a2a:events", JSON.stringify(event));
}

export function addSseClient(taskId, callback) {
  if (!sseSubscribers.has(taskId)) sseSubscribers.set(taskId, new Set());
  sseSubscribers.get(taskId).add(callback);
}

export function removeSseClient(taskId, callback) {
  sseSubscribers.get(taskId)?.delete(callback);
}

export function isTerminal(state) {
  return TERMINAL_STATES.has(state);
}

function broadcastLocalSse(taskId, event) {
  const clients = sseSubscribers.get(taskId);
  if (!clients) return;
  for (const send of clients) {
    try { send(event); } catch { clients.delete(send); }
  }
}

The key design decision here is that Redis pub/sub handles cross-instance SSE broadcasting. When instance A updates a task's state, it publishes to the a2a:events channel. Every instance, including A, receives the message and forwards it to any locally connected SSE clients. SSE clients can therefore connect to any instance and still receive every update, regardless of which instance processes the task.

flowchart LR
    C1[Client SSE\nconnected to Instance A]
    C2[Client SSE\nconnected to Instance B]

    subgraph InstanceA["Instance A"]
        IA[Task Handler\nprocessing task-001]
    end

    subgraph InstanceB["Instance B"]
        IB[SSE Manager\nlocal subscribers]
    end

    subgraph Redis
        RDB[(Task State\nRedis String + TTL)]
        PUB[Pub/Sub\na2a:events channel]
    end

    IA -->|setex task-001| RDB
    IA -->|publish event| PUB
    PUB -->|message| InstanceA
    PUB -->|message| InstanceB
    InstanceA -->|write SSE| C1
    InstanceB -->|write SSE| C2
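The local callbacks registered through addSseClient are, in practice, res.write calls emitting SSE frames. A minimal sketch of that wiring, with the route shape assumed from Part 3's tasks/sendSubscribe handler:

```javascript
// Serialize a task event into a Server-Sent Events frame:
// an "event:" line naming the event type, a "data:" line with the JSON
// payload, and a blank line terminating the frame.
function sseFrame(event) {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}

// Assumed wiring inside the tasks/sendSubscribe route from Part 3:
//   res.setHeader("Content-Type", "text/event-stream");
//   const send = (event) => res.write(sseFrame(event));
//   addSseClient(taskId, send);
//   req.on("close", () => removeSseClient(taskId, send));
```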

Task Timeout Enforcement

Long-running tasks that never complete or fail can leak Redis keys and leave SSE clients hanging. A timeout wrapper ensures every task resolves within a configurable deadline:

// src/taskRunner.js

import { updateTaskState, isTerminal, getTask } from "./taskStoreRedis.js";
import { processTask } from "./taskHandler.js";
import { logger } from "./logger.js";

const DEFAULT_TIMEOUT_MS = parseInt(process.env.TASK_TIMEOUT_MS || "300000"); // 5 minutes

export async function runTaskWithTimeout(taskId, timeoutMs = DEFAULT_TIMEOUT_MS) {
  const timeoutHandle = setTimeout(async () => {
    const task = await getTask(taskId);
    if (!task || isTerminal(task.status.state)) return;

    logger.warn("Task timed out, marking as failed", { taskId, timeoutMs });
    await updateTaskState(taskId, "failed", {
      role:  "agent",
      parts: [{ type: "text", text: `Task exceeded timeout of ${timeoutMs / 1000}s and was terminated.` }],
    });
  }, timeoutMs);

  try {
    await processTask(taskId);
  } finally {
    clearTimeout(timeoutHandle);
  }
}
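The readiness gaps list at the top also called out circuit breakers, which the series otherwise leaves unaddressed. A minimal counter-based breaker could sit in front of MCP or downstream A2A calls; this is a sketch of the pattern, not a production implementation (thresholds and the half-open probe policy are illustrative):

```javascript
// Minimal circuit breaker: opens after N consecutive failures, allows a
// retry probe ("half-open") once resetAfterMs has elapsed, and closes
// again on the first success.
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetAfterMs = 30_000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.failures = 0;
    this.openedAt = null;
  }

  get state() {
    if (this.openedAt === null) return "closed";
    return Date.now() - this.openedAt >= this.resetAfterMs ? "half-open" : "open";
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure() {
    if (++this.failures >= this.failureThreshold) this.openedAt = Date.now();
  }

  async exec(fn) {
    if (this.state === "open") throw new Error("Circuit open, call rejected");
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }
}
```

Wrapping the MCP client calls in `breaker.exec(() => mcpClient.callTool(...))` means a flapping ERP server fails fast instead of burning the task's entire timeout budget on each attempt.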

Health Check and Readiness Probe

Azure Container Apps and Kubernetes both use health and readiness probes to decide when to route traffic to an instance. A meaningful health check verifies both the HTTP server and the Redis connection:

// src/routes/health.js

import { Router } from "express";
import { redis } from "../taskStoreRedis.js"; // requires exporting the Redis client from the store

const router = Router();

router.get("/health", async (req, res) => {
  const checks = { status: "ok", redis: "unknown", uptime: process.uptime() };

  try {
    await redis.ping();
    checks.redis = "ok";
  } catch (err) {
    checks.redis = "error";
    checks.status = "degraded";
  }

  const httpStatus = checks.status === "ok" ? 200 : 503;
  res.status(httpStatus).json(checks);
});

// Readiness - returns 503 until Redis is reachable (extend with an MCP connectivity check as well)
router.get("/ready", async (req, res) => {
  try {
    await redis.ping();
    res.json({ ready: true });
  } catch {
    res.status(503).json({ ready: false, reason: "Redis not connected" });
  }
});

export default router;

Containerization

# Dockerfile

FROM node:22-alpine AS base
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

FROM base AS production
COPY src ./src
COPY mcp-servers ./mcp-servers
COPY server.js .

# Run as non-root user
RUN addgroup -S agent && adduser -S agent -G agent
USER agent

EXPOSE 3000

# Health check for Docker / Container Apps
HEALTHCHECK --interval=30s --timeout=10s --start-period=15s --retries=3 \
  CMD wget -qO- http://localhost:3000/health || exit 1

CMD ["node", "--import", "./src/instrumentation.js", "server.js"]

Azure Container Apps Deployment

Azure Container Apps is the right deployment target for A2A agents. It handles scaling to zero, HTTP-triggered scaling, secrets management via Key Vault references, and built-in Dapr integration if you later want distributed actor support.

# 1. Create the Container Apps environment
az containerapp env create \
  --name a2a-agents-env \
  --resource-group a2a-rg \
  --location eastus

# 2. Create Azure Cache for Redis
az redis create \
  --name a2a-redis \
  --resource-group a2a-rg \
  --location eastus \
  --sku Basic \
  --vm-size C0

# 3. Deploy the Inventory Agent
az containerapp create \
  --name inventory-agent \
  --resource-group a2a-rg \
  --environment a2a-agents-env \
  --image your-registry.azurecr.io/inventory-agent:latest \
  --target-port 3000 \
  --ingress external \
  --min-replicas 1 \
  --max-replicas 10 \
  --scale-rule-name http-scaling \
  --scale-rule-type http \
  --scale-rule-http-concurrency 50 \
  --secrets \
    redis-password=keyvaultref:https://your-kv.vault.azure.net/secrets/redis-password,identityref:system \
    api-token=keyvaultref:https://your-kv.vault.azure.net/secrets/api-token,identityref:system \
    azure-client-secret=keyvaultref:https://your-kv.vault.azure.net/secrets/client-secret,identityref:system \
  --env-vars \
    SERVICE_NAME=inventory-agent \
    SERVICE_VERSION=1.0.0 \
    REDIS_HOST=a2a-redis.redis.cache.windows.net \
    REDIS_PORT=6380 \
    REDIS_TLS=true \
    REDIS_PASSWORD=secretref:redis-password \
    API_TOKEN=secretref:api-token \
    AZURE_CLIENT_SECRET=secretref:azure-client-secret \
    OTEL_EXPORTER_OTLP_ENDPOINT=https://your-otel-collector.example.com/v1/traces \
    TASK_TIMEOUT_MS=300000

The AGENT_URL environment variable for each deployed agent should point to its Container Apps ingress URL. The orchestrator's scope map uses these URLs for token acquisition and routing:

// orchestrator - production configuration
const scopeMap = {
  "https://inventory-agent.your-env.eastus.azurecontainerapps.io": "api://inventory-agent/.default",
  "https://procurement-agent.your-env.eastus.azurecontainerapps.io": "api://procurement-agent/.default",
};

Production Architecture Overview

flowchart TD
    subgraph AzureContainerApps["Azure Container Apps Environment"]
        subgraph OrchestratorCA["Orchestrator (1-5 replicas)"]
            OA1[Instance 1]
            OA2[Instance 2]
        end

        subgraph InventoryCA["Inventory Agent (1-10 replicas)"]
            IA1[Instance 1]
            IA2[Instance 2]
        end

        subgraph ProcurementCA["Procurement Agent (1-5 replicas)"]
            PA1[Instance 1]
        end
    end

    subgraph SharedInfra["Shared Infrastructure"]
        REDIS[(Azure Cache\nfor Redis)]
        ACR[Azure Container\nRegistry]
        KV[Azure Key Vault\nSecrets]
        AM[Azure Monitor\nTraces + Logs]
        ENTRA[Microsoft Entra ID\nJWT + OAuth2]
    end

    OA1 -->|A2A + JWT| IA1
    OA2 -->|A2A + JWT| IA2
    OA1 -->|A2A + JWT| PA1

    IA1 -->|task state| REDIS
    IA2 -->|task state| REDIS
    REDIS -->|pub/sub SSE events| IA1
    REDIS -->|pub/sub SSE events| IA2

    OA1 -->|OTLP traces| AM
    IA1 -->|OTLP traces| AM
    PA1 -->|OTLP traces| AM

    KV -->|secrets| OrchestratorCA
    KV -->|secrets| InventoryCA
    ENTRA -->|JWKS| InventoryCA
    ENTRA -->|tokens| OrchestratorCA

Production Runbook: Debugging a Failed Task

With observability in place, here is the exact process for debugging a task that failed in production:

Step 1: Find the trace. Get the Task ID from the error report or the client. Search Azure Monitor logs for that Task ID:

// Azure Monitor Log Analytics query
traces
| where customDimensions.taskId == "task-abc-001"
| order by timestamp asc
| project timestamp, cloud_RoleName, message, customDimensions

Step 2: Get the trace ID from the first log entry and open the distributed trace. In Azure Monitor Application Insights, navigate to the trace. You will see every span across the Orchestrator, Inventory Agent, and any MCP tool calls, with timing and error details on each.

Step 3: Check the task state in Redis directly if needed.

redis-cli -h a2a-redis.redis.cache.windows.net -p 6380 --tls -a $REDIS_PASSWORD \
  GET "a2a:task:task-abc-001" | python3 -m json.tool

Step 4: Check the audit log for the RBAC decision if the failure was a 403.

traces
| where customDimensions.taskId == "task-abc-001"
| where customDimensions.rpcError == "-32001"
| project timestamp, customDimensions.agentId, customDimensions.scopes, customDimensions.detectedSkill

Complete Production Checklist

Before calling your A2A system production-ready, verify every item across all eight parts of this series:

  • Agent Cards served at /.well-known/agent.json and signed with RS256
  • All A2A endpoints behind TLS 1.3 (mTLS for high-security environments)
  • JWT verification using JWKS endpoint with issuer and audience validation
  • RBAC enforced per skill with minimum required OAuth2 scopes
  • Rate limiting on all task endpoints
  • Structured audit logging on every request including access denials
  • OpenTelemetry instrumentation with W3C trace context propagation across all agents
  • Trace correlation in all structured log entries
  • Task state persisted to Redis with TTL and pub/sub cross-instance SSE broadcasting
  • Task timeout enforcement with automatic failure on deadline exceeded
  • Health and readiness probes that verify downstream dependencies
  • Containerized with non-root user and Docker health check
  • Secrets stored in Azure Key Vault, never in environment files
  • MCP connections established before accepting A2A traffic, with graceful shutdown
  • HTTP-triggered autoscaling configured on Container Apps

Series Complete

Over eight parts, we built a complete enterprise multi-agent system from protocol fundamentals to production deployment. Here is the full path we covered:

  1. Part 1: What is A2A, why it exists, and how it compares to MCP
  2. Part 2: Agent Cards, JSON-RPC message formats, task lifecycle, and SSE streaming
  3. Part 3: Complete A2A agent server in Node.js
  4. Part 4: The same agent server in Python (FastAPI) and C# (ASP.NET Core)
  5. Part 5: Orchestrator with Agent Card fetching, skill-based routing, and multi-turn handling
  6. Part 6: JWT verification, OAuth2 client credentials, mTLS, RBAC, and Agent Card signing
  7. Part 7: MCP and A2A together in a unified agentic stack
  8. Part 8: OpenTelemetry tracing, structured logging, Redis scaling, and Azure deployment

Everything in this series targets the A2A specification as it stands in March 2026 at version 0.3. The protocol is still evolving under Linux Foundation governance. The GitHub repository at github.com/a2aproject/A2A is the authoritative source for specification updates, and the community is active. If you find gaps or have improvements, contributions are open.
