Azure AI Foundry Deep Dive Series Part 2: Building Production AI Applications with Enterprise Architecture
Building a proof of concept is one thing. Taking that POC to production is an entirely different challenge. In this post, we’ll walk through the architecture patterns, design decisions, and implementation strategies for deploying production-ready AI applications using Azure AI Foundry. We’ll cover everything from basic chat applications to enterprise-grade systems with full security, observability, and cost controls.

Understanding Production Requirements

Production AI applications have requirements that go far beyond getting a model to return responses. You need reliability guarantees, security controls, cost optimization, observability, and compliance. Azure AI Foundry provides two reference architectures that address these needs at different levels: a basic architecture for learning and proof of concepts, and a baseline architecture for production deployments.

The gap between these two architectures reveals what production systems actually require. Let’s start with the basics and progressively build up to enterprise-grade patterns.

Basic Architecture: Learning the Components

The basic Azure AI Foundry chat architecture provides a foundation for understanding how the pieces fit together. It includes a client UI running in Azure App Service, with an agent hosted in Foundry Agent Service orchestrating the workflow from incoming prompts to data stores.

Here’s how the request flow works. A user interacts with the web application via HTTPS to the App Service default domain on azurewebsites.net. Azure fully manages the TLS certificate. Easy Auth ensures authentication via Microsoft Entra ID before the request reaches your application code.

When the web application receives a user query, it invokes the agent through the Azure AI Agent SDK. The agent processes the request based on its system prompt and has access to a configured language model, connected tools, and knowledge stores. For retrieval-augmented generation patterns, the agent connects to Azure AI Search to fetch relevant indexed data.
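The grounding step at the heart of this pattern can be sketched without any SDK: retrieved chunks from Azure AI Search are folded into the prompt before it reaches the model. The function and record shapes below are illustrative, not part of the Azure AI Agent SDK.

```python
# Sketch of the grounding step in a RAG flow: retrieved chunks become
# context in the system message sent to the model. All names here are
# illustrative, not actual Azure SDK calls.

def build_grounded_prompt(system_prompt: str, query: str, chunks: list[str]) -> list[dict]:
    """Assemble a chat payload with retrieved context as grounding data."""
    context = "\n\n".join(f"[source {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return [
        {"role": "system",
         "content": f"{system_prompt}\n\nAnswer using only this context:\n{context}"},
        {"role": "user", "content": query},
    ]

messages = build_grounded_prompt(
    "You are a support assistant.",
    "How do I reset my password?",
    ["Passwords can be reset from the account settings page."],
)
print(len(messages))  # 2
```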

Foundry Agent Service then connects to an Azure OpenAI model deployed in Foundry Models and sends the prompt with relevant grounding data and chat context. Application Insights logs information about the original request and agent interactions for debugging and monitoring.

This architecture runs in a single region and uses the App Service Basic tier. It deploys models using the Global Standard configuration with pay-as-you-go pricing. Foundry Agent Service runs as a fully Microsoft-hosted solution where Microsoft manages dependent services like Cosmos DB, Storage, and AI Search on your behalf.

What the Basic Architecture Doesn’t Include

Understanding what’s missing from the basic architecture helps clarify what production systems require. The basic tier lacks availability zone support, meaning your instance becomes unavailable if there are problems with the instance, rack, or datacenter. There’s no autoscaling configured, so you need to overprovision compute to avoid capacity issues.

Security measures are minimal. Public endpoints remain exposed without Web Application Firewall protection or DDoS protection. Network isolation doesn’t exist, as all communication happens over public networks. There are no Azure Policy configurations enforcing governance or compliance requirements.

Cost controls are absent. Without budgets, alerts, or usage monitoring configured, you risk unexpected bills. The architecture assumes limited model calls without mechanisms to prevent overruns.

Baseline Production Architecture

The baseline architecture addresses all these gaps with enterprise-grade security, compliance, and control. This is what you actually deploy to production. The architecture uses the Foundry Agent Service standard setup where you bring your own network for network isolation and your own Azure resources to store chat and agent state.

Network Architecture and Security

All communication between application components and Azure services occurs over private endpoints, ensuring data traffic remains within your workload’s virtual network. All outbound traffic from agents is routed through Azure Firewall, which enforces egress rules.

The request flow starts with Azure Application Gateway as the entry point. Web Application Firewall inspects these requests before forwarding them to the backend App Service. This provides protection against common web vulnerabilities and attacks.

The web application communicates with the agent via private endpoint and authenticates to Azure AI Foundry using managed identity. No credentials are stored in your application code or configuration. The agent connects to knowledge stores like Azure AI Search in the private network via private endpoints.

Azure Firewall serves as the egress control point, allowing you to enforce strict rules about what external services your agents can access. This becomes critical when agents need to call external APIs or tools, as you can whitelist specific endpoints while blocking everything else.
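The egress model amounts to an explicit allowlist check on every outbound call. The sketch below illustrates the concept in application terms; it is not Azure Firewall rule syntax, and the hostnames are placeholders.

```python
# Illustrative model of Azure Firewall egress control: only hosts on an
# explicit allowlist may be reached by agent tool calls. Conceptual
# sketch only -- not actual Firewall rule configuration.
from urllib.parse import urlparse

ALLOWED_EGRESS_HOSTS = {"api.partner.example.com", "graph.microsoft.com"}

def egress_permitted(url: str) -> bool:
    """Return True only if the URL's host is explicitly allowlisted."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_EGRESS_HOSTS

print(egress_permitted("https://graph.microsoft.com/v1.0/me"))  # True
print(egress_permitted("https://evil.example.net/exfil"))       # False
```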

High Availability and Reliability

The baseline architecture leverages Azure availability zones for all components that support them. App Service deploys across multiple zones, ensuring your application remains available even if an entire datacenter fails. Azure Application Gateway similarly spans zones for resilient traffic routing.

Autoscaling configuration allows your application to handle traffic spikes without manual intervention. Set minimum and maximum instance counts based on your expected load patterns. Configure scale-out rules based on CPU utilization, memory pressure, or custom metrics like request queue length.
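The scale-out/scale-in decision the platform applies can be modeled as a simple bounded rule. The thresholds and instance bounds below are illustrative defaults, not recommended values; in practice this logic lives in Azure Monitor autoscale settings, not application code.

```python
# Sketch of a CPU-based autoscale decision with min/max bounds, like
# the rules configured on App Service. Threshold values are
# illustrative placeholders.

def next_instance_count(current: int, cpu_percent: float,
                        min_instances: int = 2, max_instances: int = 10,
                        scale_out_at: float = 70.0, scale_in_at: float = 30.0) -> int:
    """Scale out above the high CPU threshold, scale in below the low
    one, always staying within the configured min/max bounds."""
    if cpu_percent > scale_out_at:
        return min(current + 1, max_instances)
    if cpu_percent < scale_in_at:
        return max(current - 1, min_instances)
    return current
```

Setting the minimum above one instance keeps a warm baseline (no cold starts), while the maximum caps the worst-case bill.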

For the standard agent setup, you provision and manage service dependencies in your own Azure subscription. This includes Azure Cosmos DB for agent state, Azure Storage for files and conversation history, and Azure AI Search for knowledge indexing. Having dedicated resources gives you full control over capacity planning and performance tuning.

Identity and Access Management

Managed identities eliminate the need for credentials in your code. Your App Service has a system-assigned managed identity that authenticates to Azure AI Foundry. The Foundry project uses its managed identity to access the deployed Azure OpenAI model.

Azure RBAC controls who can perform management operations versus data plane operations. Management operations like creating deployments and projects require control plane permissions. Building agents, running evaluations, and uploading files require data plane permissions. This separation ensures developers can work on agents without having permission to modify infrastructure.
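The control-plane/data-plane split can be pictured as a mapping from operations to the permission plane they require. The operation names below are examples for illustration, not actual Azure role definitions or action strings.

```python
# Illustrative mapping of operations to control plane vs data plane,
# mirroring the RBAC separation described above. Operation names are
# examples, not real Azure RBAC actions.

CONTROL_PLANE_OPS = {"create_deployment", "create_project", "modify_network"}
DATA_PLANE_OPS = {"build_agent", "run_evaluation", "upload_file"}

def required_plane(operation: str) -> str:
    """Return which permission plane an operation needs."""
    if operation in CONTROL_PLANE_OPS:
        return "control"
    if operation in DATA_PLANE_OPS:
        return "data"
    raise ValueError(f"unknown operation: {operation}")
```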

Azure Policy integration enforces organizational standards. Common policies include requiring private endpoints for all Azure AI services, mandating customer-managed encryption keys, blocking preview models in production, and enforcing specific networking configurations.

Architecture Diagram Representation

The following diagram shows the baseline production architecture:

```mermaid
flowchart TB
    subgraph Internet
        User[User/Client]
    end

    subgraph Azure["Azure Subscription"]
        subgraph FrontEnd["Front-End Tier"]
            AppGW[Azure Application Gateway<br/>with WAF]
            AppService[Azure App Service<br/>Chat UI]
        end

        subgraph AgentTier["Agent Orchestration Tier"]
            Foundry[Azure AI Foundry<br/>Project]
            AgentService[Foundry Agent Service<br/>Standard Setup]
        end

        subgraph DataTier["Data and Knowledge Tier"]
            CosmosDB[(Azure Cosmos DB<br/>Agent State)]
            Storage[(Azure Storage<br/>Files and History)]
            AISearch[(Azure AI Search<br/>Vector Store)]
        end

        subgraph ModelTier["Model Tier"]
            OpenAI[Azure OpenAI<br/>GPT Models]
            Claude[Anthropic Claude<br/>Models]
        end

        subgraph NetworkSec["Network Security"]
            Firewall[Azure Firewall<br/>Egress Control]
            PrivateLink[Private Endpoints]
        end

        subgraph Monitoring["Monitoring & Observability"]
            AppInsights[Application Insights]
            Monitor[Azure Monitor]
        end
    end

    User -->|HTTPS| AppGW
    AppGW -->|Private Network| AppService
    AppService -->|Private Endpoint| Foundry
    Foundry -->|Managed Identity| AgentService
    AgentService -->|Private Endpoint| CosmosDB
    AgentService -->|Private Endpoint| Storage
    AgentService -->|Private Endpoint| AISearch
    AgentService -->|Private Endpoint| OpenAI
    AgentService -->|Private Endpoint| Claude
    AgentService -->|Egress Traffic| Firewall
    AppService -.->|Telemetry| AppInsights
    AgentService -.->|Metrics| Monitor
    Firewall -->|Controlled Egress| Internet
```

This diagram illustrates the complete production architecture with all security boundaries and communication paths clearly defined.

Cost Optimization Strategies

Production systems need cost controls to prevent budget overruns. The most expensive components in a typical deployment are Azure Cosmos DB, Azure AI Search, and DDoS Protection. Other notable costs include the chat UI compute and Application Gateway.

Right-Sizing Compute Resources

Start with the smallest SKU that meets your performance requirements and scale up based on actual usage. For App Service, begin with the S1 tier for production workloads. It provides availability zone support and sufficient resources for moderate traffic.

Configure autoscaling with appropriate thresholds. Set your minimum instance count to handle baseline traffic without cold starts. Set your maximum based on cost constraints rather than unlimited scaling. Monitor your scaling metrics to identify optimization opportunities.

Model Deployment Optimization

Use pay-as-you-go pricing for unpredictable workloads where usage varies significantly. Switch to provisioned throughput when usage patterns become stable and predictable. The break-even point typically occurs around 40% sustained utilization of purchased capacity.

Consider combining both deployment types. Establish a reliable baseline with provisioned throughput units (PTUs) and handle traffic spikes with pay-as-you-go capacity. This hybrid approach optimizes for both cost and performance.
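The break-even arithmetic is worth making concrete. In the sketch below, all prices are hypothetical placeholders chosen so the PTU commitment costs 40% of its capacity priced at pay-as-you-go rates, which puts the break-even at 40% utilization, in line with the rule of thumb above.

```python
# Back-of-the-envelope comparison of pay-as-you-go vs provisioned
# throughput (PTU). All prices here are hypothetical placeholders;
# the point is the shape of the calculation, not real Azure pricing.

def cheaper_deployment(monthly_tokens: float,
                       payg_price_per_1k: float,
                       ptu_monthly_cost: float,
                       ptu_capacity_tokens: float) -> str:
    """Pick the cheaper deployment for a given monthly token volume."""
    if monthly_tokens > ptu_capacity_tokens:
        return "hybrid"  # baseline on PTU, spillover on pay-as-you-go
    payg_cost = monthly_tokens / 1000 * payg_price_per_1k
    return "ptu" if ptu_monthly_cost < payg_cost else "payg"

# 100M-token PTU capacity, $0.01/1K pay-as-you-go, $400/month PTU:
# break-even sits at 40M tokens, i.e. 40% utilization.
print(cheaper_deployment(50_000_000, 0.01, 400, 100_000_000))  # ptu
print(cheaper_deployment(30_000_000, 0.01, 400, 100_000_000))  # payg
```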

The model router capability helps optimize costs automatically. Configure it to route simple queries to smaller, cheaper models like GPT-4o mini while sending complex reasoning tasks to frontier models like GPT-5 or Claude Opus. This can reduce token costs by 60-80% for workloads with mixed complexity.
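The routing idea reduces to a classification step in front of model selection. The heuristic below is a deliberately naive stand-in; the actual model router uses its own learned classification, and the marker phrases and word-count cutoff here are invented for illustration.

```python
# Conceptual sketch of cost-aware routing: cheap model for simple
# queries, frontier model for complex reasoning. The heuristic and
# cutoffs are illustrative, not how the model router actually decides.

def route_model(query: str) -> str:
    """Pick a model deployment name based on query complexity."""
    complex_markers = ("explain why", "step by step", "compare", "prove")
    is_complex = (len(query.split()) > 50
                  or any(m in query.lower() for m in complex_markers))
    return "gpt-5" if is_complex else "gpt-4o-mini"

print(route_model("What time is it?"))                            # gpt-4o-mini
print(route_model("Compare these two architectures step by step"))  # gpt-5
```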

Storage and Database Optimization

Foundry Agent Service manages the request unit allocation on Azure Cosmos DB. To reduce long-term costs, purchase reserved capacity aligned with your expected usage duration and volume. Reserved capacity requires upfront commitment but provides significant discounts.

For Azure AI Search, choose the appropriate tier based on your document volume and query patterns. The Basic tier handles up to 2GB of indexed data, sufficient for many applications. Standard tiers provide more capacity and replica options for high-availability scenarios.

Implement lifecycle policies on Azure Storage to automatically tier or delete old conversation histories. Most applications don’t need to retain every chat message indefinitely. Archive conversations older than 90 days to cool or archive storage tiers.
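The tiering rule above can be expressed as a simple age check. In practice this is declared as an Azure Storage lifecycle management policy rather than written in application code; the 365-day archive cutoff below is an illustrative addition on top of the 90-day cool rule from the text.

```python
# Sketch of the lifecycle rule described above: conversations older
# than 90 days move to cool storage (365 days for archive is an
# illustrative extra cutoff). Real deployments configure this as an
# Azure Storage lifecycle management policy.
from datetime import date, timedelta

def storage_tier(last_modified: date, today: date) -> str:
    """Choose a storage tier based on object age in days."""
    age = (today - last_modified).days
    if age > 365:
        return "archive"
    if age > 90:
        return "cool"
    return "hot"
```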

Monitoring and Budget Controls

Use Azure Cost Management to set budgets and create alerts for anomalies. Configure alerts at 50%, 80%, and 100% of your budget threshold. This gives you advance warning before costs spiral out of control.
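The tiered alert scheme is easy to model: each threshold fires once current spend crosses its share of the budget. Azure Cost Management evaluates this server-side; the sketch below just shows the logic.

```python
# Sketch of the 50%/80%/100% budget alert scheme. Azure Cost
# Management evaluates these thresholds server-side; this models
# the decision.

def triggered_alerts(spend: float, budget: float,
                     thresholds=(0.5, 0.8, 1.0)) -> list[int]:
    """Return the alert percentages the current spend has crossed."""
    return [int(t * 100) for t in thresholds if spend >= t * budget]

print(triggered_alerts(850, 1000))  # [50, 80]
```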

Track token usage through model telemetry. Analyze which queries consume the most tokens and optimize prompts to reduce costs. Sometimes rephrasing a system prompt or adjusting temperature settings can cut token usage by 20-30% without impacting quality.
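Finding the expensive queries is an aggregation over telemetry records. The record shape below is an assumption for illustration; real data would come from Application Insights or model deployment metrics.

```python
# Sketch of finding the most token-hungry query categories from
# telemetry. The record shape is illustrative; real records would
# come from Application Insights or model telemetry.
from collections import Counter

def top_token_consumers(records: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Sum prompt + completion tokens per query category, highest first."""
    totals = Counter()
    for r in records:
        totals[r["category"]] += r["prompt_tokens"] + r["completion_tokens"]
    return totals.most_common(n)
```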

Restrict playground usage to preproduction environments only. Developers experimenting in the Foundry portal can rack up costs quickly. Provide sandbox projects with spending limits for experimentation separate from production resources.

Operational Excellence

Production systems require operational processes that keep them running reliably. The baseline architecture includes several components that support operational excellence.

Monitoring and Observability

Application Insights provides distributed tracing for requests flowing through your system. You can see exactly how long each component takes to process a request and identify bottlenecks. Custom telemetry tracks business-specific metrics like successful agent interactions or knowledge retrieval performance.

Azure Monitor metrics are segmented by scope. Management and usage metrics appear at the top-level Foundry resource, while project-specific metrics like evaluation performance or agent activity are scoped to individual projects. This separation helps teams focus on relevant data without noise from other projects.

Configure alerts for critical conditions. Monitor model error rates, API throttling, agent timeout rates, and cost anomalies. Set up action groups to notify the appropriate teams when alerts fire.

Deployment and Release Management

Infrastructure as Code enables repeatable deployments. Use Bicep or Terraform to define your entire architecture. Store templates in source control and review changes through pull requests before applying them.

Implement blue-green deployments for zero-downtime updates. Deploy the new version to a staging slot in App Service, validate it works correctly, then swap slots to make it live. If issues appear, swap back to the previous version instantly.

Separate environments for development, staging, and production ensure you catch issues before they affect users. Use Azure landing zones to standardize environment creation and enforce consistent policies across all environments.

Security Considerations

Security permeates every layer of the production architecture. Beyond the network isolation and private endpoints already discussed, several additional controls strengthen your security posture.

Data Protection

By default, Azure services use Microsoft-managed encryption keys to encrypt data in transit and at rest using FIPS 140-2 compliant 256-bit AES encryption. For additional control, bring your own keys stored in Azure Key Vault. Customer-managed keys give you the ability to rotate keys, revoke access, and meet specific compliance requirements.

With the standard agent setup, you can bring your own storage for thread and message data. This ensures data is isolated by project within your storage account rather than shared Microsoft-managed multi-tenant storage.

Content Safety

Foundry Agent Service enforces content safety automatically. It screens both user inputs and model outputs for harmful content categories including violence, hate speech, sexual content, and self-harm. Configure content filters with appropriate severity thresholds for your application.
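Conceptually, the filter compares a per-category severity score against a configured threshold. The severity scale and threshold values below are simplified placeholders, not the actual Azure content filter configuration.

```python
# Illustrative severity check mirroring per-category content filter
# thresholds. Categories match those listed above; the severity scale
# and threshold values are simplified placeholders.

THRESHOLDS = {"violence": 4, "hate": 2, "sexual": 4, "self_harm": 2}

def passes_content_filter(severities: dict[str, int]) -> bool:
    """Block if any category's severity meets or exceeds its threshold."""
    return all(severities.get(cat, 0) < limit for cat, limit in THRESHOLDS.items())

print(passes_content_filter({"violence": 1}))  # True
print(passes_content_filter({"hate": 3}))      # False
```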

Azure Content Safety provides additional capabilities for detecting jailbreak attempts, protected material detection, and prompt shields. Integrate these services to strengthen your defenses against adversarial inputs.

Compliance and Governance

Azure Policy enforces organizational standards at scale. Deploy policies that require specific configurations across all AI resources. Common policies include mandating private endpoints, requiring specific encryption settings, blocking certain model types in production, and enforcing tagging for cost allocation.

Use Azure RBAC to implement least-privilege access. Create custom roles that grant exactly the permissions needed for specific tasks. Regularly audit role assignments to ensure they remain appropriate.

Multi-Region Considerations

While the baseline architecture runs in a single region, production systems serving global users often require multi-region deployments. Azure Traffic Manager or Azure Front Door can route users to the nearest region for optimal latency.

Deploy independent instances of the entire architecture in each region. Use Azure Cosmos DB global distribution to replicate agent state across regions with configurable consistency levels. Consider eventual consistency for better performance in geographically distributed scenarios.

Model availability varies by region. Not all Azure OpenAI models are available in every region. Plan your regional deployments considering model availability, data residency requirements, and local compliance regulations.

What’s Next

We’ve covered the complete architecture for production AI applications on Azure AI Foundry. In the next post, we’ll dive into integrating multiple model providers including OpenAI and Anthropic Claude. We’ll explore the model router in detail, examine strategies for choosing the right model for each task, and implement fallback patterns when primary models are unavailable.

Building production systems requires attention to countless details. Azure AI Foundry provides the building blocks, but success depends on how you assemble them. The patterns we’ve discussed here apply across industries and use cases, giving you a solid foundation for your AI applications.
