Azure AI Foundry has become the first cloud platform to offer both OpenAI GPT and Anthropic Claude models under one roof. This isn’t just about having more options. It’s about intelligently routing requests to the right model for each task, optimizing costs while maintaining quality, and building systems that can adapt as new models become available. In this post, we’ll explore how to integrate multiple model providers effectively and leverage the model router to automate these decisions.
The Multi-Model Reality
Different models excel at different tasks. GPT-5 provides exceptional multimodal reasoning and broad knowledge. Claude Sonnet 4.5 dominates in coding and long-form text analysis. Claude Haiku 4.5 delivers near-frontier performance at one-third the price of Sonnet. Cohere specializes in enterprise search and retrieval. The challenge isn’t choosing one model. It’s orchestrating multiple models to handle diverse workloads efficiently.
Traditional approaches force you to hard-code model selection in your application. This creates brittleness. When new models arrive or pricing changes, you need to update code across your entire system. Model router solves this by making model selection a runtime decision based on prompt characteristics, cost constraints, and performance requirements.
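To make the contrast concrete, here is the kind of hard-coded selection logic the router replaces. This is a minimal sketch; the deployment names are illustrative.

# Brittle: every new model or price change means editing and redeploying this function
def pick_model(task_type: str) -> str:
    if task_type == "coding":
        return "claude-sonnet-4-5"
    if task_type == "simple":
        return "gpt-4o-mini"
    return "gpt-4o"

# With model router, the application targets a single deployment
# and the routing decision happens server-side at request time.
MODEL_DEPLOYMENT = "model-router"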
Available Models in Azure AI Foundry
Azure AI Foundry provides access to over 11,000 models, but let’s focus on the frontier models you’ll use most often for production applications.
OpenAI Models
The GPT family includes several models optimized for different scenarios. GPT-5 represents the latest generation with enhanced reasoning, multimodal capabilities supporting both text and vision input, and a massive 200K token context window. It uses 60% fewer reasoning tokens than its predecessor, improving both latency and cost.
GPT-4o remains available for workloads requiring strong performance at lower costs than GPT-5. GPT-4o mini provides an economical option for high-volume scenarios where frontier capabilities aren’t required.
The o-series models (o1, o3) focus on extended reasoning for complex problem-solving. These models spend more time thinking before responding, making them ideal for mathematics, coding challenges, and analytical tasks where accuracy matters more than speed.
Anthropic Claude Models
Claude Sonnet 4.5 excels at coding, agentic workflows, and multi-step reasoning. It has become the preferred model for autonomous coding tasks and complex agent orchestration. Organizations building AI coding assistants or agents that need to coordinate multiple steps consistently choose Sonnet.
Claude Opus 4.1 handles specialized reasoning and long-horizon problem-solving. Use it for tasks requiring deep analysis over extended context, such as legal research, scientific analysis, or financial modeling. Opus provides the most thorough reasoning in the Claude family.
Claude Haiku 4.5 offers the fastest response times at the lowest cost. It delivers near-frontier performance for customer support, content moderation, real-time coding assistance, and sub-agent orchestration. For high-volume applications where cost matters, Haiku can reduce operating expenses by 70% compared to Sonnet.
Other Notable Models
Cohere models provide enterprise-grade retrieval and classification capabilities. Their Command R family excels at RAG scenarios with superior citation accuracy.
DeepSeek R1 offers reasoning performance comparable to OpenAI's o-series at significantly lower cost, making it attractive for budget-conscious deployments.
Meta Llama models provide open-source alternatives with commercial licenses, enabling on-premises deployment scenarios.
Model Router Architecture
Model router acts as an intelligent proxy between your application and the underlying models. When your application sends a request, the router analyzes the prompt, evaluates available models, and selects the best option based on your configured criteria.
The router considers several factors when making decisions. Prompt complexity gets analyzed to determine if a task needs frontier reasoning or can be handled by a smaller model. Cost constraints ensure you stay within budget by preferring economical models when quality differences are negligible. Performance requirements balance latency against quality when real-time responses matter. Model availability provides automatic fallback if the primary model reaches capacity limits or becomes temporarily unavailable.
How Model Selection Works
The router uses heuristics and learned patterns to classify prompts. Simple queries like factual questions or basic formatting get routed to efficient models like GPT-4o mini or Claude Haiku. Complex reasoning tasks involving mathematics, multi-step logic, or deep analysis go to frontier models like GPT-5 or Claude Opus.
Coding tasks prefer Claude Sonnet due to its superior code generation and debugging capabilities. Long-form content generation balances between Claude models for nuanced writing and GPT models for broad creative tasks.
The router learns from actual usage patterns. As your application processes requests, the system tracks which model selections produce the best results and adjusts its decision-making accordingly.
Model Router Implementation
Let’s implement model router in a practical application. We’ll build a system that uses multiple models intelligently.
Deployment Configuration
First, deploy the models you want to use. Navigate to your Azure AI Foundry resource and deploy each model from the Models + endpoints page. For this example, we’ll deploy GPT-4o, GPT-4o mini, Claude Sonnet 4.5, and Claude Haiku 4.5.
Create a model router deployment that references these models. The router needs to know which models are available and how to prioritize them.
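If you prefer scripting over the portal, the router deployment can be created with the Azure CLI. This is a sketch: the resource names are placeholders, and the model version shown is an assumption, so check the model catalog for the current model-router version.

az cognitiveservices account deployment create \
  --resource-group my-rg \
  --name your-foundry-resource \
  --deployment-name model-router \
  --model-name model-router \
  --model-format OpenAI \
  --model-version "2025-08-07" \
  --sku-name GlobalStandard \
  --sku-capacity 50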
Here’s the Python code to configure and use model router:
from azure.ai.inference import ChatCompletionsClient
from azure.identity import DefaultAzureCredential

# Initialize the client against the Foundry inference endpoint (note the /models path)
endpoint = "https://your-foundry-resource.services.ai.azure.com/models"
credential = DefaultAzureCredential()
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=credential,
    credential_scopes=["https://cognitiveservices.azure.com/.default"]
)
# Use model router deployment
model_router_deployment = "model-router"

# Simple query - will likely route to a cheaper model
simple_response = client.complete(
    model=model_router_deployment,
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

print(f"Simple query routed to: {simple_response.model}")
print(f"Response: {simple_response.choices[0].message.content}")

# Complex reasoning - will likely route to a frontier model
complex_response = client.complete(
    model=model_router_deployment,
    messages=[
        {"role": "user", "content": """
I have a distributed system with 5 microservices. Service A calls B and C in parallel,
B calls D, C calls E, and D also calls E. If each service has 99.9% availability,
what's the overall system availability? Show your reasoning.
"""}
    ]
)

print(f"Complex query routed to: {complex_response.model}")
print(f"Response: {complex_response.choices[0].message.content}")

# Coding task - should prefer Claude Sonnet
coding_response = client.complete(
    model=model_router_deployment,
    messages=[
        {"role": "user", "content": """
Write a Python function that implements a retry mechanism with exponential backoff.
Include type hints, error handling, and async support.
"""}
    ]
)

print(f"Coding query routed to: {coding_response.model}")
print(f"Response: {coding_response.choices[0].message.content}")

The model router handles selection automatically. Your application code stays clean and doesn't need to implement model selection logic.
Node.js Implementation
Here’s the same pattern in Node.js:
import { AzureOpenAI } from "openai";
import { DefaultAzureCredential, getBearerTokenProvider } from "@azure/identity";

const endpoint = "https://your-foundry-resource.services.ai.azure.com";
const credential = new DefaultAzureCredential();
// The AzureOpenAI client authenticates via an Entra ID token provider
const azureADTokenProvider = getBearerTokenProvider(
  credential,
  "https://cognitiveservices.azure.com/.default"
);

const client = new AzureOpenAI({
  endpoint,
  azureADTokenProvider,
  apiVersion: "2024-10-21" // adjust to your target API version
});

const modelRouter = "model-router";

// Function to query with automatic routing
async function queryWithRouter(prompt, taskType = "general") {
  const response = await client.chat.completions.create({
    model: modelRouter,
    messages: [
      { role: "user", content: prompt }
    ],
    temperature: taskType === "coding" ? 0.2 : 0.7
  });

  console.log(`Task: ${taskType}`);
  console.log(`Routed to model: ${response.model}`);
  console.log(`Response: ${response.choices[0].message.content}\n`);

  return response;
}

// Examples of different task types
await queryWithRouter(
  "Explain quantum entanglement in simple terms",
  "explanation"
);

await queryWithRouter(
  "Implement a binary search tree in TypeScript with insert and delete operations",
  "coding"
);

await queryWithRouter(
  "What are the business hours for typical retail stores?",
  "factual"
);

C# Implementation
For C# applications:
using Azure;
using Azure.AI.Inference;
using Azure.Identity;

// Note the /models path on the Foundry inference endpoint
var endpoint = new Uri("https://your-foundry-resource.services.ai.azure.com/models");
var credential = new DefaultAzureCredential();
var client = new ChatCompletionsClient(endpoint, credential);

var modelRouter = "model-router";

// Simple query
var simpleRequest = new ChatCompletionsOptions
{
    Messages =
    {
        new ChatRequestUserMessage("What is photosynthesis?")
    },
    Model = modelRouter
};

var simpleResponse = await client.CompleteAsync(simpleRequest);
Console.WriteLine($"Simple query routed to: {simpleResponse.Value.Model}");
Console.WriteLine($"Response: {simpleResponse.Value.Choices[0].Message.Content}");

// Complex analysis
var analysisRequest = new ChatCompletionsOptions
{
    Messages =
    {
        new ChatRequestUserMessage(@"
Analyze the time complexity of quicksort in best, average, and worst cases.
Explain why the worst case occurs and how to mitigate it.
")
    },
    Model = modelRouter
};

var analysisResponse = await client.CompleteAsync(analysisRequest);
Console.WriteLine($"Analysis routed to: {analysisResponse.Value.Model}");
Console.WriteLine($"Response: {analysisResponse.Value.Choices[0].Message.Content}");

Model Router in Agent Service
The real power of model router appears when integrated with Foundry Agent Service. Agents can automatically select different models for different steps in their workflow.
Consider an agent that helps developers debug code. It might use Claude Sonnet to analyze the code and identify issues, then switch to Claude Haiku for generating test cases, and use GPT-4o for explaining the fixes in natural language. All of this happens automatically based on the task at each step.
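Here is a minimal sketch of that flow, reusing the Python client from earlier. The helper function and prompts are illustrative; every call goes through the same router deployment, which picks the model per step.

def debug_assistant(client, code_snippet: str) -> dict:
    """Three-step debugging flow; the router selects a model for each call."""
    router = "model-router"

    analysis = client.complete(
        model=router,
        messages=[{"role": "user", "content": f"Analyze this code and identify the bugs:\n{code_snippet}"}]
    )
    tests = client.complete(
        model=router,
        messages=[{"role": "user", "content": f"Write unit tests that cover these issues:\n{analysis.choices[0].message.content}"}]
    )
    explanation = client.complete(
        model=router,
        messages=[{"role": "user", "content": f"Explain these findings and fixes in plain language:\n{analysis.choices[0].message.content}"}]
    )

    # Each response reports which underlying model actually served it
    return {
        "analysis": analysis.model,
        "tests": tests.model,
        "explanation": explanation.model,
    }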
Here’s a diagram showing how model router integrates with agent workflows:
flowchart TB
    User[User Request] --> Agent[Foundry Agent Service]
    Agent --> Router{Model Router}
    Router -->|Simple Query| Haiku["Claude Haiku 4.5<br/>Fast & Economical"]
    Router -->|Coding Task| Sonnet["Claude Sonnet 4.5<br/>Best for Code"]
    Router -->|Complex Reasoning| Opus["Claude Opus 4.1<br/>Deep Analysis"]
    Router -->|General Task| GPT4o["GPT-4o<br/>Balanced Performance"]
    Router -->|Frontier Reasoning| GPT5["GPT-5<br/>Latest Capabilities"]
    Haiku --> Response[Formatted Response]
    Sonnet --> Response
    Opus --> Response
    GPT4o --> Response
    GPT5 --> Response
    Response --> Agent
    Agent --> User
    subgraph Monitoring[" "]
        Router -.->|Metrics| Metrics[Azure Monitor]
        Router -.->|Cost Tracking| Costs[Cost Management]
    end
Cost Optimization with Model Router
The primary benefit of model router is cost reduction without sacrificing quality. Let's walk through an illustrative scenario.
Assume you have 1 million requests per day with the following distribution: 40% simple queries, 30% moderate complexity, 20% coding tasks, and 10% complex reasoning.
With a single-model approach using GPT-4o for everything, you might process an average of 500 tokens per request at $5 per million input tokens. That’s $2,500 per day or $75,000 per month.
With model router automatically selecting appropriate models, the same workload breaks down differently. Simple queries route to Claude Haiku 4.5 at $1 per million input tokens (one-third the price of Sonnet, as noted earlier). Moderate complexity uses GPT-4o mini at $0.15 per million tokens. Coding tasks go to Claude Sonnet at $3 per million tokens. Complex reasoning uses GPT-5 at $5 per million tokens.
The new cost comes to approximately $770 per day, or around $23,000 per month. That's a reduction of nearly 70% in input-token costs with no degradation in quality for individual tasks.
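Here is the arithmetic behind those figures as a quick sanity check, using the illustrative prices quoted above:

requests_per_day = 1_000_000
tokens_per_request = 500
total_tokens = requests_per_day * tokens_per_request  # 500M input tokens/day

baseline_price = 5.00  # GPT-4o for everything, $ per million input tokens

# (traffic share, $ per million input tokens)
mix = {
    "simple -> Claude Haiku 4.5": (0.40, 1.00),
    "moderate -> GPT-4o mini": (0.30, 0.15),
    "coding -> Claude Sonnet 4.5": (0.20, 3.00),
    "complex -> GPT-5": (0.10, 5.00),
}

baseline_cost = total_tokens / 1e6 * baseline_price
routed_cost = sum(share * total_tokens / 1e6 * price for share, price in mix.values())

print(f"Baseline: ${baseline_cost:,.2f}/day")  # $2,500.00
print(f"Routed:   ${routed_cost:,.2f}/day")    # $772.50
print(f"Savings:  {1 - routed_cost / baseline_cost:.0%}")  # 69%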
Monitoring and Observability
Model router provides detailed telemetry about its decisions. Navigate to Monitoring > Metrics in the Azure portal and filter by your model router deployment name.
Split metrics by underlying model to see the distribution of requests. This helps you understand which models handle which percentage of your workload. You can identify patterns and optimize your model selection strategy accordingly.
Cost analysis shows the actual spending by model. This makes it easy to validate that your routing strategy achieves the expected cost savings.
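You can pull the same breakdown programmatically. Here's a sketch with the Azure CLI; the metric and dimension names are assumptions, so verify them against the metric definitions exposed by your resource:

az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.CognitiveServices/accounts/your-foundry-resource" \
  --metric "ProcessedPromptTokens" \
  --interval PT1H \
  --dimension ModelDeploymentName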
Authentication and Access Control
Model router supports both API key authentication and Microsoft Entra ID. For production systems, use Entra ID with managed identities.
The Azure AI User and Cognitive Services User roles include all required permissions for invoking models through the router. For more restrictive access, create custom roles with specific data actions.
Each model in the router can have independent access controls. This allows you to grant certain applications access to expensive frontier models while restricting others to economical options.
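A sketch of what such a custom role could look like; the data action string is an assumption to confirm against the Microsoft.CognitiveServices provider operations list:

{
  "Name": "Foundry Chat Invoker",
  "IsCustom": true,
  "Description": "Allow invoking chat completions on Foundry deployments and nothing else",
  "Actions": [],
  "DataActions": [
    "Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action"
  ],
  "AssignableScopes": ["/subscriptions/<subscription-id>"]
}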
Fallback and Resilience Patterns
Model router automatically handles failures and capacity constraints. If the selected model is unavailable or returns an error, the router tries alternative models that can handle the task.
You can implement additional resilience in your application code:
import random
import time

from azure.core.exceptions import ServiceRequestError, HttpResponseError

def query_with_retry(client, model, messages, max_retries=3):
    """Query with exponential backoff retry logic."""
    for attempt in range(max_retries):
        try:
            response = client.complete(
                model=model,
                messages=messages,
                timeout=30
            )
            return response
        except ServiceRequestError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with a little jitter
            wait_time = (2 ** attempt) + (random.random() * 0.1)
            print(f"Request failed, retrying in {wait_time:.2f}s...")
            time.sleep(wait_time)
        except HttpResponseError as e:
            if e.status_code == 429:  # Rate limited
                if attempt == max_retries - 1:
                    raise
                wait_time = (2 ** attempt) * 2
                print(f"Rate limited, waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

# Usage
response = query_with_retry(
    client,
    "model-router",
    [{"role": "user", "content": "Your prompt here"}]
)

Best Practices
Start with a diverse model portfolio. Deploy at least one economical model (Claude Haiku or GPT-4o mini), one balanced model (GPT-4o or Claude Sonnet), and one frontier model (GPT-5 or Claude Opus). This gives the router flexibility to optimize across cost and quality dimensions.
Monitor routing patterns for the first few weeks. Analyze which prompts go to which models and verify the router makes sensible decisions. Adjust your model lineup if you notice inefficiencies.
Set up cost alerts in Azure Cost Management. Even with model router optimizing costs, unexpected usage spikes can occur. Configure alerts at meaningful thresholds.
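For example, a monthly budget created with the Azure CLI (names, amounts, and dates are placeholders):

az consumption budget create \
  --resource-group my-rg \
  --budget-name foundry-monthly-budget \
  --amount 40000 \
  --category cost \
  --time-grain Monthly \
  --start-date 2025-11-01 \
  --end-date 2026-10-31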
Test prompt engineering across models. Different models respond differently to the same prompt. Optimize your system prompts for the models you use most frequently.
Update models regularly. New model versions often provide better performance at the same or lower cost. Azure AI Foundry makes it easy to deploy new models and include them in your router configuration.
What’s Next
In the next post, we’ll explore custom model training and fine-tuning workflows. We’ll cover when to fine-tune versus using prompt engineering, how to prepare training data effectively, and strategies for evaluating fine-tuned models against base models. We’ll also look at the cost implications and best practices for maintaining fine-tuned models in production.
Model router and multi-model integration represent a fundamental shift in how we build AI applications. Instead of betting on a single model provider, you can leverage the strengths of each model for the tasks they handle best. Azure AI Foundry makes this practical and economical.
References
- Introducing Anthropic’s Claude models in Microsoft Foundry – Azure Blog
- How to use model router for Microsoft Foundry – Microsoft Learn
- Foundry Models – Microsoft Azure
- Claude now available in Microsoft Foundry and Microsoft 365 Copilot – Anthropic
- Build and scale AI agents with Microsoft Foundry – Azure Blog
