Azure AI Foundry Deep Dive Series Part 4: Custom Model Training and Fine-Tuning Workflows

Foundation models are powerful generalists, but production applications often need specialists. Fine-tuning transforms a pretrained model into a domain expert by training it on your specific data. In this post, we’ll explore when to fine-tune rather than rely on prompt engineering, how to prepare training data effectively, and the practical steps for fine-tuning models in Azure AI Foundry. We’ll cover both serverless and managed compute approaches with working examples.

When to Fine-Tune vs Prompt Engineering

Prompt engineering gets you remarkably far. Adding examples to your prompts (few-shot learning), crafting clear instructions, and optimizing your system message can solve many problems without training. But fine-tuning becomes necessary when you hit specific limitations.

The context window constraint matters most. Even GPT-5’s 200K token context can’t hold thousands of examples. If your task requires learning from extensive datasets, fine-tuning trains the model weights themselves rather than cramming examples into each request.

Consistency requirements drive fine-tuning decisions. When you need every response to follow specific formatting, use particular terminology, or maintain a distinct voice, fine-tuning embeds these patterns into the model. Prompt engineering can drift as conversations evolve. Fine-tuned models stay consistent.

Cost optimization becomes significant at scale. Each prompt with 50 few-shot examples burns 10K+ tokens before the model even sees the actual query. Fine-tuning reduces this overhead dramatically: because the model has already learned your patterns, you include far fewer examples per request.
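
To make that overhead concrete, here’s a quick back-of-the-envelope calculation. The numbers are illustrative assumptions for a single workload, not published prices:

# Rough token-overhead comparison: few-shot prompting vs. a fine-tuned model.
# All figures below are illustrative assumptions.
few_shot_overhead_tokens = 10_000    # ~50 examples embedded in every prompt
fine_tuned_overhead_tokens = 500     # short system message only
requests_per_day = 100_000

daily_savings = (few_shot_overhead_tokens - fine_tuned_overhead_tokens) * requests_per_day
print(f"Input tokens saved per day:   {daily_savings:,}")       # 950,000,000
print(f"Input tokens saved per month: {daily_savings * 30:,}")  # 28,500,000,000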

Latency matters for real-time applications. Smaller fine-tuned models like GPT-4.1-nano can match larger models’ performance on specific tasks while responding much faster. A fine-tuned nano model costs 90% less per token than GPT-4o while delivering comparable quality for narrow use cases.

Fine-Tuning Techniques Available

Azure AI Foundry offers three primary fine-tuning techniques, each optimized for different scenarios.

Supervised Fine-Tuning (SFT)

Supervised fine-tuning trains models on input-output pairs, teaching them to produce desired responses for specific inputs. This foundational technique handles most use cases including domain specialization, task performance improvement, style and tone adjustment, instruction following, and language adaptation.

Start with SFT for most projects. It addresses the broadest range of scenarios and provides reliable results with clear training data. You prepare examples showing what input should produce what output, and the model learns these patterns.

Reinforcement Fine-Tuning (RFT)

Reinforcement fine-tuning goes beyond simple input-output mapping. It teaches models to reason through complex problems, learn from feedback, and improve decision-making over time. This technique works best for reasoning engines, adaptive workflows, dynamic response generation, and context-sensitive interactions.

RFT with o4-mini became available recently, making o4-mini the first reasoning model you can fine-tune in Azure. Organizations building AI systems that need to handle nuanced scenarios or incorporate custom decision rules benefit significantly from this approach.

DraftWise, a legal tech startup, used RFT to enhance contract generation and review. They fine-tuned Azure OpenAI models using proprietary legal data, achieving a 30% improvement in search result quality. This enabled lawyers to draft contracts faster while maintaining legal accuracy.

Direct Preference Optimization (DPO)

DPO trains models by showing them preferred versus non-preferred outputs for the same input. Instead of just showing correct answers, you provide examples of better and worse responses. The model learns to distinguish quality and align with human preferences.

This technique excels when you have subjective quality criteria that are hard to express as rules. Content quality, conversational naturalness, and safety alignment all benefit from DPO.
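
To make this concrete, here’s the general shape of a preference pair in JSONL. The field names follow the preference fine-tuning format used by the OpenAI-compatible API and may differ slightly between API versions, so check the current data preparation docs:

{"input": {"messages": [{"role": "user", "content": "Summarize our refund policy"}]}, "preferred_output": [{"role": "assistant", "content": "Refunds are available within 30 days of purchase with proof of receipt. Contact support to start the process."}], "non_preferred_output": [{"role": "assistant", "content": "You can probably get a refund if you ask, I think the window is about a month or so."}]}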

Choosing the Right Model to Fine-Tune

Model selection depends on your task complexity and budget constraints. For complex skills like language translation, domain adaptation, or advanced code generation, start with GPT-4.1. Its strong base capabilities make fine-tuning more effective.

For focused tasks such as classification, sentiment analysis, or content moderation, GPT-4.1-mini provides faster iteration and lower costs. It’s also ideal when distilling knowledge from more sophisticated models.

GPT-4.1-nano represents the most economical option for high-volume scenarios. Despite being small, it can match larger models’ performance on narrow tasks after fine-tuning. Organizations running millions of queries daily achieve substantial cost savings with fine-tuned nano models.

Meta’s Llama 4 Scout offers an open-source alternative with a remarkable 10 million token context window. Fine-tuning Llama 4 requires managed compute with your own GPUs but provides full control over the deployment environment.

Serverless vs Managed Compute

Azure AI Foundry provides two deployment modalities for fine-tuning. Understanding their tradeoffs helps you choose appropriately.

Serverless Fine-Tuning

Serverless fine-tuning uses Microsoft’s infrastructure with consumption-based pricing starting at $1.70 per million input tokens. Microsoft handles all infrastructure management, optimizes training for speed and scalability, and provides access to OpenAI models exclusively.

This approach requires no GPU quotas, making it accessible to organizations without substantial compute allocations. However, it offers fewer hyperparameter options than managed compute and limits you to models Microsoft has pre-configured for serverless deployment.

Training tiers provide flexibility. Standard training occurs in your Foundry resource’s region, ensuring data residency. Global training leverages capacity beyond your current region for lower costs and faster queue times. Developer tier (preview) offers significant cost savings by using idle capacity, though without latency guarantees or SLAs.

Managed Compute Fine-Tuning

Managed compute offers a wider range of models and advanced customization through Azure Machine Learning. You provide your own VMs for training and hosting, gaining full control over resources and access to models not available in serverless mode, including Llama 4 and other open-source options.

The tradeoff involves complexity. You need sufficient GPU quotas, which many organizations lack. You manage infrastructure scaling and optimization yourself. However, this control enables scenarios that serverless deployments can’t support, particularly when working with proprietary models or requiring specific hardware configurations.

Preparing Training Data

Training data quality determines fine-tuning success more than any other factor. Poor quality data negatively trains the model, reduces accuracy, and increases errors.

Data Format Requirements

For GPT models, training data must be in JSON Lines (JSONL) format and follow the Chat Completions message structure: one JSON object per line, delimited by newlines.

Here’s an example of properly formatted training data:

{"messages": [{"role": "system", "content": "You are a travel assistant helping customers plan vacations."}, {"role": "user", "content": "I want to visit Europe in the summer"}, {"role": "assistant", "content": "Europe in summer is wonderful! To help you plan the perfect trip, could you tell me: your budget range, how many days you have available, and whether you prefer cities, countryside, or a mix of both?"}]}
{"messages": [{"role": "system", "content": "You are a travel assistant helping customers plan vacations."}, {"role": "user", "content": "We have 2 weeks and $5000 for two people"}, {"role": "assistant", "content": "Great! With a $5000 budget for two weeks, you have excellent options. I'd recommend focusing on 2-3 countries to really experience them. Popular combinations include: Spain and Portugal for beaches and culture, Italy for history and cuisine, or France and Switzerland for diverse landscapes. Which sounds most appealing?"}]}

Each line represents a complete training example. Multi-turn conversations teach the model to maintain context across exchanges.

Data Quality Guidelines

Quantity matters significantly. The system technically supports as few as 10 examples, but this is far too few for real work. Microsoft recommends thousands of examples for effective tuning. If you think you have enough training data, you probably need more.

Diversity in your examples prevents overfitting. Include variations in how users might ask the same question. Show edge cases and unusual scenarios. Cover the full range of responses your model should produce.

Consistency in formatting and style ensures the model learns clear patterns. All examples should follow the same structure. If your assistant should always ask clarifying questions before making recommendations, demonstrate this pattern consistently across examples.

Balance across categories prevents bias. If you’re training a classification model, ensure equal representation of each class. For conversational models, balance short and long responses, simple and complex queries.
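
A lightweight pre-flight check catches most formatting problems and gives you basic counts before you spend money on a training run. Here’s a minimal validation sketch (the file name is a placeholder):

import json
from collections import Counter

def validate_jsonl(path):
    """Check that each training example parses and follows the chat format."""
    role_counts, problems, total = Counter(), [], 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            total += 1
            try:
                example = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {lineno}: invalid JSON ({e})")
                continue
            messages = example.get("messages", [])
            if not messages:
                problems.append(f"line {lineno}: missing 'messages'")
                continue
            if messages[-1].get("role") != "assistant":
                problems.append(f"line {lineno}: last message is not from the assistant")
            role_counts.update(m.get("role", "unknown") for m in messages)
    print(f"{total} examples, {len(problems)} problems, roles: {dict(role_counts)}")
    for p in problems[:20]:
        print("  ", p)

validate_jsonl("training_data.jsonl")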

Creating Training Data from Production

The best training data often comes from production usage. Azure AI Foundry’s stored completions feature captures model outputs during evaluation, providing real-world examples of how your system behaves.

Review these completions with subject matter experts. Mark high-quality responses as positive examples. Identify problematic outputs and create corrected versions. This process builds a dataset grounded in actual usage rather than hypothetical scenarios.
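
As a sketch of what that pipeline can look like, the snippet below converts a hypothetical list of reviewed completions into SFT-ready JSONL. The structure of the `reviewed` list is an assumption for illustration, not a stored completions API schema:

import json

# Hypothetical output of a human review pass over stored completions
reviewed = [
    {"user": "I want to visit Japan", "assistant": "Great choice! ...", "verdict": "approved"},
    {"user": "Plan a honeymoon trip", "assistant": "Sure. Where?", "verdict": "needs_rewrite",
     "corrected": "Congratulations! To tailor the trip, could you share your budget, dates, and whether you prefer beaches, cities, or both?"},
]

system_prompt = "You are a travel assistant helping customers plan vacations."

with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for item in reviewed:
        # Use the corrected answer when reviewers rewrote a weak response
        answer = item["corrected"] if item["verdict"] == "needs_rewrite" else item["assistant"]
        example = {"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": item["user"]},
            {"role": "assistant", "content": answer},
        ]}
        f.write(json.dumps(example, ensure_ascii=False) + "\n")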

Fine-Tuning Workflow Implementation

Let’s walk through the complete fine-tuning process with practical code examples.

Step 1: Upload Training Data

First, upload your prepared training data to Azure AI Foundry:

import os
from openai import AzureOpenAI

endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
api_key = os.environ["AZURE_OPENAI_API_KEY"]

# The file and fine-tuning APIs are exposed through the Azure OpenAI client,
# not the azure.ai.inference ChatCompletionsClient
client = AzureOpenAI(
    azure_endpoint=endpoint,
    api_key=api_key,
    api_version="2025-04-01-preview",
)

# Upload training file
with open("training_data.jsonl", "rb") as f:
    training_file = client.files.create(
        file=f,
        purpose="fine-tune"
    )

print(f"Training file uploaded: {training_file.id}")

# Upload validation file (optional but recommended)
with open("validation_data.jsonl", "rb") as f:
    validation_file = client.files.create(
        file=f,
        purpose="fine-tune"
    )

print(f"Validation file uploaded: {validation_file.id}")

Step 2: Create Fine-Tuning Job

Now create the fine-tuning job with your uploaded data:

import requests
import json

# Configuration
api_version = "2025-04-01-preview"
model_to_finetune = "gpt-4.1-2025-04-14"
training_type = "Global"  # or "Standard" or "Developer"

# Create fine-tuning job
url = f"{endpoint}/openai/fine_tuning/jobs?api-version={api_version}"
headers = {
    "Content-Type": "application/json",
    "api-key": os.environ["AZURE_OPENAI_API_KEY"]
}

payload = {
    "model": model_to_finetune,
    "training_file": training_file.id,
    "validation_file": validation_file.id,
    "seed": 105,  # For reproducibility
    "hyperparameters": {
        "n_epochs": 3,
        "batch_size": 8,
        "learning_rate_multiplier": 0.1
    },
    "suffix": "travel-assistant-v1",
    "extra_body": {
        "training_type": training_type
    }
}

response = requests.post(url, headers=headers, json=payload)
job = response.json()

print(f"Fine-tuning job created: {job['id']}")
print(f"Status: {job['status']}")

Step 3: Monitor Training Progress

Track the job status as training progresses:

import time

def check_job_status(job_id):
    """Monitor fine-tuning job progress"""
    url = f"{endpoint}/openai/fine_tuning/jobs/{job_id}?api-version={api_version}"
    headers = {"api-key": os.environ["AZURE_OPENAI_API_KEY"]}
    
    while True:
        response = requests.get(url, headers=headers)
        job_status = response.json()
        
        status = job_status["status"]
        print(f"Job status: {status}")
        
        if status == "succeeded":
            print(f"Fine-tuned model: {job_status['fine_tuned_model']}")
            return job_status["fine_tuned_model"]
        elif status == "failed":
            print(f"Job failed: {job_status.get('error', 'Unknown error')}")
            return None
        elif status == "cancelled":
            print("Job was cancelled")
            return None
        
        # Check every 60 seconds
        time.sleep(60)

fine_tuned_model = check_job_status(job["id"])

Step 4: Deploy Fine-Tuned Model

Once training completes, deploy your fine-tuned model:

deployment_name = "travel-assistant-v1"

# Create deployment via Azure CLI
os.system(f"""
az cognitiveservices account deployment create \
    --resource-group {resource_group} \
    --name {azure_openai_resource} \
    --deployment-name {deployment_name} \
    --model-name {fine_tuned_model} \
    --model-version "1" \
    --model-format OpenAI \
    --sku-name "Standard" \
    --sku-capacity 10
""")

Step 5: Test the Fine-Tuned Model

Compare your fine-tuned model against the base model:

def test_model(deployment_name, test_prompt):
    """Test a deployed model"""
    response = client.chat.completions.create(
        model=deployment_name,
        messages=[
            {"role": "system", "content": "You are a travel assistant."},
            {"role": "user", "content": test_prompt}
        ],
        temperature=0.7
    )
    return response.choices[0].message.content

# Test prompts
test_prompts = [
    "I want to visit Japan",
    "Plan a honeymoon trip",
    "Best time to visit Europe"
]

print("=== Base Model Responses ===")
for prompt in test_prompts:
    response = test_model("gpt-4o", prompt)
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")

print("=== Fine-Tuned Model Responses ===")
for prompt in test_prompts:
    response = test_model(deployment_name, prompt)
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")

Node.js Fine-Tuning Implementation

Here’s the equivalent workflow in Node.js:

import { AzureOpenAI } from "@azure/openai";
import { DefaultAzureCredential } from "@azure/identity";
import * as fs from "fs";

const endpoint = process.env.AZURE_OPENAI_ENDPOINT;
const credential = new DefaultAzureCredential();

const client = new AzureOpenAI({ endpoint, credential });

async function uploadTrainingData() {
    // Upload training file
    const trainingFile = await client.files.create({
        file: fs.createReadStream("training_data.jsonl"),
        purpose: "fine-tune"
    });
    
    console.log(`Training file uploaded: ${trainingFile.id}`);
    
    // Upload validation file
    const validationFile = await client.files.create({
        file: fs.createReadStream("validation_data.jsonl"),
        purpose: "fine-tune"
    });
    
    console.log(`Validation file uploaded: ${validationFile.id}`);
    
    return { trainingFile, validationFile };
}

async function createFineTuningJob(trainingFileId, validationFileId) {
    const job = await client.fineTuning.jobs.create({
        model: "gpt-4.1-2025-04-14",
        training_file: trainingFileId,
        validation_file: validationFileId,
        hyperparameters: {
            n_epochs: 3,
            batch_size: 8,
            learning_rate_multiplier: 0.1
        },
        suffix: "travel-assistant-v1"
    });
    
    console.log(`Fine-tuning job created: ${job.id}`);
    return job;
}

async function monitorJob(jobId) {
    while (true) {
        const job = await client.fineTuning.jobs.retrieve(jobId);
        console.log(`Job status: ${job.status}`);
        
        if (job.status === "succeeded") {
            console.log(`Fine-tuned model: ${job.fine_tuned_model}`);
            return job.fine_tuned_model;
        } else if (job.status === "failed" || job.status === "cancelled") {
            throw new Error(`Job ${job.status}`);
        }
        
        await new Promise(resolve => setTimeout(resolve, 60000));
    }
}

// Execute the workflow
const { trainingFile, validationFile } = await uploadTrainingData();
const job = await createFineTuningJob(trainingFile.id, validationFile.id);
const fineTunedModel = await monitorJob(job.id);

Evaluation and Iteration

Fine-tuning is iterative. Your first attempt rarely produces optimal results. Azure AI Foundry provides comprehensive evaluation tools to measure performance and guide improvements.

Quick Evals enable rapid testing during development. Run evaluations on small test sets to validate training direction before committing to full evaluation runs.

Python Grader allows custom evaluation logic. Write Python functions that score model outputs according to your specific criteria. This flexibility handles domain-specific metrics that generic evaluations miss.
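
Conceptually, a custom grader is just a function that maps a model output to a score. The exact signature and field names Foundry expects may differ from this sketch, so treat it as an illustration of the scoring logic rather than the product’s grader contract:

def grade(sample: dict, item: dict) -> float:
    """Score a travel-assistant reply between 0 and 1.

    `sample` is assumed to hold the model output and `item` the reference data;
    adjust the field names to match the grader contract in the Foundry docs.
    """
    reply = sample.get("output_text", "")
    score = 0.0
    # Reward the clarifying-question pattern we trained for
    if "?" in reply:
        score += 0.5
    # Keep replies concise
    if len(reply.split()) <= 150:
        score += 0.3
    # Check that the assistant asks about budget or dates before recommending
    if "budget" in reply.lower() or "dates" in reply.lower():
        score += 0.2
    return min(score, 1.0)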

Auto-Evals for RFT provide observability into reasoning models’ decision-making. Understand how the model arrived at its answers and identify reasoning patterns that need refinement.

Compare multiple fine-tuned models simultaneously to choose the best candidate for production. Side-by-side evaluation reveals subtle differences in quality, consistency, and style.

Cost Management

Fine-tuning incurs costs during both training and inference. Training costs depend on the model size, dataset size, and number of epochs. A typical GPT-4.1-mini fine-tuning job with 1000 examples costs approximately $20-40.

Inference costs for fine-tuned models match their base model counterparts. A fine-tuned GPT-4.1-nano costs the same per token as the base nano model. The cost savings come from using smaller models effectively rather than reduced per-token pricing.

Developer tier training provides significant savings for experimentation. Use this tier during development and switch to Standard or Global for production training runs.

Free hosting for 24 hours per deployment lets you test fine-tuned models without immediate inference costs. Deploy, evaluate thoroughly, and only commit to paid hosting when satisfied with performance.

Best Practices

Start with prompt engineering before fine-tuning. Exhaust prompt optimization techniques first. Fine-tuning should solve problems that prompting can’t, not replace basic optimization.

Begin with small datasets and iterate. Fine-tune on 100-500 examples first. Evaluate results, identify gaps, and expand your dataset strategically. This approach avoids wasting resources on large-scale training runs that might head in the wrong direction.

Use validation data to prevent overfitting. Hold out 10-20% of your data for validation. Monitor validation loss during training. If training loss decreases while validation loss increases, your model is memorizing rather than learning.
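
One way to watch for this is to pull the metrics file a completed job produces and compare the loss curves. This sketch reuses the Python client from Step 1; the column names in the results CSV are assumptions based on the OpenAI-style results format, so adjust them to whatever your job actually emits:

import csv
import io

def loss_curves(job_id):
    """Fetch the results file for a finished job and print train vs. validation loss."""
    job = client.fine_tuning.jobs.retrieve(job_id)
    if not job.result_files:
        print("No result files yet -- the job may still be running")
        return
    content = client.files.content(job.result_files[0]).read().decode("utf-8")
    for row in csv.DictReader(io.StringIO(content)):
        # Column names assumed from the OpenAI-style results file; adjust if needed
        print(row.get("step"), row.get("train_loss"), row.get("valid_loss"))

loss_curves(job["id"])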

Version your datasets and models meticulously. Record which dataset version produced which model. Track hyperparameters, training dates, and evaluation results. This discipline enables reproducibility and helps diagnose quality regressions.
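
A simple way to enforce that discipline is to append a record to a registry file every time a job finishes. This sketch uses a plain JSONL log and a dataset hash; the file names and the evaluation summary are placeholders:

import hashlib
import json
from datetime import datetime, timezone

def register_run(dataset_path, job, hyperparameters, eval_summary, registry="finetune_registry.jsonl"):
    """Append one record per training run so results stay reproducible.

    `job` is the job payload as a dict; refresh it after completion so
    fine_tuned_model is populated.
    """
    with open(dataset_path, "rb") as f:
        dataset_hash = hashlib.sha256(f.read()).hexdigest()
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset_path,
        "dataset_sha256": dataset_hash,
        "job_id": job["id"],
        "base_model": job.get("model"),
        "fine_tuned_model": job.get("fine_tuned_model"),
        "hyperparameters": hyperparameters,
        "evaluation": eval_summary,
    }
    with open(registry, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example usage with the job and payload from Step 2 (eval score is a placeholder)
register_run("training_data.jsonl", job, payload["hyperparameters"], {"quick_eval_score": 0.84})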

Implement continuous fine-tuning for production systems. As you collect more production data, retrain models periodically. This keeps them aligned with current usage patterns and prevents quality drift.

What’s Next

In the next post, we’ll explore cost optimization strategies for AI workloads in Azure AI Foundry. We’ll cover techniques for reducing model costs, optimizing compute resources, implementing caching strategies, and architecting systems that balance quality with budget constraints. We’ll also examine real-world case studies showing how organizations achieved 60-80% cost reductions without sacrificing performance.

Fine-tuning transforms general-purpose models into domain specialists. Azure AI Foundry makes this accessible with serverless infrastructure, comprehensive tooling, and flexible deployment options. Start small, measure results, and iterate toward production-quality models.
