Theory becomes reality in this part. We move from understanding vector databases and comparing options to actually building a production-ready Retrieval Augmented Generation application on Azure. This part provides working code in Python and Node.js, along with deployment configurations and best practices learned from real-world implementations.
The RAG pattern has become the standard approach for grounding Large Language Models with current, domain-specific information. Rather than retraining models or hoping they know your data, RAG retrieves relevant context at query time and includes it in the prompt. This enables LLMs to answer questions about your proprietary documents, internal knowledge bases, and constantly updating information.
RAG Architecture on Azure
A production RAG system on Azure consists of several integrated components. Understanding the architecture helps you make informed decisions about which services to use and how they interact.
Core Components
Azure OpenAI provides both the embedding model and the chat completion model. The embedding model converts text into vector representations, typically using text-embedding-ada-002 or the newer text-embedding-3-large. The chat model, usually GPT-4 or GPT-4o, generates responses based on retrieved context.
The vector store indexes and retrieves document chunks. Options include Azure AI Search for fully managed operations, Azure Cosmos DB for PostgreSQL with pgvector for SQL integration, or SQL Server 2025 for enterprises standardized on SQL Server. Each offers different tradeoffs between operational complexity, performance, and cost.
Azure Blob Storage holds source documents. The indexing pipeline reads from Blob Storage, chunks documents, generates embeddings, and populates the vector store. Azure Functions or Container Apps orchestrate this pipeline, processing new documents automatically when they arrive.
The query service handles user requests. It generates embeddings for queries, retrieves relevant chunks from the vector store, constructs prompts with context, calls Azure OpenAI for completion, and returns responses with citations. This runs as an API in App Service, Container Apps, or serverless Functions.
graph TB
A[User Query] --> B[API Service]
B --> C[Generate Query Embedding]
C --> D[Azure OpenAI Embeddings]
D --> E[Vector Search]
E --> F[Azure AI Search / pgvector / SQL Server]
F --> G[Retrieve Top K Chunks]
G --> H[Construct Prompt]
H --> I[Azure OpenAI Chat]
I --> J[Generate Response]
J --> K[Return with Citations]
L[Document Upload] --> M[Azure Blob Storage]
M --> N[Indexing Pipeline]
N --> O[Chunk Documents]
O --> P[Generate Embeddings]
P --> D
P --> Q[Store Vectors + Metadata]
Q --> F
R[Components] --> S[Azure OpenAI]
R --> T[Vector Store]
R --> U[Blob Storage]
R --> V[App Service / Functions]
style A fill:#e1f5ff
style L fill:#ffe1e1
style R fill:#e1ffe1

Design Decisions
Several architectural decisions significantly impact your RAG implementation. Chunk size determines how much context each embedding represents. Smaller chunks of 200-500 tokens provide precise matching but might lack context. Larger chunks of 1000-2000 tokens include more context but reduce precision. Most production systems use 500-800 tokens with 50-100 token overlap between chunks.
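For illustration, here is a minimal token-based chunker using tiktoken; the Python implementation later in this part uses a simpler word-based split instead. The cl100k_base encoding matches text-embedding-ada-002 and the text-embedding-3 models, and the 600/75 sizes are just starting points, not recommendations from this tutorial's code.

# Sketch: token-based chunking with tiktoken (an assumption; the tutorial
# code below uses a word-based split). cl100k_base is the encoding used by
# text-embedding-ada-002 and the text-embedding-3 models.
import tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 600, overlap: int = 75) -> list[str]:
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(encoding.decode(window))
    return chunks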
Top K retrieval controls how many chunks you return. More chunks provide better context coverage but increase prompt length and cost. Start with K between 5 and 10 and adjust based on your average query complexity and response quality.
Metadata filtering improves relevance by restricting search scope. Add metadata like document type, date, department, or security classification. Filter at query time to ensure users only see authorized content and searches focus on relevant document subsets.
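As a usage sketch, the search method implemented in the Python section below accepts an OData filter expression on the filterable fields defined in the index; the file names here are hypothetical.

# Hypothetical call against the AzureRAGSystem class defined below.
# Azure AI Search filters use OData syntax on filterable fields
# ('source' and 'metadata' in the index created by create_index()).
results = rag.search(
    query="What is the parental leave policy?",
    top_k=5,
    # Restrict the search to documents this user is authorized to see.
    filter_expr="search.in(source, 'hr-handbook.pdf,benefits-2024.pdf', ',')"
)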
Python Implementation with Azure AI Search
This Python implementation demonstrates a complete RAG system using Azure AI Search as the vector store. The code handles document ingestion, chunk processing, embedding generation, and query execution.
import os
from typing import List, Dict
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.models import VectorizedQuery
from azure.search.documents.indexes.models import (
SearchIndex,
SearchField,
SearchFieldDataType,
VectorSearch,
    HnswAlgorithmConfiguration,
    HnswParameters,
VectorSearchProfile,
SemanticConfiguration,
SemanticField,
SemanticPrioritizedFields,
SemanticSearch
)
from openai import AzureOpenAI
from azure.storage.blob import BlobServiceClient
import PyPDF2
import io
class AzureRAGSystem:
def __init__(
self,
search_endpoint: str,
search_key: str,
openai_endpoint: str,
openai_key: str,
embedding_deployment: str,
chat_deployment: str,
index_name: str = "documents"
):
"""Initialize Azure RAG system with required credentials"""
self.search_endpoint = search_endpoint
self.index_name = index_name
        # Initialize search clients (string keys must be wrapped in AzureKeyCredential)
        credential = AzureKeyCredential(search_key)
        self.index_client = SearchIndexClient(
            endpoint=search_endpoint,
            credential=credential
        )
        self.search_client = SearchClient(
            endpoint=search_endpoint,
            index_name=index_name,
            credential=credential
        )
# Initialize OpenAI client
self.openai_client = AzureOpenAI(
api_key=openai_key,
api_version="2024-08-01-preview",
azure_endpoint=openai_endpoint
)
self.embedding_deployment = embedding_deployment
self.chat_deployment = chat_deployment
def create_index(self, vector_dimensions: int = 1536):
"""Create search index with vector and semantic search capabilities"""
fields = [
SearchField(
name="id",
type=SearchFieldDataType.String,
key=True,
filterable=True
),
SearchField(
name="content",
type=SearchFieldDataType.String,
searchable=True,
retrievable=True
),
SearchField(
name="contentVector",
type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
searchable=True,
vector_search_dimensions=vector_dimensions,
vector_search_profile_name="myHnswProfile"
),
SearchField(
name="metadata",
type=SearchFieldDataType.String,
searchable=True,
retrievable=True,
filterable=True,
facetable=True
),
SearchField(
name="source",
type=SearchFieldDataType.String,
filterable=True,
retrievable=True
),
SearchField(
name="chunk_id",
type=SearchFieldDataType.Int32,
filterable=True
)
]
# Configure vector search
vector_search = VectorSearch(
algorithms=[
                HnswAlgorithmConfiguration(
                    name="myHnsw",
                    parameters=HnswParameters(
                        m=4,
                        ef_construction=400,
                        ef_search=500,
                        metric="cosine"
                    )
                )
],
profiles=[
VectorSearchProfile(
name="myHnswProfile",
algorithm_configuration_name="myHnsw"
)
]
)
# Configure semantic search
semantic_config = SemanticConfiguration(
name="my-semantic-config",
prioritized_fields=SemanticPrioritizedFields(
content_fields=[SemanticField(field_name="content")]
)
)
semantic_search = SemanticSearch(
configurations=[semantic_config]
)
# Create index
index = SearchIndex(
name=self.index_name,
fields=fields,
vector_search=vector_search,
semantic_search=semantic_search
)
self.index_client.create_or_update_index(index)
print(f"Index '{self.index_name}' created successfully")
def generate_embedding(self, text: str) -> List[float]:
"""Generate embedding for text using Azure OpenAI"""
response = self.openai_client.embeddings.create(
input=text,
model=self.embedding_deployment
)
return response.data[0].embedding
    def chunk_text(self, text: str, chunk_size: int = 800, overlap: int = 100) -> List[str]:
        """Split text into overlapping chunks (sizes are in words, a rough proxy for tokens)"""
words = text.split()
chunks = []
for i in range(0, len(words), chunk_size - overlap):
chunk = ' '.join(words[i:i + chunk_size])
if chunk:
chunks.append(chunk)
return chunks
def process_pdf(self, blob_client, source_name: str) -> List[Dict]:
"""Extract text from PDF and create document chunks"""
# Download PDF
pdf_bytes = blob_client.download_blob().readall()
pdf_file = io.BytesIO(pdf_bytes)
# Extract text
pdf_reader = PyPDF2.PdfReader(pdf_file)
full_text = ""
for page in pdf_reader.pages:
full_text += page.extract_text() + " "
# Create chunks
chunks = self.chunk_text(full_text)
# Prepare documents
documents = []
for idx, chunk in enumerate(chunks):
embedding = self.generate_embedding(chunk)
doc = {
"id": f"{source_name}_{idx}",
"content": chunk,
"contentVector": embedding,
"source": source_name,
"chunk_id": idx,
"metadata": f"page_range_{idx}"
}
documents.append(doc)
return documents
def index_documents(self, documents: List[Dict]):
"""Upload documents to search index"""
result = self.search_client.upload_documents(documents=documents)
succeeded = sum([1 for r in result if r.succeeded])
print(f"Indexed {succeeded} document chunks successfully")
def search(
self,
query: str,
top_k: int = 5,
filter_expr: str = None,
use_semantic: bool = True
) -> List[Dict]:
"""Perform hybrid vector + semantic search"""
# Generate query embedding
query_vector = self.generate_embedding(query)
# Create vector query
vector_query = VectorizedQuery(
vector=query_vector,
k_nearest_neighbors=top_k,
fields="contentVector"
)
# Execute search
search_params = {
"search_text": query,
"vector_queries": [vector_query],
"select": ["content", "source", "chunk_id", "metadata"],
"top": top_k
}
if filter_expr:
search_params["filter"] = filter_expr
if use_semantic:
search_params["query_type"] = "semantic"
search_params["semantic_configuration_name"] = "my-semantic-config"
results = self.search_client.search(**search_params)
return [
{
"content": doc["content"],
"source": doc["source"],
"chunk_id": doc["chunk_id"],
"score": doc["@search.score"]
}
for doc in results
]
def generate_response(self, query: str, context_chunks: List[Dict]) -> str:
"""Generate RAG response using retrieved context"""
# Build context from retrieved chunks
context = "\n\n".join([
f"[Source: {chunk['source']}, Chunk: {chunk['chunk_id']}]\n{chunk['content']}"
for chunk in context_chunks
])
# Create prompt
system_prompt = """You are a helpful AI assistant. Answer questions based on the provided context.
If the answer cannot be found in the context, say so clearly.
Always cite your sources using the format [Source: filename, Chunk: number]."""
user_prompt = f"""Context:
{context}
Question: {query}
Answer:"""
# Generate completion
response = self.openai_client.chat.completions.create(
model=self.chat_deployment,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.3,
max_tokens=800
)
return response.choices[0].message.content
def query(self, question: str, top_k: int = 5, filter_expr: str = None) -> Dict:
"""Complete RAG query pipeline"""
# Retrieve relevant chunks
chunks = self.search(question, top_k=top_k, filter_expr=filter_expr)
# Generate response
answer = self.generate_response(question, chunks)
return {
"answer": answer,
"sources": chunks
}
# Example usage
def main():
# Configuration
config = {
"search_endpoint": os.getenv("AZURE_SEARCH_ENDPOINT"),
"search_key": os.getenv("AZURE_SEARCH_KEY"),
"openai_endpoint": os.getenv("AZURE_OPENAI_ENDPOINT"),
"openai_key": os.getenv("AZURE_OPENAI_KEY"),
"embedding_deployment": "text-embedding-ada-002",
"chat_deployment": "gpt-4o",
"blob_connection": os.getenv("AZURE_STORAGE_CONNECTION_STRING")
}
# Initialize RAG system
rag = AzureRAGSystem(
search_endpoint=config["search_endpoint"],
search_key=config["search_key"],
openai_endpoint=config["openai_endpoint"],
openai_key=config["openai_key"],
embedding_deployment=config["embedding_deployment"],
chat_deployment=config["chat_deployment"]
)
# Create index
rag.create_index()
# Process and index documents from Blob Storage
blob_service = BlobServiceClient.from_connection_string(
config["blob_connection"]
)
container_client = blob_service.get_container_client("documents")
for blob in container_client.list_blobs():
if blob.name.endswith('.pdf'):
print(f"Processing {blob.name}...")
blob_client = container_client.get_blob_client(blob.name)
documents = rag.process_pdf(blob_client, blob.name)
rag.index_documents(documents)
# Query the system
question = "What are the key features of Azure AI Search?"
result = rag.query(question, top_k=5)
print(f"\nQuestion: {question}")
print(f"\nAnswer: {result['answer']}")
print(f"\nSources:")
for source in result['sources']:
print(f" - {source['source']} (Chunk {source['chunk_id']})")
if __name__ == "__main__":
    main()

Node.js Implementation with Azure Cosmos DB
This Node.js implementation uses Azure Cosmos DB for PostgreSQL with pgvector, providing a different architectural approach that keeps vectors alongside relational data.
const { Client } = require('pg');
const { OpenAIClient, AzureKeyCredential } = require("@azure/openai");
const { BlobServiceClient } = require("@azure/storage-blob");
const pdf = require('pdf-parse');
class AzureRAGNodeSystem {
constructor(config) {
this.config = config;
// Initialize PostgreSQL client
this.pgClient = new Client({
host: config.postgresHost,
port: 5432,
database: 'citus',
user: 'citus',
password: config.postgresPassword,
ssl: { rejectUnauthorized: false }
});
// Initialize OpenAI client
this.openaiClient = new OpenAIClient(
config.openaiEndpoint,
new AzureKeyCredential(config.openaiKey)
);
this.embeddingModel = config.embeddingDeployment;
this.chatModel = config.chatDeployment;
}
async connect() {
await this.pgClient.connect();
// Enable pgvector
await this.pgClient.query('CREATE EXTENSION IF NOT EXISTS vector');
}
async createSchema() {
const sql = `
CREATE TABLE IF NOT EXISTS document_chunks (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(1536),
source VARCHAR(255),
chunk_id INTEGER,
metadata JSONB,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_embedding
ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
CREATE INDEX IF NOT EXISTS idx_source
ON document_chunks(source);
`;
await this.pgClient.query(sql);
console.log('Schema created successfully');
}
async generateEmbedding(text) {
const response = await this.openaiClient.getEmbeddings(
this.embeddingModel,
[text]
);
return response.data[0].embedding;
}
chunkText(text, chunkSize = 800, overlap = 100) {
const words = text.split(/\s+/);
const chunks = [];
for (let i = 0; i < words.length; i += chunkSize - overlap) {
const chunk = words.slice(i, i + chunkSize).join(' ');
if (chunk.trim()) {
chunks.push(chunk);
}
}
return chunks;
}
async processPdf(blobClient, sourceName) {
// Download PDF
const downloadResponse = await blobClient.download();
const buffer = await streamToBuffer(downloadResponse.readableStreamBody);
// Extract text
const data = await pdf(buffer);
const fullText = data.text;
// Create chunks
const chunks = this.chunkText(fullText);
// Generate embeddings and prepare documents
const documents = [];
for (let idx = 0; idx < chunks.length; idx++) {
const embedding = await this.generateEmbedding(chunks[idx]);
documents.push({
content: chunks[idx],
embedding,
source: sourceName,
chunk_id: idx,
metadata: { page_range: idx }
});
}
return documents;
}
async indexDocuments(documents) {
const values = [];
const placeholders = [];
let paramCount = 1;
for (const doc of documents) {
placeholders.push(
`($${paramCount}, $${paramCount + 1}, $${paramCount + 2}, $${paramCount + 3}, $${paramCount + 4})`
);
values.push(
doc.content,
JSON.stringify(doc.embedding),
doc.source,
doc.chunk_id,
JSON.stringify(doc.metadata)
);
paramCount += 5;
}
const sql = `
INSERT INTO document_chunks (content, embedding, source, chunk_id, metadata)
VALUES ${placeholders.join(', ')}
`;
await this.pgClient.query(sql, values);
console.log(`Indexed ${documents.length} document chunks`);
}
async search(query, topK = 5, filterSource = null) {
const queryEmbedding = await this.generateEmbedding(query);
let sql = `
SELECT
id,
content,
source,
chunk_id,
metadata,
1 - (embedding <=> $1::vector) as similarity
FROM document_chunks
`;
const params = [JSON.stringify(queryEmbedding)];
let paramCount = 2;
if (filterSource) {
sql += ` WHERE source = $${paramCount}`;
params.push(filterSource);
paramCount++;
}
sql += `
ORDER BY embedding <=> $1::vector
LIMIT $${paramCount}
`;
params.push(topK);
const result = await this.pgClient.query(sql, params);
return result.rows;
}
async generateResponse(query, contextChunks) {
// Build context
const context = contextChunks.map(chunk =>
`[Source: ${chunk.source}, Chunk: ${chunk.chunk_id}]\n${chunk.content}`
).join('\n\n');
// Create prompt
const messages = [
{
role: "system",
content: "You are a helpful AI assistant. Answer questions based on the provided context. If the answer cannot be found in the context, say so clearly. Always cite your sources."
},
{
role: "user",
content: `Context:\n${context}\n\nQuestion: ${query}\n\nAnswer:`
}
];
// Generate completion
const response = await this.openaiClient.getChatCompletions(
this.chatModel,
messages,
{
temperature: 0.3,
maxTokens: 800
}
);
return response.choices[0].message.content;
}
async query(question, topK = 5, filterSource = null) {
// Retrieve relevant chunks
const chunks = await this.search(question, topK, filterSource);
// Generate response
const answer = await this.generateResponse(question, chunks);
return {
answer,
sources: chunks.map(chunk => ({
source: chunk.source,
chunk_id: chunk.chunk_id,
similarity: chunk.similarity
}))
};
}
async disconnect() {
await this.pgClient.end();
}
}
// Helper function
async function streamToBuffer(readableStream) {
return new Promise((resolve, reject) => {
const chunks = [];
readableStream.on('data', (data) => {
chunks.push(data instanceof Buffer ? data : Buffer.from(data));
});
readableStream.on('end', () => {
resolve(Buffer.concat(chunks));
});
readableStream.on('error', reject);
});
}
// Example usage
async function main() {
const config = {
postgresHost: process.env.COSMOS_DB_HOST,
postgresPassword: process.env.COSMOS_DB_PASSWORD,
openaiEndpoint: process.env.AZURE_OPENAI_ENDPOINT,
openaiKey: process.env.AZURE_OPENAI_KEY,
embeddingDeployment: 'text-embedding-ada-002',
chatDeployment: 'gpt-4o',
storageConnection: process.env.AZURE_STORAGE_CONNECTION_STRING
};
const rag = new AzureRAGNodeSystem(config);
try {
await rag.connect();
await rag.createSchema();
// Process documents from Blob Storage
const blobServiceClient = BlobServiceClient.fromConnectionString(
config.storageConnection
);
const containerClient = blobServiceClient.getContainerClient('documents');
for await (const blob of containerClient.listBlobsFlat()) {
if (blob.name.endsWith('.pdf')) {
console.log(`Processing ${blob.name}...`);
const blobClient = containerClient.getBlobClient(blob.name);
const documents = await rag.processPdf(blobClient, blob.name);
await rag.indexDocuments(documents);
}
}
// Query the system
const question = "What are the key features of Azure Cosmos DB?";
const result = await rag.query(question, 5);
console.log(`\nQuestion: ${question}`);
console.log(`\nAnswer: ${result.answer}`);
console.log(`\nSources:`);
result.sources.forEach(source => {
console.log(` - ${source.source} (Chunk ${source.chunk_id}, Similarity: ${source.similarity.toFixed(3)})`);
});
} finally {
await rag.disconnect();
}
}
main().catch(console.error);

Best Practices for Production
Moving from prototype to production requires attention to performance, reliability, security, and cost optimization. These best practices come from real-world deployments.
Chunking Strategy
Effective chunking significantly impacts retrieval quality. Use semantic chunking that respects document structure. Keep paragraphs together, do not split mid-sentence, and preserve logical boundaries like section headers. Add metadata about section titles and document hierarchy to each chunk.
Implement chunk overlap to prevent context loss at boundaries. A 100-token overlap ensures that information spanning chunk boundaries appears in at least one complete chunk. This improves retrieval recall for queries that match edge content.
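The snippet below sketches one way to do paragraph-aware chunking with overlap. It is a starting point rather than the chunker used in the implementations above, and sizes are word counts as a rough token proxy.

# Sketch of paragraph-aware chunking: keep paragraphs intact and only start
# a new chunk when adding the next paragraph would exceed the budget.
def chunk_by_paragraph(text: str, max_words: int = 700, overlap_words: int = 100) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        if current and len(" ".join(current + [para]).split()) > max_words:
            chunks.append(" ".join(current))
            # Carry the tail of the previous chunk forward as overlap.
            tail = " ".join(current).split()[-overlap_words:]
            current = [" ".join(tail)]
        current.append(para)
    if current:
        chunks.append(" ".join(current))
    return chunks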
Prompt Engineering
System prompts control how the model uses retrieved context. Instruct it to cite sources, admit when information is missing, and avoid speculation beyond the provided context. Include examples of good responses showing proper citation format.
Limit response length to prevent rambling. Set appropriate token limits based on your UI constraints. For chat interfaces, 500-800 tokens usually suffices. For detailed analysis, allow 1500-2000 tokens.
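As an illustrative sketch, a system prompt can bundle these rules together with a short example answer. The wording and limits below are assumptions to adapt, and the openai_client, chat_deployment, context, and question variables stand in for the ones built in the Python implementation above.

# Sketch: system prompt with explicit rules and a citation example.
SYSTEM_PROMPT = """You are a helpful assistant that answers strictly from the provided context.

Rules:
- If the context does not contain the answer, reply: "I could not find that in the provided documents."
- Cite every factual claim as [Source: <filename>, Chunk: <number>].
- Keep answers under 300 words unless the user asks for more detail.

Example answer:
"Standard support tickets are answered within 8 business hours
[Source: support-sla.pdf, Chunk: 3]."
"""

response = openai_client.chat.completions.create(
    model=chat_deployment,   # e.g. a gpt-4o deployment
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    temperature=0.2,
    max_tokens=800,          # cap response length for chat UIs
)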
Caching and Performance
Cache embeddings for frequently accessed documents. Generating embeddings costs both time and money. Store embeddings with documents and only regenerate when content changes.
Implement query caching for common questions. Many users ask similar questions. Cache responses for identical queries, expiring after a reasonable period. This dramatically reduces costs for high-traffic applications.
Use batch processing for index updates. Instead of indexing documents immediately on upload, batch them and process together. This improves throughput and reduces API calls.
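A minimal in-process sketch of both caches is shown below, keyed by content hash with a TTL on answers. Production systems typically back these with Azure Cache for Redis or a database table rather than in-memory dictionaries.

# Sketch: cache embeddings by content hash so unchanged chunks are never
# re-embedded, and cache full answers for identical queries with a TTL.
import hashlib
import time

embedding_cache: dict[str, list[float]] = {}
answer_cache: dict[str, tuple[float, dict]] = {}
ANSWER_TTL_SECONDS = 3600

def cached_embedding(rag, text: str) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in embedding_cache:
        embedding_cache[key] = rag.generate_embedding(text)
    return embedding_cache[key]

def cached_query(rag, question: str) -> dict:
    key = hashlib.sha256(question.strip().lower().encode("utf-8")).hexdigest()
    hit = answer_cache.get(key)
    if hit and time.time() - hit[0] < ANSWER_TTL_SECONDS:
        return hit[1]
    result = rag.query(question)
    answer_cache[key] = (time.time(), result)
    return result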
Security and Access Control
Implement document-level security through metadata filters. Tag each chunk with access control information. Filter searches based on user permissions, ensuring users only retrieve authorized content.
Use managed identities for service-to-service authentication. Avoid hardcoded credentials. Configure Azure services to authenticate using managed identities, reducing security risk and operational overhead.
Log all queries and responses for audit purposes. Track who asked what questions and what information was retrieved. This supports compliance requirements and helps identify potential security issues.
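Here is a sketch of key-free authentication for the two main dependencies, assuming the application's managed identity has been granted the Search Index Data Reader and Cognitive Services OpenAI User roles; the endpoints are placeholders.

# Sketch: managed identity / Entra ID authentication instead of API keys.
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.search.documents import SearchClient
from openai import AzureOpenAI

credential = DefaultAzureCredential()

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",  # placeholder
    index_name="documents",
    credential=credential,
)

openai_client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",  # placeholder
    api_version="2024-08-01-preview",
    azure_ad_token_provider=get_bearer_token_provider(
        credential, "https://cognitiveservices.azure.com/.default"
    ),
)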
graph TB
A[Production Best Practices] --> B[Chunking]
A --> C[Prompts]
A --> D[Performance]
A --> E[Security]
B --> B1[Semantic Boundaries]
B --> B2[500-800 tokens]
B --> B3[100 token overlap]
B --> B4[Preserve Structure]
C --> C1[Clear Instructions]
C --> C2[Citation Format]
C --> C3[Token Limits]
C --> C4[Example Responses]
D --> D1[Cache Embeddings]
D --> D2[Cache Queries]
D --> D3[Batch Processing]
D --> D4[Connection Pooling]
E --> E1[Metadata Filters]
E --> E2[Managed Identity]
E --> E3[Audit Logging]
E --> E4[Access Control]
F[Monitoring] --> G[Latency p95]
F --> H[Cost per Query]
F --> I[Cache Hit Rate]
F --> J[Error Rate]
style A fill:#e1f5ff
style F fill:#ffe1e1

Deployment on Azure
Production deployment requires infrastructure configuration, monitoring, and scaling strategies. Azure provides multiple hosting options depending on your requirements.
Hosting Options
Azure App Service provides the simplest deployment path for web APIs. It handles scaling, SSL certificates, and deployment slots automatically. Use App Service for straightforward RAG APIs serving web or mobile clients.
Container Apps suit microservice architectures or when you need more control over the runtime environment. Deploy your RAG service as a container, scale based on HTTP requests or queue depth, and integrate with other containerized services.
Azure Functions work for event-driven or batch processing scenarios. Trigger document indexing when files arrive in Blob Storage. Run periodic reindexing jobs on a schedule. Functions provide excellent cost efficiency for intermittent workloads.
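A blob-triggered indexing function might look like the following sketch, using the Azure Functions Python v2 programming model; the connection name and the wiring back to the RAG pipeline are assumptions to adapt.

# Sketch: blob-triggered indexing (Azure Functions Python v2 model).
# "AzureWebJobsStorage" and the AzureRAGSystem wiring are assumptions.
import logging
import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob", path="documents/{name}",
                  connection="AzureWebJobsStorage")
def index_new_document(blob: func.InputStream):
    # In a real function, construct AzureRAGSystem from app settings and
    # reuse the chunking + embedding pipeline shown earlier in this part.
    content = blob.read()
    logging.info("Indexing %s (%d bytes)", blob.name, len(content))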
Infrastructure as Code
Use Azure Developer CLI or Bicep templates to define infrastructure. Version control your infrastructure alongside application code. This enables reproducible deployments across development, staging, and production environments.
The Azure Developer CLI provides templates specifically for RAG applications. These templates provision all required Azure services, configure networking and security, and deploy your application code with a single command.
Monitoring and Observability
Application Insights automatically collects telemetry from App Service and Functions. Track request latency, error rates, and dependency performance. Set up alerts for degraded performance or increased error rates.
Add custom metrics for RAG-specific concerns. Track embedding generation time, vector search latency, prompt token counts, and completion token counts. Monitor costs by counting API calls and tokens consumed.
Implement distributed tracing to follow requests across services. A single query touches Azure OpenAI for embeddings, your vector store for retrieval, and Azure OpenAI again for completion. Tracing shows where latency occurs and helps identify bottlenecks.
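One way to get that visibility is to wrap each pipeline stage in OpenTelemetry spans, as in this sketch; it assumes the azure-monitor-opentelemetry package is installed and an Application Insights connection string is configured in the environment.

# Sketch: per-stage spans exported to Application Insights. Assumes
# APPLICATIONINSIGHTS_CONNECTION_STRING is set for configure_azure_monitor().
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor()  # exports traces/metrics to Application Insights
tracer = trace.get_tracer("rag.pipeline")

def traced_query(rag, question: str) -> dict:
    with tracer.start_as_current_span("rag.query") as span:
        with tracer.start_as_current_span("rag.retrieve"):
            chunks = rag.search(question)
        with tracer.start_as_current_span("rag.generate") as gen_span:
            answer = rag.generate_response(question, chunks)
            gen_span.set_attribute("rag.chunks_retrieved", len(chunks))
        span.set_attribute("rag.question_length", len(question))
        return {"answer": answer, "sources": chunks}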
What’s Next
This part provided production-ready code for building RAG applications on Azure. You now have working implementations in Python and Node.js, along with best practices for chunking, prompting, caching, security, and deployment.
Part 5 will cover advanced optimization techniques including reranking strategies, hybrid search tuning, prompt engineering patterns, and cost optimization. We will explore how to improve retrieval accuracy, reduce latency, and minimize Azure OpenAI costs while maintaining response quality.
The implementation foundation is solid. Time to optimize.
References
- Pondhouse Data – Azure AI Search RAG Tutorial 2025
- Microsoft Learn – RAG application with Azure OpenAI and Azure AI Search (.NET)
- Microsoft Learn – RAG application with Azure OpenAI and Azure AI Search (Python)
- Microsoft Learn – RAG tutorial: Search using an LLM – Azure AI Search
- Medium – Building your first RAG pipeline with Langchain and Azure OpenAI service
- Medium – Implementing Retrieval Augmented Generation (RAG) with Azure OpenAI and Langchain
- LlamaIndex – Building a serverless RAG application with LlamaIndex and Azure OpenAI
- Microsoft Learn – Build a retrieval-augmented generation solution with Azure Content Understanding
- DEV Community – Implementing RAG with Azure OpenAI in .NET (C#)
