Vector Databases: From Hype to Production Reality – Part 2: Architecture and Indexing Algorithms

In Part 1, we explored what vector databases are and why they have become essential infrastructure for AI applications. Now we dive into the technical depths: how these databases actually work, the algorithms that power billion-scale searches, and the architectural decisions that separate production-ready systems from proof-of-concept demos.

Understanding vector database architecture is not just an academic exercise. The choices you make about indexing algorithms, storage formats, and search strategies directly impact whether your system returns results in 10 milliseconds or 10 seconds, whether it consumes 10 gigabytes or 10 terabytes of memory, and whether it achieves 95% recall or 60% recall on your queries.

Native vs Multimodel Vector Databases

The vector database landscape divides into two fundamental categories: native vector databases built from the ground up for high-dimensional similarity search, and multimodel databases that added vector capabilities to existing systems.

Native Vector Databases

Native vector databases like Pinecone, Milvus, Weaviate, and Qdrant were designed specifically for vector operations. Their entire architecture optimizes for storing, indexing, and searching high-dimensional embeddings. Every component, from storage layout to query execution, assumes vector data as the primary workload.

The advantages are substantial. Native systems can make aggressive optimizations that are impossible in general-purpose databases. They implement specialized index structures like HNSW graphs that require tight integration between storage and query processing. Memory management strategies assume vector-specific access patterns. The entire system can be tuned for the unique characteristics of similarity search.

However, native vector databases introduce operational complexity. You now have another database to manage, monitor, and maintain. Data synchronization becomes necessary if you need to combine vector search with traditional database operations. The learning curve for teams familiar with SQL databases can be steep.

Multimodel Databases

Multimodel databases added vector search to existing systems. PostgreSQL with pgvector, Azure SQL Server 2025, Elasticsearch, MongoDB Atlas, and Redis all now support vector operations alongside their traditional functionality.

The appeal is obvious. Teams already running PostgreSQL do not need to introduce a new database. Vector data lives alongside relational data, eliminating synchronization complexity. A single query can combine traditional filters with vector similarity search. The operational burden stays the same.

The tradeoff is performance. Multimodel databases must balance vector operations with their primary workload. Index structures cannot be as aggressive because they need to coexist with B-trees and other traditional indexes. Memory management serves multiple data types. Query optimization considers many access patterns, not just vector similarity.

For many production systems, multimodel databases provide the right balance. The performance gap has narrowed considerably. PostgreSQL with pgvector handles millions of vectors efficiently. Azure SQL Server 2025 brings enterprise-grade vector capabilities. Unless you are operating at massive scale or need the absolute fastest performance, multimodel solutions often make architectural sense.
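
To make the combined-query point concrete, here is a minimal sketch of what a single filter-plus-similarity query can look like in PostgreSQL with pgvector, called from Python via psycopg2. The table name, columns, and connection string are illustrative assumptions, not a reference to any specific system discussed here.

import psycopg2

# Hypothetical connection string and schema -- adjust to your environment
conn = psycopg2.connect("dbname=appdb user=app password=secret host=localhost")

query_embedding = [0.12, -0.03, 0.88]  # normally produced by an embedding model
embedding_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

with conn.cursor() as cur:
    # One SQL statement combines traditional metadata filters with
    # vector similarity ordering (<=> is pgvector's cosine distance operator)
    cur.execute(
        """
        SELECT id, title
        FROM documents
        WHERE category = %s
          AND published_at >= %s
        ORDER BY embedding <=> %s::vector
        LIMIT 10
        """,
        ("support-articles", "2024-01-01", embedding_literal),
    )
    for row in cur.fetchall():
        print(row)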

graph TB
    A[Vector Database Types] --> B[Native Vector Databases]
    A --> C[Multimodel Databases]
    
    B --> D[Pinecone]
    B --> E[Milvus]
    B --> F[Weaviate]
    B --> G[Qdrant]
    B --> H[Chroma]
    
    C --> I[PostgreSQL + pgvector]
    C --> J[Azure SQL Server 2025]
    C --> K[Elasticsearch]
    C --> L[MongoDB Atlas]
    C --> M[Redis]
    
    B --> N[Advantages]
    N --> N1[Optimized Performance]
    N --> N2[Advanced Algorithms]
    N --> N3[Vector-First Design]
    
    B --> O[Tradeoffs]
    O --> O1[Operational Complexity]
    O --> O2[Data Synchronization]
    O --> O3[Additional System]
    
    C --> P[Advantages]
    P --> P1[Single Database]
    P --> P2[Familiar Operations]
    P --> P3[Combined Queries]
    
    C --> Q[Tradeoffs]
    Q --> Q1[Performance Compromises]
    Q --> Q2[Feature Lag]
    Q --> Q3[Shared Resources]
    
    style A fill:#e1f5ff
    style B fill:#ffe1e1
    style C fill:#e1ffe1

Indexing Algorithms: HNSW Deep Dive

Hierarchical Navigable Small World graphs represent the state-of-the-art for approximate nearest neighbor search. Understanding HNSW is essential because it powers most production vector databases, from Pinecone to pgvector.

The Core Concept: Navigable Small Worlds

HNSW builds on the small world phenomenon. Just as any two people can be connected through roughly six degrees of separation, HNSW constructs graphs where any two vectors can be reached through a small number of hops. The graph combines long-range connections for rapid navigation with short-range connections for precise local search.

Think of it like navigating a city. Interstate highways get you quickly to the right region. Local roads bring you to the right neighborhood. Side streets lead to the exact address. HNSW implements this same hierarchical navigation in high-dimensional space.

The Hierarchical Structure

HNSW creates multiple layers of graphs stacked on top of each other. The bottom layer contains all vectors. Each upper layer contains a progressively smaller subset, sampled with exponentially decaying probability. The top layer might have only a handful of vectors, while the bottom layer contains millions.

Each layer is a navigable small world graph where vectors connect to their nearest neighbors. Upper layers have longer connections spanning greater distances. Lower layers have denser, shorter connections providing fine-grained accuracy.

Search begins at the top layer with a fixed entry point. The algorithm greedily navigates to the nearest neighbor at each step. When no closer neighbor exists in the current layer, search descends to the layer below and continues. This process repeats until reaching the bottom layer, where the final nearest neighbors are identified.
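
As a simplified sketch, assuming each layer is stored as a plain adjacency dictionary, the greedy navigation step within a single layer looks roughly like this. Real implementations maintain a beam of efSearch candidates and a visited set rather than a single current node.

import numpy as np

def greedy_search_layer(query, entry_point, neighbors, vectors):
    """Greedy descent within one HNSW layer.

    neighbors: dict mapping node id -> list of connected node ids
    vectors:   dict mapping node id -> np.ndarray
    Returns the local minimum reached from entry_point.
    """
    current = entry_point
    current_dist = np.linalg.norm(query - vectors[current])

    while True:
        improved = False
        for candidate in neighbors[current]:
            dist = np.linalg.norm(query - vectors[candidate])
            if dist < current_dist:
                current, current_dist = candidate, dist
                improved = True
        if not improved:
            # Local minimum in this layer: descend to the layer below
            return current

A full search repeats this from the top layer down, using the result of each layer as the entry point for the next, and widens the search to efSearch candidates on the bottom layer.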

graph TB
    A[HNSW Structure] --> B[Layer 3: Sparse]
    A --> C[Layer 2: Medium Density]
    A --> D[Layer 1: Dense]
    A --> E[Layer 0: All Vectors]
    
    B --> B1[2-4 vectors]
    B --> B2[Long connections]
    B --> B3[Fast navigation]
    
    C --> C1[Hundreds of vectors]
    C --> C2[Medium connections]
    
    D --> D1[Thousands of vectors]
    D --> D2[Short connections]
    
    E --> E1[All data points]
    E --> E2[Precise connections]
    E --> E3[Final refinement]
    
    F[Search Process] --> G[Start at top layer]
    G --> H[Greedy navigation]
    H --> I[Move to nearest neighbor]
    I --> J{Found local minimum?}
    J -->|No| I
    J -->|Yes| K[Descend to next layer]
    K --> L{At bottom layer?}
    L -->|No| H
    L -->|Yes| M[Return k nearest neighbors]
    
    style A fill:#e1f5ff
    style F fill:#ffe1e1
    style M fill:#e1ffe1

Key Parameters

HNSW performance depends on three critical parameters that control the tradeoff between speed, accuracy, and memory usage.

M determines the maximum number of connections per node in each layer. Higher M creates denser graphs with better recall but increases memory usage and construction time. Typical values range from 16 to 64. For most applications, M equals 16 provides a good balance.

efConstruction controls thoroughness during index building. The algorithm maintains a dynamic list of efConstruction candidate neighbors when inserting new vectors. Higher values produce better quality graphs but slow construction. Common values range from 100 to 400.

efSearch determines search thoroughness at query time. The algorithm explores efSearch candidates at each layer. Higher values improve recall at the cost of latency. Unlike efConstruction, efSearch can be tuned at query time, allowing dynamic tradeoffs between speed and accuracy.

Here is a Python implementation demonstrating HNSW with configurable parameters:

import numpy as np
import hnswlib

class HNSWIndex:
    def __init__(self, dimension, max_elements=10000, M=16, ef_construction=200):
        """
        Initialize HNSW index
        
        Args:
            dimension: Vector dimensionality
            max_elements: Maximum number of vectors
            M: Maximum number of connections per node
            ef_construction: Size of dynamic candidate list during construction
        """
        self.dimension = dimension
        self.index = hnswlib.Index(space='cosine', dim=dimension)
        
        # Initialize index
        self.index.init_index(
            max_elements=max_elements,
            M=M,
            ef_construction=ef_construction,
            random_seed=42
        )
        
        self.current_count = 0
    
    def add_vectors(self, vectors, ids=None):
        """Add vectors to the index"""
        if ids is None:
            ids = np.arange(self.current_count, 
                          self.current_count + len(vectors))
        
        self.index.add_items(vectors, ids)
        self.current_count += len(vectors)
        
        return ids
    
    def search(self, query_vectors, k=10, ef_search=50):
        """
        Search for k nearest neighbors
        
        Args:
            query_vectors: Query vectors
            k: Number of nearest neighbors
            ef_search: Size of dynamic candidate list during search
        """
        # Set search parameter
        self.index.set_ef(ef_search)
        
        # Perform search
        labels, distances = self.index.knn_query(query_vectors, k=k)
        
        return labels, distances
    
    def save_index(self, filepath):
        """Save index to disk"""
        self.index.save_index(filepath)
    
    def load_index(self, filepath):
        """Load index from disk"""
        self.index.load_index(filepath)

# Example usage
if __name__ == "__main__":
    # Create sample data
    dimension = 128
    num_vectors = 10000
    
    # Generate random vectors
    vectors = np.random.random((num_vectors, dimension)).astype('float32')
    
    # Create index with custom parameters
    index = HNSWIndex(
        dimension=dimension,
        max_elements=num_vectors,
        M=32,  # More connections for better recall
        ef_construction=400  # Thorough construction
    )
    
    # Add vectors
    print("Building index...")
    index.add_vectors(vectors)
    
    # Create query
    query = np.random.random((1, dimension)).astype('float32')
    
    # Search with different ef_search values
    for ef in [10, 50, 100, 200]:
        labels, distances = index.search(
            query, 
            k=10, 
            ef_search=ef
        )
        print(f"\nef_search={ef}")
        print(f"Nearest neighbors: {labels[0]}")
        print(f"Distances: {distances[0]}")
    
    # Save index
    index.save_index("hnsw_index.bin")

Performance Characteristics

HNSW achieves logarithmic time complexity for search operations, making it remarkably scalable. Search time grows slowly even as the dataset expands from thousands to billions of vectors. This is fundamentally different from brute-force search, which scales linearly.

The memory footprint is the main limitation. HNSW stores the full graph structure in memory, requiring roughly M times 8 bytes per vector for connections, plus the original vector data. For a billion vectors with M equals 32, this means approximately 256 gigabytes just for the graph structure.

Construction time can be significant. Building an HNSW index over 100 million vectors might take hours, depending on parameters. However, incremental updates work efficiently. Adding new vectors does not require rebuilding the entire index.
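
A quick back-of-the-envelope calculation makes the memory numbers above concrete; the per-link cost below follows the rough M times 8 bytes figure used in this section.

def hnsw_memory_estimate(num_vectors, dimension, M):
    """Rough HNSW memory estimate: raw float32 vectors plus graph links."""
    vector_bytes = num_vectors * dimension * 4      # float32 vector data
    graph_bytes = num_vectors * M * 8               # ~M*8 bytes of links per vector
    return vector_bytes, graph_bytes

vectors_b, graph_b = hnsw_memory_estimate(1_000_000_000, 768, 32)
print(f"Vectors: {vectors_b / 1e9:.0f} GB, graph: {graph_b / 1e9:.0f} GB")
# Roughly 3072 GB for the vectors and 256 GB for the graph structure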

Indexing Algorithms: IVF-PQ Deep Dive

Inverted File with Product Quantization represents a different architectural approach optimizing for memory efficiency over pure speed. Where HNSW prioritizes search performance, IVF-PQ dramatically reduces memory footprint while maintaining acceptable accuracy.

The Two-Stage Approach

IVF-PQ combines two complementary techniques. The Inverted File stage partitions the vector space using k-means clustering. Instead of searching all vectors, queries are compared only against vectors in the nearest clusters. This coarse filtering reduces the search space dramatically.

Product Quantization provides the compression. Each vector is divided into subvectors, and each subvector is quantized by assigning it to the nearest centroid in a learned codebook. Instead of storing the full vector, we store only the centroid IDs. This achieves compression ratios of 32x to 128x while enabling fast distance computation.

How Product Quantization Works

Consider a 128-dimensional vector. Product quantization splits it into 8 subvectors of 16 dimensions each. For each subspace, we learn 256 centroids using k-means. Each subvector gets replaced by the ID of its nearest centroid, just 1 byte instead of 64 bytes for the original subvector.

The entire 128-dimensional vector now requires only 8 bytes instead of 512 bytes, achieving 64x compression. Distance computation between a query vector and database vectors happens through lookup tables, making it extremely fast despite the compression.

Here is a Node.js implementation demonstrating the IVF-PQ concept:

// Plain JavaScript implementation -- no external libraries required

class IVFPQIndex {
    constructor(dimension, nClusters = 256, nSubVectors = 8) {
        this.dimension = dimension;
        this.nClusters = nClusters;
        this.nSubVectors = nSubVectors;
        this.subVectorDim = dimension / nSubVectors;
        
        this.coarseCentroids = null;
        this.invertedLists = null;
        this.pqCodebooks = null;
        this.compressedVectors = null;
    }
    
    async train(vectors) {
        console.log('Training IVF stage...');
        // Step 1: Train coarse quantizer (IVF)
        this.coarseCentroids = await this.trainKMeans(
            vectors, 
            this.nClusters
        );
        
        console.log('Training PQ codebooks...');
        // Step 2: Train product quantization codebooks
        this.pqCodebooks = [];
        for (let i = 0; i < this.nSubVectors; i++) {
            const start = i * this.subVectorDim;
            const end = start + this.subVectorDim;
            
            // Extract subvectors
            const subVectors = vectors.map(v => v.slice(start, end));
            
            // Train codebook for this subspace
            const codebook = await this.trainKMeans(
                subVectors, 
                256  // 256 centroids = 1 byte per subvector
            );
            
            this.pqCodebooks.push(codebook);
        }
        
        console.log('Building inverted lists...');
        // Step 3: Build inverted lists and compress vectors
        this.invertedLists = Array(this.nClusters).fill(null)
            .map(() => ({ ids: [], codes: [] }));
        
        vectors.forEach((vector, idx) => {
            // Find nearest coarse centroid
            const clusterId = this.findNearestCentroid(
                vector, 
                this.coarseCentroids
            );
            
            // Encode vector with PQ
            const pqCode = this.encodeVector(vector);
            
            // Add to inverted list
            this.invertedLists[clusterId].ids.push(idx);
            this.invertedLists[clusterId].codes.push(pqCode);
        });
    }
    
    encodeVector(vector) {
        const code = [];
        for (let i = 0; i < this.nSubVectors; i++) {
            const start = i * this.subVectorDim;
            const end = start + this.subVectorDim;
            const subVector = vector.slice(start, end);
            
            // Find nearest centroid in this codebook
            const centroidId = this.findNearestCentroid(
                subVector, 
                this.pqCodebooks[i]
            );
            
            code.push(centroidId);
        }
        return code;
    }
    
    async search(query, k = 10, nProbe = 10) {
        // Find nProbe nearest coarse centroids
        const clusterIds = this.findTopKCentroids(
            query, 
            this.coarseCentroids, 
            nProbe
        );
        
        // Build distance lookup tables for PQ
        const lookupTables = [];
        for (let i = 0; i < this.nSubVectors; i++) {
            const start = i * this.subVectorDim;
            const end = start + this.subVectorDim;
            const querySubVector = query.slice(start, end);
            
            // Compute distances to all centroids in this codebook
            const distances = this.pqCodebooks[i].map(centroid => 
                this.euclideanDistance(querySubVector, centroid)
            );
            
            lookupTables.push(distances);
        }
        
        // Search in selected clusters
        const candidates = [];
        for (const clusterId of clusterIds) {
            const list = this.invertedLists[clusterId];
            
            for (let i = 0; i < list.ids.length; i++) {
                const id = list.ids[i];
                const code = list.codes[i];
                
                // Compute approximate distance using lookup tables
                let distance = 0;
                for (let j = 0; j < this.nSubVectors; j++) {
                    distance += lookupTables[j][code[j]];
                }
                
                candidates.push({ id, distance });
            }
        }
        
        // Sort and return top k
        candidates.sort((a, b) => a.distance - b.distance);
        return candidates.slice(0, k);
    }
    
    async trainKMeans(vectors, k, maxIter = 100) {
        // Simple k-means implementation
        const n = vectors.length;
        const dim = vectors[0].length;
        
        // Initialize centroids randomly
        let centroids = vectors
            .slice(0, k)
            .map(v => [...v]);
        
        for (let iter = 0; iter < maxIter; iter++) {
            // Assign vectors to nearest centroid
            const assignments = vectors.map(v => 
                this.findNearestCentroid(v, centroids)
            );
            
            // Update centroids
            const newCentroids = Array(k).fill(null)
                .map(() => Array(dim).fill(0));
            const counts = Array(k).fill(0);
            
            vectors.forEach((v, i) => {
                const clusterId = assignments[i];
                counts[clusterId]++;
                v.forEach((val, j) => {
                    newCentroids[clusterId][j] += val;
                });
            });
            
            // Average
            for (let i = 0; i < k; i++) {
                if (counts[i] > 0) {
                    newCentroids[i] = newCentroids[i].map(
                        val => val / counts[i]
                    );
                }
            }
            
            centroids = newCentroids;
        }
        
        return centroids;
    }
    
    findNearestCentroid(vector, centroids) {
        let minDist = Infinity;
        let nearestIdx = 0;
        
        centroids.forEach((centroid, idx) => {
            const dist = this.euclideanDistance(vector, centroid);
            if (dist < minDist) {
                minDist = dist;
                nearestIdx = idx;
            }
        });
        
        return nearestIdx;
    }
    
    findTopKCentroids(vector, centroids, k) {
        const distances = centroids.map((centroid, idx) => ({
            idx,
            dist: this.euclideanDistance(vector, centroid)
        }));
        
        distances.sort((a, b) => a.dist - b.dist);
        return distances.slice(0, k).map(d => d.idx);
    }
    
    euclideanDistance(a, b) {
        return Math.sqrt(
            a.reduce((sum, val, i) => 
                sum + Math.pow(val - b[i], 2), 
                0
            )
        );
    }
}

// Example usage
async function main() {
    const dimension = 128;
    const numVectors = 10000;
    
    // Generate random vectors
    const vectors = Array(numVectors).fill(null)
        .map(() => Array(dimension).fill(0)
            .map(() => Math.random())
        );
    
    // Create and train index
    const index = new IVFPQIndex(dimension, 256, 8);
    await index.train(vectors);
    
    // Search
    const query = Array(dimension).fill(0).map(() => Math.random());
    const results = await index.search(query, 10, 20);
    
    console.log('Top 10 results:', results);
    
    // Calculate memory savings
    const originalSize = numVectors * dimension * 4; // 4 bytes per float
    const compressedSize = numVectors * 8; // 8 bytes per vector
    const compressionRatio = originalSize / compressedSize;
    
    console.log(`\nMemory usage:`);
    console.log(`Original: ${(originalSize / 1024 / 1024).toFixed(2)} MB`);
    console.log(`Compressed: ${(compressedSize / 1024 / 1024).toFixed(2)} MB`);
    console.log(`Compression ratio: ${compressionRatio.toFixed(1)}x`);
}

main().catch(console.error);

Performance Tradeoffs

IVF-PQ excels at memory efficiency. A billion 768-dimensional vectors normally require 3 terabytes of storage. With IVF-PQ compression at 64x, this drops to roughly 48 gigabytes. This dramatic reduction makes billion-scale search feasible on commodity hardware.

The accuracy cost is significant. Product quantization is lossy compression. Information is lost when replacing subvectors with centroid IDs. Typical configurations achieve 90-95% recall, meaning 5-10% of true nearest neighbors are missed. For many applications, this tradeoff is acceptable, especially when combined with refinement reranking.

Search speed depends heavily on the number of clusters probed. Probing more clusters improves recall but increases computation. Production systems often start with IVF-PQ for initial retrieval, then refine results by computing exact distances on a small candidate set.
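
Below is a minimal sketch of that refinement step: an approximate index such as IVF-PQ produces a candidate set, and exact distances are recomputed over the original, uncompressed vectors for just those candidates. The candidate ids are assumed to come from an index like the one sketched earlier.

import numpy as np

def refine_candidates(query, candidate_ids, original_vectors, k=10):
    """Rerank an approximate candidate set using exact distances.

    original_vectors: array of uncompressed float32 vectors, indexed by id
    """
    exact = [
        (idx, np.linalg.norm(query - original_vectors[idx]))
        for idx in candidate_ids
    ]
    exact.sort(key=lambda pair: pair[1])   # smallest distance first
    return exact[:k]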

graph TB
    A[IVF-PQ Architecture] --> B[IVF Stage]
    A --> C[PQ Stage]
    
    B --> B1[K-means Clustering]
    B1 --> B2[Create Voronoi Cells]
    B2 --> B3[Build Inverted Lists]
    B3 --> B4[Coarse Filtering]
    
    C --> C1[Split into Subvectors]
    C1 --> C2[Train Codebooks]
    C2 --> C3[Encode Subvectors]
    C3 --> C4[Store Centroid IDs]
    
    D[Search Process] --> E[Query Vector]
    E --> F[Find Nearest Clusters]
    F --> G[Select nProbe Clusters]
    G --> H[Build Lookup Tables]
    H --> I[Compute PQ Distances]
    I --> J[Rank Candidates]
    J --> K[Optional: Refine with Exact]
    K --> L[Return Top K]
    
    M[Compression Example] --> M1[768-dim vector]
    M1 --> M2[Split into 32 subvectors]
    M2 --> M3[24 dims each]
    M3 --> M4[Encode each with 1 byte]
    M4 --> M5[32 bytes total]
    M5 --> M6[96x compression]
    
    style A fill:#e1f5ff
    style D fill:#ffe1e1
    style M fill:#e1ffe1

Hybrid Search: Combining Approaches

Production systems rarely use pure vector search. The most effective architectures combine vector similarity with traditional keyword search and metadata filtering. This hybrid approach addresses the limitations of each method individually.

Why Hybrid Search Matters

Vector search excels at semantic understanding but struggles with exact matches. Searching for a specific product code, a person’s name, or an error number works poorly with pure semantic search. The system might return semantically similar but factually wrong results.

Keyword search handles exact matches perfectly but misses semantic relationships. Searching for “laptop battery issues” will not find documents about “notebook power problems” even though they discuss the same topic.

Metadata filters provide precise control. Users might want documents from a specific date range, author, or category. Combining these filters with semantic search produces results that are both relevant and meet specific criteria.

Implementation Strategies

Hybrid search can be implemented through several architectural patterns. The pre-filter approach applies metadata filters first, then performs vector search on the filtered subset. This works well when filters are highly selective.

The post-filter approach performs vector search first, then applies filters to the results. This suits cases where filters are less selective or when you need the top results regardless of filter criteria.

Score fusion combines vector similarity scores with keyword match scores. Reciprocal rank fusion is a popular technique that merges ranked lists from different retrieval methods. The final ranking considers both semantic relevance and keyword matches.
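
Reciprocal rank fusion itself is only a few lines; here is a minimal Python sketch of the idea (the C# service below applies the same formula inside an Azure search workflow). The constant k, conventionally around 60, damps the influence of top ranks.

def reciprocal_rank_fusion(ranked_lists, k=60, top=10):
    """Merge several ranked lists of document ids into one fused ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Example: fuse a keyword ranking with a vector ranking
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],   # keyword search results
    ["doc1", "doc9", "doc3"],   # vector search results
])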

Here is a C# implementation showing hybrid search with Azure Cognitive Search:

using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Models;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class HybridSearchService
{
    private readonly SearchClient _searchClient;
    private readonly string _vectorFieldName = "contentVector";
    
    public HybridSearchService(string endpoint, string indexName, string apiKey)
    {
        var credential = new AzureKeyCredential(apiKey);
        _searchClient = new SearchClient(
            new Uri(endpoint), 
            indexName, 
            credential
        );
    }
    
    public async Task<SearchResults<Document>> HybridSearchAsync(
        string searchText,
        float[] queryVector,
        Dictionary<string, object> filters = null,
        int top = 10,
        double vectorWeight = 0.5)
    {
        // Build search options
        var searchOptions = new SearchOptions
        {
            Size = top,
            Select = { "id", "title", "content", "category", "date" },
        };
        
        // Add filters if provided
        if (filters != null && filters.Any())
        {
            var filterExpressions = filters.Select(f => 
                $"{f.Key} eq '{f.Value}'"
            );
            searchOptions.Filter = string.Join(" and ", filterExpressions);
        }
        
        // Configure vector search
        var vectorQuery = new VectorizedQuery(queryVector)
        {
            KNearestNeighborsCount = top * 2,  // Over-retrieve for fusion
            Fields = { _vectorFieldName }
        };
        
        searchOptions.VectorSearch = new VectorSearchOptions
        {
            Queries = { vectorQuery }
        };
        
        // Execute hybrid search
        var response = await _searchClient.SearchAsync<Document>(
            searchText,
            searchOptions
        );
        
        return response.Value;
    }
    
    public async Task<List<ScoredDocument>> HybridSearchWithFusionAsync(
        string searchText,
        float[] queryVector,
        int top = 10)
    {
        // Perform keyword search
        var keywordOptions = new SearchOptions
        {
            Size = top * 2,
            Select = { "id", "title", "content" }
        };
        
        var keywordResults = await _searchClient.SearchAsync<Document>(
            searchText,
            keywordOptions
        );
        
        // Perform vector search
        var vectorOptions = new SearchOptions
        {
            Size = top * 2,
            Select = { "id", "title", "content" }
        };
        
        var vectorQuery = new VectorizedQuery(queryVector)
        {
            KNearestNeighborsCount = top * 2,
            Fields = { _vectorFieldName }
        };
        
        vectorOptions.VectorSearch = new VectorSearchOptions
        {
            Queries = { vectorQuery }
        };
        
        var vectorResults = await _searchClient.SearchAsync<Document>(
            null,  // No text query for pure vector search
            vectorOptions
        );
        
        // Perform Reciprocal Rank Fusion
        var fusedResults = PerformReciprocalRankFusion(
            await keywordResults.Value.GetResultsAsync().ToListAsync(),
            await vectorResults.Value.GetResultsAsync().ToListAsync(),
            top
        );
        
        return fusedResults;
    }
    
    private List<ScoredDocument> PerformReciprocalRankFusion(
        List<SearchResult<Document>> keywordResults,
        List<SearchResult<Document>> vectorResults,
        int top,
        int k = 60)
    {
        // Create rank maps
        var keywordRanks = keywordResults
            .Select((result, index) => new { 
                Id = result.Document.Id, 
                Rank = index + 1 
            })
            .ToDictionary(x => x.Id, x => x.Rank);
        
        var vectorRanks = vectorResults
            .Select((result, index) => new { 
                Id = result.Document.Id, 
                Rank = index + 1 
            })
            .ToDictionary(x => x.Id, x => x.Rank);
        
        // Combine all document IDs
        var allIds = keywordRanks.Keys
            .Union(vectorRanks.Keys)
            .ToList();
        
        // Calculate RRF scores
        var rrfScores = allIds.Select(id =>
        {
            double score = 0;
            
            if (keywordRanks.ContainsKey(id))
                score += 1.0 / (k + keywordRanks[id]);
            
            if (vectorRanks.ContainsKey(id))
                score += 1.0 / (k + vectorRanks[id]);
            
            // Get document from either result set
            var doc = keywordResults
                .FirstOrDefault(r => r.Document.Id == id)?.Document
                ?? vectorResults
                .FirstOrDefault(r => r.Document.Id == id)?.Document;
            
            return new ScoredDocument
            {
                Document = doc,
                Score = score,
                KeywordRank = keywordRanks.ContainsKey(id) 
                    ? keywordRanks[id] : (int?)null,
                VectorRank = vectorRanks.ContainsKey(id) 
                    ? vectorRanks[id] : (int?)null
            };
        })
        .OrderByDescending(x => x.Score)
        .Take(top)
        .ToList();
        
        return rrfScores;
    }
    
    public async Task<List<Document>> SearchWithReranking(
        string searchText,
        float[] queryVector,
        int initialTop = 100,
        int finalTop = 10)
    {
        // Initial retrieval with vector search
        var options = new SearchOptions
        {
            Size = initialTop,
            Select = { "id", "title", "content" }
        };
        
        var vectorQuery = new VectorizedQuery(queryVector)
        {
            KNearestNeighborsCount = initialTop,
            Fields = { _vectorFieldName }
        };
        
        options.VectorSearch = new VectorSearchOptions
        {
            Queries = { vectorQuery }
        };
        
        var results = await _searchClient.SearchAsync<Document>(
            searchText,
            options
        );
        
        var candidates = await results.Value
            .GetResultsAsync()
            .ToListAsync();
        
        // Rerank using exact distance computation
        // In production, this might call a reranking model
        var reranked = candidates
            .Select(r => new
            {
                Document = r.Document,
                Score = ComputeExactSimilarity(
                    queryVector, 
                    r.Document.ContentVector
                )
            })
            .OrderByDescending(x => x.Score)
            .Take(finalTop)
            .Select(x => x.Document)
            .ToList();
        
        return reranked;
    }
    
    private double ComputeExactSimilarity(float[] a, float[] b)
    {
        // Cosine similarity
        double dotProduct = 0;
        double normA = 0;
        double normB = 0;
        
        for (int i = 0; i < a.Length; i++)
        {
            dotProduct += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        
        return dotProduct / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}

public class Document
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string Content { get; set; }
    public string Category { get; set; }
    public DateTime Date { get; set; }
    public float[] ContentVector { get; set; }
}

public class ScoredDocument
{
    public Document Document { get; set; }
    public double Score { get; set; }
    public int? KeywordRank { get; set; }
    public int? VectorRank { get; set; }
}

Storage and Compression Techniques

Efficient storage determines whether your vector database fits in memory, loads quickly, and scales to billions of vectors. Several techniques optimize storage without sacrificing too much accuracy.

Quantization Methods

Scalar quantization reduces precision by converting 32-bit floats to 8-bit integers. Each dimension gets mapped to a range of 256 values. This achieves 4x compression with minimal accuracy loss for many applications. The conversion is simple and fast, making it practical for real-time systems.

Product quantization, as discussed with IVF-PQ, provides more aggressive compression. By quantizing subvectors instead of individual dimensions, it captures correlations between dimensions. Compression ratios of 32x to 128x are common, though with more accuracy impact than scalar quantization.

Binary quantization converts vectors to binary codes where each dimension becomes a single bit. This achieves 32x compression and enables extremely fast distance computation using bit operations. The accuracy loss is substantial, making it suitable mainly for initial filtering stages.
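
The sketch below illustrates scalar and binary quantization on a single vector. It is a conceptual illustration, not the exact scheme any particular database uses; production systems typically learn per-dimension ranges from the data rather than from one vector.

import numpy as np

def scalar_quantize(vector, lo=None, hi=None):
    """Map float32 values to uint8 (4x smaller), remembering the range."""
    lo = vector.min() if lo is None else lo
    hi = vector.max() if hi is None else hi
    scaled = (vector - lo) / (hi - lo + 1e-12)          # normalize to [0, 1]
    return np.round(scaled * 255).astype(np.uint8), lo, hi

def binary_quantize(vector):
    """One bit per dimension: keep only the sign (32x smaller)."""
    return np.packbits(vector > 0)

v = np.random.randn(768).astype(np.float32)
q8, lo, hi = scalar_quantize(v)          # 768 bytes instead of 3072
q1 = binary_quantize(v)                  # 96 bytes instead of 3072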

Sharding Strategies

Sharding distributes vectors across multiple machines, enabling horizontal scaling. Hash-based sharding assigns vectors to shards using a hash function on the vector ID. This provides even distribution but requires querying all shards for each search.

Locality-sensitive sharding attempts to place similar vectors on the same shard. Vectors that cluster together reside on the same machine, reducing the number of shards queried. However, maintaining balanced load across shards becomes challenging.

Replication provides fault tolerance and increased query throughput. Each shard is replicated across multiple machines. Queries can be distributed across replicas, and the system remains operational if a machine fails.
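
A hash-sharded search is conceptually a fan-out and merge; the sketch below assumes each shard object exposes a local search() method returning (id, distance) pairs and is purely illustrative of the pattern described above.

import hashlib
import heapq

def shard_for(vector_id, num_shards):
    """Hash-based shard assignment: even distribution, no locality."""
    digest = hashlib.md5(str(vector_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

def fan_out_search(query, shards, k=10):
    """Query every shard, then merge the partial results by distance."""
    partial = []
    for shard in shards:                 # in production these calls run in parallel
        partial.extend(shard.search(query, k))
    return heapq.nsmallest(k, partial, key=lambda pair: pair[1])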

graph TB
    A[Storage Optimization] --> B[Quantization]
    A --> C[Sharding]
    A --> D[Compression]
    
    B --> B1[Scalar: 4x]
    B --> B2[Product: 32-128x]
    B --> B3[Binary: 32x]
    
    C --> C1[Hash-based]
    C --> C2[Locality-sensitive]
    C --> C3[Replication]
    
    D --> D1[Index Compression]
    D --> D2[Vector Compression]
    D --> D3[Metadata Compression]
    
    E[Example: 1B vectors] --> F[768 dimensions]
    F --> G[Original: 3TB]
    G --> H[Scalar Quant: 750GB]
    G --> I[PQ 32x: 96GB]
    G --> J[PQ 64x: 48GB]
    
    K[Accuracy Impact] --> L[Scalar: ~99%]
    K --> M[PQ 32x: ~95%]
    K --> N[PQ 64x: ~90%]
    K --> O[Binary: ~75%]
    
    style A fill:#e1f5ff
    style E fill:#ffe1e1
    style K fill:#e1ffe1

What’s Next

This part explored the architectural foundations of vector databases: native versus multimodel approaches, the algorithms that power billion-scale search, and the storage techniques that make it practical. You now understand how HNSW achieves logarithmic search complexity, how IVF-PQ compresses vectors by 64x while maintaining acceptable accuracy, and how hybrid search combines the strengths of multiple retrieval methods.

Part 3 will evaluate the major players in the vector database landscape. We will compare Pinecone, Milvus, Weaviate, Qdrant, Chroma, and pgvector across performance benchmarks, developer experience, and production readiness. You will learn when to choose each solution and how Azure-specific options like SQL Server 2025 and Cosmos DB fit into the ecosystem.

The technical foundation is complete. Now we move to practical selection and implementation.
