In Part 1, we explored what vector databases are and why they have become essential infrastructure for AI applications. Now we dive into the technical depths: how these databases actually work, the algorithms that power billion-scale searches, and the architectural decisions that separate production-ready systems from proof-of-concept demos.
Understanding vector database architecture is not just an academic exercise. The choices you make about indexing algorithms, storage formats, and search strategies directly impact whether your system returns results in 10 milliseconds or 10 seconds, whether it consumes 10 gigabytes or 10 terabytes of memory, and whether it achieves 95% recall or 60% recall on your queries.
Native vs Multimodel Vector Databases
The vector database landscape divides into two fundamental categories: native vector databases built from the ground up for high-dimensional similarity search, and multimodel databases that added vector capabilities to existing systems.
Native Vector Databases
Native vector databases like Pinecone, Milvus, Weaviate, and Qdrant were designed specifically for vector operations. Their entire architecture optimizes for storing, indexing, and searching high-dimensional embeddings. Every component, from storage layout to query execution, assumes vector data as the primary workload.
The advantages are substantial. Native systems can make aggressive optimizations that are impossible in general-purpose databases. They implement specialized index structures like HNSW graphs that require tight integration between storage and query processing. Memory management strategies assume vector-specific access patterns. The entire system can be tuned for the unique characteristics of similarity search.
However, native vector databases introduce operational complexity. You now have another database to manage, monitor, and maintain. Data synchronization becomes necessary if you need to combine vector search with traditional database operations. The learning curve for teams familiar with SQL databases can be steep.
Multimodel Databases
Multimodel databases added vector search to existing systems. PostgreSQL with pgvector, Azure SQL Server 2025, Elasticsearch, MongoDB Atlas, and Redis all now support vector operations alongside their traditional functionality.
The appeal is obvious. Teams already running PostgreSQL do not need to introduce a new database. Vector data lives alongside relational data, eliminating synchronization complexity. A single query can combine traditional filters with vector similarity search. The operational burden stays the same.
The tradeoff is performance. Multimodel databases must balance vector operations with their primary workload. Index structures cannot be as aggressive because they need to coexist with B-trees and other traditional indexes. Memory management serves multiple data types. Query optimization considers many access patterns, not just vector similarity.
For many production systems, multimodel databases provide the right balance. The performance gap has narrowed considerably. PostgreSQL with pgvector handles millions of vectors efficiently. Azure SQL Server 2025 brings enterprise-grade vector capabilities. Unless you are operating at massive scale or need the absolute fastest performance, multimodel solutions often make architectural sense.
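To make the combined-query advantage concrete, here is a minimal sketch of what that looks like with PostgreSQL and pgvector, issued from Python. The connection string, the documents table, and its column names are hypothetical placeholders; the <=> operator is pgvector's cosine-distance operator:
import psycopg2

# Placeholder query embedding; in practice this comes from an embedding model
query_embedding = [0.01] * 768

# Hypothetical connection string and schema
conn = psycopg2.connect("dbname=appdb user=app password=secret host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, title
        FROM documents
        WHERE category = %s
          AND published_at >= %s
        ORDER BY embedding <=> %s::vector  -- cosine distance in pgvector
        LIMIT 10
        """,
        ("support", "2024-01-01", str(query_embedding)),
    )
    for doc_id, title in cur.fetchall():
        print(doc_id, title)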
graph TB
A[Vector Database Types] --> B[Native Vector Databases]
A --> C[Multimodel Databases]
B --> D[Pinecone]
B --> E[Milvus]
B --> F[Weaviate]
B --> G[Qdrant]
B --> H[Chroma]
C --> I[PostgreSQL + pgvector]
C --> J[Azure SQL Server 2025]
C --> K[Elasticsearch]
C --> L[MongoDB Atlas]
C --> M[Redis]
B --> N[Advantages]
N --> N1[Optimized Performance]
N --> N2[Advanced Algorithms]
N --> N3[Vector-First Design]
B --> O[Tradeoffs]
O --> O1[Operational Complexity]
O --> O2[Data Synchronization]
O --> O3[Additional System]
C --> P[Advantages]
P --> P1[Single Database]
P --> P2[Familiar Operations]
P --> P3[Combined Queries]
C --> Q[Tradeoffs]
Q --> Q1[Performance Compromises]
Q --> Q2[Feature Lag]
Q --> Q3[Shared Resources]
style A fill:#e1f5ff
style B fill:#ffe1e1
style C fill:#e1ffe1
Indexing Algorithms: HNSW Deep Dive
Hierarchical Navigable Small World graphs represent the state-of-the-art for approximate nearest neighbor search. Understanding HNSW is essential because it powers most production vector databases, from Pinecone to pgvector.
The Core Concept: Navigable Small Worlds
HNSW builds on the small world phenomenon. Just as any two people can be connected through roughly six degrees of separation, HNSW constructs graphs where any two vectors can be reached through a small number of hops. The graph combines long-range connections for rapid navigation with short-range connections for precise local search.
Think of it like navigating a city. Interstate highways get you quickly to the right region. Local roads bring you to the right neighborhood. Side streets lead to the exact address. HNSW implements this same hierarchical navigation in high-dimensional space.
The Hierarchical Structure
HNSW creates multiple layers of graphs stacked on top of each other. The bottom layer contains all vectors. Each upper layer contains a progressively smaller subset, sampled with exponentially decaying probability. The top layer might have only a handful of vectors, while the bottom layer contains millions.
Each layer is a navigable small world graph where vectors connect to their nearest neighbors. Upper layers have longer connections spanning greater distances. Lower layers have denser, shorter connections providing fine-grained accuracy.
Search begins at the top layer with a fixed entry point. The algorithm greedily navigates to the nearest neighbor at each step. When no closer neighbor exists in the current layer, search descends to the layer below and continues. This process repeats until reaching the bottom layer, where the final nearest neighbors are identified.
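The descent can be sketched in a few lines of Python. This is a toy illustration of the greedy walk described above, not a full HNSW implementation; a real index also keeps an efSearch-sized candidate list at the bottom layer rather than tracking a single node:
import numpy as np

def greedy_search(layers, vectors, entry_point, query):
    """Toy layered greedy descent.

    layers: list of adjacency dicts {node_id: [neighbor_ids]}, top layer first
    vectors: dict mapping node_id to a numpy array
    """
    current = entry_point
    for graph in layers:  # descend from the sparse top layer to layer 0
        improved = True
        while improved:  # greedy walk within the current layer
            improved = False
            best_dist = np.linalg.norm(vectors[current] - query)
            for neighbor in graph.get(current, []):
                dist = np.linalg.norm(vectors[neighbor] - query)
                if dist < best_dist:  # move to a strictly closer neighbor
                    current, best_dist = neighbor, dist
                    improved = True
    return current  # closest node reached on the bottom layer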
graph TB
A[HNSW Structure] --> B[Layer 3: Sparse]
A --> C[Layer 2: Medium Density]
A --> D[Layer 1: Dense]
A --> E[Layer 0: All Vectors]
B --> B1[2-4 vectors]
B --> B2[Long connections]
B --> B3[Fast navigation]
C --> C1[Hundreds of vectors]
C --> C2[Medium connections]
D --> D1[Thousands of vectors]
D --> D2[Short connections]
E --> E1[All data points]
E --> E2[Precise connections]
E --> E3[Final refinement]
F[Search Process] --> G[Start at top layer]
G --> H[Greedy navigation]
H --> I[Move to nearest neighbor]
I --> J{Found local minimum?}
J -->|No| I
J -->|Yes| K[Descend to next layer]
K --> L{At bottom layer?}
L -->|No| H
L -->|Yes| M[Return k nearest neighbors]
style A fill:#e1f5ff
style F fill:#ffe1e1
style M fill:#e1ffe1
Key Parameters
HNSW performance depends on three critical parameters that control the tradeoff between speed, accuracy, and memory usage.
M determines the maximum number of connections per node in each layer. Higher M creates denser graphs with better recall but increases memory usage and construction time. Typical values range from 16 to 64. For most applications, M equals 16 provides a good balance.
efConstruction controls thoroughness during index building. The algorithm maintains a dynamic list of efConstruction candidate neighbors when inserting new vectors. Higher values produce better quality graphs but slow construction. Common values range from 100 to 400.
efSearch determines search thoroughness at query time. The algorithm maintains a dynamic candidate list of size efSearch while exploring the bottom layer. Higher values improve recall at the cost of latency. Unlike efConstruction, efSearch can be tuned at query time, allowing dynamic tradeoffs between speed and accuracy.
Here is a Python implementation demonstrating HNSW with configurable parameters:
import numpy as np
import hnswlib

class HNSWIndex:
    def __init__(self, dimension, max_elements=10000, M=16, ef_construction=200):
        """
        Initialize HNSW index

        Args:
            dimension: Vector dimensionality
            max_elements: Maximum number of vectors
            M: Maximum number of connections per node
            ef_construction: Size of dynamic candidate list during construction
        """
        self.dimension = dimension
        self.index = hnswlib.Index(space='cosine', dim=dimension)

        # Initialize index
        self.index.init_index(
            max_elements=max_elements,
            M=M,
            ef_construction=ef_construction,
            random_seed=42
        )
        self.current_count = 0

    def add_vectors(self, vectors, ids=None):
        """Add vectors to the index"""
        if ids is None:
            ids = np.arange(self.current_count,
                            self.current_count + len(vectors))
        self.index.add_items(vectors, ids)
        self.current_count += len(vectors)
        return ids

    def search(self, query_vectors, k=10, ef_search=50):
        """
        Search for k nearest neighbors

        Args:
            query_vectors: Query vectors
            k: Number of nearest neighbors
            ef_search: Size of dynamic candidate list during search
        """
        # Set search parameter
        self.index.set_ef(ef_search)

        # Perform search
        labels, distances = self.index.knn_query(query_vectors, k=k)
        return labels, distances

    def save_index(self, filepath):
        """Save index to disk"""
        self.index.save_index(filepath)

    def load_index(self, filepath):
        """Load index from disk"""
        self.index.load_index(filepath)

# Example usage
if __name__ == "__main__":
    # Create sample data
    dimension = 128
    num_vectors = 10000

    # Generate random vectors
    vectors = np.random.random((num_vectors, dimension)).astype('float32')

    # Create index with custom parameters
    index = HNSWIndex(
        dimension=dimension,
        max_elements=num_vectors,
        M=32,  # More connections for better recall
        ef_construction=400  # Thorough construction
    )

    # Add vectors
    print("Building index...")
    index.add_vectors(vectors)

    # Create query
    query = np.random.random((1, dimension)).astype('float32')

    # Search with different ef_search values
    for ef in [10, 50, 100, 200]:
        labels, distances = index.search(
            query,
            k=10,
            ef_search=ef
        )
        print(f"\nef_search={ef}")
        print(f"Nearest neighbors: {labels[0]}")
        print(f"Distances: {distances[0]}")

    # Save index
    index.save_index("hnsw_index.bin")
Performance Characteristics
HNSW achieves logarithmic time complexity for search operations, making it remarkably scalable. Search time grows slowly even as the dataset expands from thousands to billions of vectors. This is fundamentally different from brute-force search, which scales linearly.
The memory footprint is the main limitation. HNSW stores the full graph structure in memory, requiring roughly M times 8 bytes per vector for connections, plus the original vector data. For a billion vectors with M equals 32, this means approximately 256 gigabytes just for the graph structure.
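A quick back-of-envelope calculation makes those numbers concrete. The one billion vectors and M equals 32 come from the paragraph above; the 768-dimensional float32 vectors are an assumed example for the raw data:
num_vectors = 1_000_000_000
M = 32
dimension = 768

graph_bytes = num_vectors * M * 8           # roughly 8 bytes per connection
vector_bytes = num_vectors * dimension * 4  # float32 storage for the vectors themselves

print(f"Graph structure: {graph_bytes / 1e9:.0f} GB")    # ~256 GB
print(f"Raw vectors:     {vector_bytes / 1e12:.1f} TB")  # ~3.1 TB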
Construction time can be significant. Building an HNSW index over 100 million vectors might take hours, depending on parameters. However, incremental updates work efficiently. Adding new vectors does not require rebuilding the entire index.
Indexing Algorithms: IVF-PQ Deep Dive
Inverted File with Product Quantization represents a different architectural approach optimizing for memory efficiency over pure speed. Where HNSW prioritizes search performance, IVF-PQ dramatically reduces memory footprint while maintaining acceptable accuracy.
The Two-Stage Approach
IVF-PQ combines two complementary techniques. The Inverted File stage partitions the vector space using k-means clustering. Instead of searching all vectors, queries are compared only against vectors in the nearest clusters. This coarse filtering reduces the search space dramatically.
Product Quantization provides the compression. Each vector is divided into subvectors, and each subvector is quantized by assigning it to the nearest centroid in a learned codebook. Instead of storing the full vector, we store only the centroid IDs. This achieves compression ratios of 32x to 128x while enabling fast distance computation.
How Product Quantization Works
Consider a 128-dimensional vector. Product quantization splits it into 8 subvectors of 16 dimensions each. For each subspace, we learn 256 centroids using k-means. Each subvector gets replaced by the ID of its nearest centroid, just 1 byte instead of 64 bytes for the original subvector.
The entire 128-dimensional vector now requires only 8 bytes instead of 512 bytes, achieving 64x compression. Distance computation between a query vector and database vectors happens through lookup tables, making it extremely fast despite the compression.
Here is a Node.js implementation demonstrating the IVF-PQ concept:
const tf = require('@tensorflow/tfjs-node');
class IVFPQIndex {
constructor(dimension, nClusters = 256, nSubVectors = 8) {
this.dimension = dimension;
this.nClusters = nClusters;
this.nSubVectors = nSubVectors;
this.subVectorDim = dimension / nSubVectors;
this.coarseCentroids = null;
this.invertedLists = null;
this.pqCodebooks = null;
this.compressedVectors = null;
}
async train(vectors) {
console.log('Training IVF stage...');
// Step 1: Train coarse quantizer (IVF)
this.coarseCentroids = await this.trainKMeans(
vectors,
this.nClusters
);
console.log('Training PQ codebooks...');
// Step 2: Train product quantization codebooks
this.pqCodebooks = [];
for (let i = 0; i < this.nSubVectors; i++) {
const start = i * this.subVectorDim;
const end = start + this.subVectorDim;
// Extract subvectors
const subVectors = vectors.map(v => v.slice(start, end));
// Train codebook for this subspace
const codebook = await this.trainKMeans(
subVectors,
256 // 256 centroids = 1 byte per subvector
);
this.pqCodebooks.push(codebook);
}
console.log('Building inverted lists...');
// Step 3: Build inverted lists and compress vectors
this.invertedLists = Array(this.nClusters).fill(null)
.map(() => ({ ids: [], codes: [] }));
vectors.forEach((vector, idx) => {
// Find nearest coarse centroid
const clusterId = this.findNearestCentroid(
vector,
this.coarseCentroids
);
// Encode vector with PQ
const pqCode = this.encodeVector(vector);
// Add to inverted list
this.invertedLists[clusterId].ids.push(idx);
this.invertedLists[clusterId].codes.push(pqCode);
});
}
encodeVector(vector) {
const code = [];
for (let i = 0; i < this.nSubVectors; i++) {
const start = i * this.subVectorDim;
const end = start + this.subVectorDim;
const subVector = vector.slice(start, end);
// Find nearest centroid in this codebook
const centroidId = this.findNearestCentroid(
subVector,
this.pqCodebooks[i]
);
code.push(centroidId);
}
return code;
}
async search(query, k = 10, nProbe = 10) {
// Find nProbe nearest coarse centroids
const clusterIds = this.findTopKCentroids(
query,
this.coarseCentroids,
nProbe
);
// Build distance lookup tables for PQ
const lookupTables = [];
for (let i = 0; i < this.nSubVectors; i++) {
const start = i * this.subVectorDim;
const end = start + this.subVectorDim;
const querySubVector = query.slice(start, end);
// Compute distances to all centroids in this codebook
const distances = this.pqCodebooks[i].map(centroid =>
this.euclideanDistance(querySubVector, centroid)
);
lookupTables.push(distances);
}
// Search in selected clusters
const candidates = [];
for (const clusterId of clusterIds) {
const list = this.invertedLists[clusterId];
for (let i = 0; i < list.ids.length; i++) {
const id = list.ids[i];
const code = list.codes[i];
// Compute approximate distance using lookup tables
let distance = 0;
for (let j = 0; j < this.nSubVectors; j++) {
distance += lookupTables[j][code[j]];
}
candidates.push({ id, distance });
}
}
// Sort and return top k
candidates.sort((a, b) => a.distance - b.distance);
return candidates.slice(0, k);
}
async trainKMeans(vectors, k, maxIter = 100) {
// Simple k-means implementation
const n = vectors.length;
const dim = vectors[0].length;
// Initialize centroids from the first k vectors (a simplification of random init)
let centroids = vectors
.slice(0, k)
.map(v => [...v]);
for (let iter = 0; iter < maxIter; iter++) {
// Assign vectors to nearest centroid
const assignments = vectors.map(v =>
this.findNearestCentroid(v, centroids)
);
// Update centroids
const newCentroids = Array(k).fill(null)
.map(() => Array(dim).fill(0));
const counts = Array(k).fill(0);
vectors.forEach((v, i) => {
const clusterId = assignments[i];
counts[clusterId]++;
v.forEach((val, j) => {
newCentroids[clusterId][j] += val;
});
});
// Average
for (let i = 0; i < k; i++) {
if (counts[i] > 0) {
newCentroids[i] = newCentroids[i].map(
val => val / counts[i]
);
}
}
centroids = newCentroids;
}
return centroids;
}
findNearestCentroid(vector, centroids) {
let minDist = Infinity;
let nearestIdx = 0;
centroids.forEach((centroid, idx) => {
const dist = this.euclideanDistance(vector, centroid);
if (dist < minDist) {
minDist = dist;
nearestIdx = idx;
}
});
return nearestIdx;
}
findTopKCentroids(vector, centroids, k) {
const distances = centroids.map((centroid, idx) => ({
idx,
dist: this.euclideanDistance(vector, centroid)
}));
distances.sort((a, b) => a.dist - b.dist);
return distances.slice(0, k).map(d => d.idx);
}
euclideanDistance(a, b) {
return Math.sqrt(
a.reduce((sum, val, i) =>
sum + Math.pow(val - b[i], 2),
0
)
);
}
}
// Example usage
async function main() {
const dimension = 128;
const numVectors = 10000;
// Generate random vectors
const vectors = Array(numVectors).fill(null)
.map(() => Array(dimension).fill(0)
.map(() => Math.random())
);
// Create and train index
const index = new IVFPQIndex(dimension, 256, 8);
await index.train(vectors);
// Search
const query = Array(dimension).fill(0).map(() => Math.random());
const results = await index.search(query, 10, 20);
console.log('Top 10 results:', results);
// Calculate memory savings
const originalSize = numVectors * dimension * 4; // 4 bytes per float
const compressedSize = numVectors * 8; // 8 bytes per vector
const compressionRatio = originalSize / compressedSize;
console.log(`\nMemory usage:`);
console.log(`Original: ${(originalSize / 1024 / 1024).toFixed(2)} MB`);
console.log(`Compressed: ${(compressedSize / 1024 / 1024).toFixed(2)} MB`);
console.log(`Compression ratio: ${compressionRatio.toFixed(1)}x`);
}
main();
Performance Tradeoffs
IVF-PQ excels at memory efficiency. A billion 768-dimensional vectors normally require 3 terabytes of storage. With IVF-PQ compression at 64x, this drops to roughly 48 gigabytes. This dramatic reduction makes billion-scale search feasible on commodity hardware.
The accuracy cost is significant. Product quantization is lossy compression. Information is lost when replacing subvectors with centroid IDs. Typical configurations achieve 90-95% recall, meaning 5-10% of true nearest neighbors are missed. For many applications, this tradeoff is acceptable, especially when combined with refinement reranking.
Search speed depends heavily on the number of clusters probed. Probing more clusters improves recall but increases computation. Production systems often start with IVF-PQ for initial retrieval, then refine results by computing exact distances on a small candidate set.
graph TB
A[IVF-PQ Architecture] --> B[IVF Stage]
A --> C[PQ Stage]
B --> B1[K-means Clustering]
B1 --> B2[Create Voronoi Cells]
B2 --> B3[Build Inverted Lists]
B3 --> B4[Coarse Filtering]
C --> C1[Split into Subvectors]
C1 --> C2[Train Codebooks]
C2 --> C3[Encode Subvectors]
C3 --> C4[Store Centroid IDs]
D[Search Process] --> E[Query Vector]
E --> F[Find Nearest Clusters]
F --> G[Select nProbe Clusters]
G --> H[Build Lookup Tables]
H --> I[Compute PQ Distances]
I --> J[Rank Candidates]
J --> K[Optional: Refine with Exact]
K --> L[Return Top K]
M[Compression Example] --> M1[768-dim vector]
M1 --> M2[Split into 32 subvectors]
M2 --> M3[24 dims each]
M3 --> M4[Encode each with 1 byte]
M4 --> M5[32 bytes total]
M5 --> M6[96x compression]
style A fill:#e1f5ff
style D fill:#ffe1e1
style M fill:#e1ffe1
Hybrid Search: Combining Approaches
Production systems rarely use pure vector search. The most effective architectures combine vector similarity with traditional keyword search and metadata filtering. This hybrid approach addresses the limitations of each method individually.
Why Hybrid Search Matters
Vector search excels at semantic understanding but struggles with exact matches. Searching for a specific product code, a person’s name, or an error number works poorly with pure semantic search. The system might return semantically similar but factually wrong results.
Keyword search handles exact matches perfectly but misses semantic relationships. Searching for “laptop battery issues” will not find documents about “notebook power problems” even though they discuss the same topic.
Metadata filters provide precise control. Users might want documents from a specific date range, author, or category. Combining these filters with semantic search produces results that are both relevant and meet specific criteria.
Implementation Strategies
Hybrid search can be implemented through several architectural patterns. The pre-filter approach applies metadata filters first, then performs vector search on the filtered subset. This works well when filters are highly selective.
The post-filter approach performs vector search first, then applies filters to the results. This suits cases where filters are less selective or when you need the top results regardless of filter criteria.
Score fusion combines vector similarity scores with keyword match scores. Reciprocal rank fusion is a popular technique that merges ranked lists from different retrieval methods. The final ranking considers both semantic relevance and keyword matches.
Here is a C# implementation showing hybrid search with Azure Cognitive Search:
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Models;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
public class HybridSearchService
{
private readonly SearchClient _searchClient;
private readonly string _vectorFieldName = "contentVector";
public HybridSearchService(string endpoint, string indexName, string apiKey)
{
var credential = new AzureKeyCredential(apiKey);
_searchClient = new SearchClient(
new Uri(endpoint),
indexName,
credential
);
}
public async Task<SearchResults<Document>> HybridSearchAsync(
string searchText,
float[] queryVector,
Dictionary<string, object> filters = null,
int top = 10,
double vectorWeight = 0.5)
{
// Build search options
var searchOptions = new SearchOptions
{
Size = top,
Select = { "id", "title", "content", "category", "date" },
};
// Add filters if provided
if (filters != null && filters.Any())
{
var filterExpressions = filters.Select(f =>
$"{f.Key} eq '{f.Value}'"
);
searchOptions.Filter = string.Join(" and ", filterExpressions);
}
// Configure vector search
var vectorQuery = new VectorizedQuery(queryVector)
{
KNearestNeighborsCount = top * 2, // Over-retrieve for fusion
Fields = { _vectorFieldName }
};
searchOptions.VectorSearch = new VectorSearchOptions
{
Queries = { vectorQuery }
};
// Execute hybrid search
var response = await _searchClient.SearchAsync<Document>(
searchText,
searchOptions
);
return response.Value;
}
public async Task<List<ScoredDocument>> HybridSearchWithFusionAsync(
string searchText,
float[] queryVector,
int top = 10)
{
// Perform keyword search
var keywordOptions = new SearchOptions
{
Size = top * 2,
Select = { "id", "title", "content" }
};
var keywordResults = await _searchClient.SearchAsync<Document>(
searchText,
keywordOptions
);
// Perform vector search
var vectorOptions = new SearchOptions
{
Size = top * 2,
Select = { "id", "title", "content" }
};
var vectorQuery = new VectorizedQuery(queryVector)
{
KNearestNeighborsCount = top * 2,
Fields = { _vectorFieldName }
};
vectorOptions.VectorSearch = new VectorSearchOptions
{
Queries = { vectorQuery }
};
var vectorResults = await _searchClient.SearchAsync<Document>(
null, // No text query for pure vector search
vectorOptions
);
// Perform Reciprocal Rank Fusion
var fusedResults = PerformReciprocalRankFusion(
await keywordResults.Value.GetResultsAsync().ToListAsync(),
await vectorResults.Value.GetResultsAsync().ToListAsync(),
top
);
return fusedResults;
}
private List<ScoredDocument> PerformReciprocalRankFusion(
List<SearchResult<Document>> keywordResults,
List<SearchResult<Document>> vectorResults,
int top,
int k = 60)
{
// Create rank maps
var keywordRanks = keywordResults
.Select((result, index) => new {
Id = result.Document.Id,
Rank = index + 1
})
.ToDictionary(x => x.Id, x => x.Rank);
var vectorRanks = vectorResults
.Select((result, index) => new {
Id = result.Document.Id,
Rank = index + 1
})
.ToDictionary(x => x.Id, x => x.Rank);
// Combine all document IDs
var allIds = keywordRanks.Keys
.Union(vectorRanks.Keys)
.ToList();
// Calculate RRF scores
var rrfScores = allIds.Select(id =>
{
double score = 0;
if (keywordRanks.ContainsKey(id))
score += 1.0 / (k + keywordRanks[id]);
if (vectorRanks.ContainsKey(id))
score += 1.0 / (k + vectorRanks[id]);
// Get document from either result set
var doc = keywordResults
.FirstOrDefault(r => r.Document.Id == id)?.Document
?? vectorResults
.FirstOrDefault(r => r.Document.Id == id)?.Document;
return new ScoredDocument
{
Document = doc,
Score = score,
KeywordRank = keywordRanks.ContainsKey(id)
? keywordRanks[id] : (int?)null,
VectorRank = vectorRanks.ContainsKey(id)
? vectorRanks[id] : (int?)null
};
})
.OrderByDescending(x => x.Score)
.Take(top)
.ToList();
return rrfScores;
}
public async Task<List<Document>> SearchWithReranking(
string searchText,
float[] queryVector,
int initialTop = 100,
int finalTop = 10)
{
// Initial retrieval with vector search
var options = new SearchOptions
{
Size = initialTop,
Select = { "id", "title", "content" }
};
var vectorQuery = new VectorizedQuery(queryVector)
{
KNearestNeighborsCount = initialTop,
Fields = { _vectorFieldName }
};
options.VectorSearch = new VectorSearchOptions
{
Queries = { vectorQuery }
};
var results = await _searchClient.SearchAsync<Document>(
searchText,
options
);
var candidates = await results.Value
.GetResultsAsync()
.ToListAsync();
// Rerank using exact distance computation
// In production, this might call a reranking model
var reranked = candidates
.Select(r => new
{
Document = r.Document,
Score = ComputeExactSimilarity(
queryVector,
r.Document.ContentVector
)
})
.OrderByDescending(x => x.Score)
.Take(finalTop)
.Select(x => x.Document)
.ToList();
return reranked;
}
private double ComputeExactSimilarity(float[] a, float[] b)
{
// Cosine similarity
double dotProduct = 0;
double normA = 0;
double normB = 0;
for (int i = 0; i < a.Length; i++)
{
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.Sqrt(normA) * Math.Sqrt(normB));
}
}
public class Document
{
public string Id { get; set; }
public string Title { get; set; }
public string Content { get; set; }
public string Category { get; set; }
public DateTime Date { get; set; }
public float[] ContentVector { get; set; }
}
public class ScoredDocument
{
public Document Document { get; set; }
public double Score { get; set; }
public int? KeywordRank { get; set; }
public int? VectorRank { get; set; }
}
Storage and Compression Techniques
Efficient storage determines whether your vector database fits in memory, loads quickly, and scales to billions of vectors. Several techniques optimize storage without sacrificing too much accuracy.
Quantization Methods
Scalar quantization reduces precision by converting 32-bit floats to 8-bit integers. Each dimension gets mapped to a range of 256 values. This achieves 4x compression with minimal accuracy loss for many applications. The conversion is simple and fast, making it practical for real-time systems.
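A minimal Python sketch of the idea, assuming per-dimension min/max calibration; real implementations differ in how they learn the ranges, but the mapping onto 256 levels is the same:
import numpy as np

def train_scalar_quantizer(vectors):
    """Learn per-dimension ranges from the data."""
    return vectors.min(axis=0), vectors.max(axis=0)

def quantize(vectors, lo, hi):
    """Map each float32 dimension onto one of 256 uint8 levels (4x smaller)."""
    scale = np.where(hi > lo, hi - lo, 1.0)
    codes = np.round((vectors - lo) / scale * 255.0)
    return codes.clip(0, 255).astype(np.uint8)

def dequantize(codes, lo, hi):
    """Approximate reconstruction used at distance-computation time."""
    scale = np.where(hi > lo, hi - lo, 1.0)
    return codes.astype(np.float32) / 255.0 * scale + lo

vectors = np.random.rand(1000, 128).astype(np.float32)
lo, hi = train_scalar_quantizer(vectors)
codes = quantize(vectors, lo, hi)
print("Max reconstruction error:", np.abs(vectors - dequantize(codes, lo, hi)).max())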
Product quantization, as discussed with IVF-PQ, provides more aggressive compression. By quantizing subvectors instead of individual dimensions, it captures correlations between dimensions. Compression ratios of 32x to 128x are common, though with more accuracy impact than scalar quantization.
Binary quantization converts vectors to binary codes where each dimension becomes a single bit. This achieves 32x compression and enables extremely fast distance computation using bit operations. The accuracy loss is substantial, making it suitable mainly for initial filtering stages.
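The mechanics can be sketched in a few lines of Python; thresholding each dimension at its mean is just one illustrative way to binarize:
import numpy as np

def binarize(vectors, thresholds):
    """One bit per dimension, packed into bytes (128 dims -> 16 bytes)."""
    bits = (vectors > thresholds).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_distance(a, b):
    """XOR the packed codes, then count the differing bits."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

vectors = np.random.rand(1000, 128).astype(np.float32)
thresholds = vectors.mean(axis=0)
codes = binarize(vectors, thresholds)

query_code = binarize(np.random.rand(1, 128).astype(np.float32), thresholds)[0]
distances = np.array([hamming_distance(query_code, code) for code in codes])
print("Closest ids by Hamming distance:", distances.argsort()[:10])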
Sharding Strategies
Sharding distributes vectors across multiple machines, enabling horizontal scaling. Hash-based sharding assigns vectors to shards using a hash function on the vector ID. This provides even distribution but requires querying all shards for each search.
Locality-sensitive sharding attempts to place similar vectors on the same shard. Vectors that cluster together reside on the same machine, reducing the number of shards queried. However, maintaining balanced load across shards becomes challenging.
Replication provides fault tolerance and increased query throughput. Each shard is replicated across multiple machines. Queries can be distributed across replicas, and the system remains operational if a machine fails.
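A toy Python sketch of the hash-based variant is shown below: each vector id hashes to a shard, and every query fans out to all shards before the per-shard results are merged. The in-memory lists and brute-force scan are stand-ins for real shard servers and their indexes:
import hashlib
import heapq
import numpy as np

NUM_SHARDS = 4
shards = [[] for _ in range(NUM_SHARDS)]  # each shard holds (id, vector) pairs

def shard_for(vector_id):
    """Deterministically map a vector id to a shard."""
    digest = hashlib.md5(str(vector_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def insert(vector_id, vector):
    shards[shard_for(vector_id)].append((vector_id, vector))

def search(query, k=5):
    """Scatter the query to every shard, then merge the candidates."""
    candidates = []
    for shard in shards:
        for vector_id, vector in shard:
            candidates.append((float(np.linalg.norm(vector - query)), vector_id))
    return heapq.nsmallest(k, candidates)

for i in range(1000):
    insert(i, np.random.rand(64).astype(np.float32))
print(search(np.random.rand(64).astype(np.float32)))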
graph TB
A[Storage Optimization] --> B[Quantization]
A --> C[Sharding]
A --> D[Compression]
B --> B1[Scalar: 4x]
B --> B2[Product: 32-128x]
B --> B3[Binary: 32x]
C --> C1[Hash-based]
C --> C2[Locality-sensitive]
C --> C3[Replication]
D --> D1[Index Compression]
D --> D2[Vector Compression]
D --> D3[Metadata Compression]
E[Example: 1B vectors] --> F[768 dimensions]
F --> G[Original: 3TB]
G --> H[Scalar Quant: 750GB]
G --> I[PQ 32x: 96GB]
G --> J[PQ 64x: 48GB]
K[Accuracy Impact] --> L[Scalar: ~99%]
K --> M[PQ 32x: ~95%]
K --> N[PQ 64x: ~90%]
K --> O[Binary: ~75%]
style A fill:#e1f5ff
style E fill:#ffe1e1
style K fill:#e1ffe1
What’s Next
This part explored the architectural foundations of vector databases: native versus multimodel approaches, the algorithms that power billion-scale search, and the storage techniques that make it practical. You now understand how HNSW achieves logarithmic search complexity, how IVF-PQ compresses vectors by 64x while maintaining accuracy, and how hybrid search combines the strengths of multiple retrieval methods.
Part 3 will evaluate the major players in the vector database landscape. We will compare Pinecone, Milvus, Weaviate, Qdrant, Chroma, and pgvector across performance benchmarks, developer experience, and production readiness. You will learn when to choose each solution and how Azure-specific options like SQL Server 2025 and Cosmos DB fit into the ecosystem.
The technical foundation is complete. Now we move to practical selection and implementation.
References
- Wikipedia – Hierarchical navigable small world
- Pinecone – Hierarchical Navigable Small Worlds (HNSW)
- Zilliz – Understanding Hierarchical Navigable Small Worlds (HNSW)
- arXiv – Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
- Milvus – Understand HNSW for Vector Search
- Redis – How hierarchical navigable small world (HNSW) algorithms can improve search
- Medium – Inverted File Product Quantization (IVF_PQ)
- Towards Data Science – Similarity Search with IVFPQ
- Milvus Documentation – IVF_PQ
- NVIDIA – Accelerating Vector Search: NVIDIA cuVS IVF-PQ Part 1
- Chris McCormick – Product Quantizers for k-NN Tutorial Part 2
- LanceDB – Understanding LanceDB’s IVF-PQ index
