Parts 1 through 7 built each layer of the memory system individually: episodic storage in PostgreSQL with pgvector, semantic knowledge in Qdrant, procedural learning in C#, consolidation workers in Node.js, multi-agent shared memory with Redis, and a security layer with tenant isolation, PII scrubbing, and audit logging. This final part assembles all of those layers into a single coherent production system. It covers the full reference architecture, infrastructure configuration, monitoring strategy, cost model, and the decision framework for applying memory correctly to real agent problems.
## Complete Reference Architecture
The full system has four runtime tiers: the agent tier (LLM-backed agents making memory calls), the memory API tier (the Node.js and Python services from previous parts), the storage tier (PostgreSQL, Qdrant, Redis), and the background tier (consolidation workers and audit verification jobs).
```mermaid
flowchart TD
    subgraph Agents["Agent Tier"]
        AG1["Agent Instance 1"]
        AG2["Agent Instance 2"]
        AG3["Agent Instance N"]
    end
    subgraph MemAPI["Memory API Tier"]
        NJS["Node.js Memory Service\nEpisodic + Procedural + Security"]
        PYS["Python Memory Service\nSemantic + FastAPI"]
        CW["Consolidation Workers\nNode.js + Redis Streams"]
    end
    subgraph Storage["Storage Tier"]
        PG[("PostgreSQL 16\nEpisodic, Procedural\nAccess grants, Audit log")]
        QD[("Qdrant\nSemantic memories")]
        RD[("Redis 8\nEvent cache, Pub/Sub\nConsolidation queue")]
    end
    subgraph Background["Background Tier"]
        CRON["Consolidation scheduler\nevery 6h"]
        PRUNE["Pruning job\nnightly"]
        AUDIT["Audit chain verifier\nnightly"]
    end
    AG1 & AG2 & AG3 --> NJS & PYS
    NJS --> PG & RD
    PYS --> QD
    CW --> PG & PYS
    RD --> CW
    CRON & PRUNE & AUDIT --> PG & RD
    style Agents fill:#1e3a5f,color:#fff
    style MemAPI fill:#166534,color:#fff
    style Storage fill:#713f12,color:#fff
    style Background fill:#3b0764,color:#fff
```
## Infrastructure Configuration
The following Docker Compose configuration gives you a complete local development environment matching the production topology. Each service is sized for a small production workload and can be scaled independently.
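The compose file below reads its secrets from the environment. A minimal `.env` file next to it might look like this sketch (all values are placeholders, not real credentials; generate proper secrets for anything shared):

```shell
# .env - placeholder values only; docker compose loads this file
# automatically when it sits in the project directory
POSTGRES_PASSWORD=change-me-postgres
QDRANT_API_KEY=change-me-qdrant
REDIS_PASSWORD=change-me-redis
OPENAI_API_KEY=sk-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key
```

With that in place, `docker compose up -d` brings up the full stack, and the health checks gate the API services until the stores are ready.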
```yaml
# docker-compose.yml
version: "3.9"

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: agent_memory
      POSTGRES_USER: agent_memory_app
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./sql/init.sql:/docker-entrypoint-initdb.d/init.sql
    command: >
      postgres
      -c shared_buffers=256MB
      -c work_mem=32MB
      -c max_connections=100
      -c effective_cache_size=1GB
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U agent_memory_app -d agent_memory"]
      interval: 10s
      timeout: 5s
      retries: 5

  qdrant:
    image: qdrant/qdrant:v1.9.0
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      QDRANT__SERVICE__API_KEY: ${QDRANT_API_KEY}
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:8-alpine
    command: >
      redis-server
      --requirepass ${REDIS_PASSWORD}
      --maxmemory 512mb
      --maxmemory-policy allkeys-lru
      --appendonly yes
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  memory-api-node:
    build: ./services/memory-node
    environment:
      DATABASE_URL: postgres://agent_memory_app:${POSTGRES_PASSWORD}@postgres:5432/agent_memory
      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      SEMANTIC_MEMORY_API_URL: http://memory-api-python:8001
    ports:
      - "3001:3001"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  memory-api-python:
    build: ./services/memory-python
    environment:
      QDRANT_URL: http://qdrant:6333
      QDRANT_API_KEY: ${QDRANT_API_KEY}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
    ports:
      - "8001:8001"
    depends_on:
      qdrant:
        condition: service_healthy

  consolidation-worker:
    build: ./services/memory-node
    command: node memory-worker.js
    environment:
      DATABASE_URL: postgres://agent_memory_app:${POSTGRES_PASSWORD}@postgres:5432/agent_memory
      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      SEMANTIC_MEMORY_API_URL: http://memory-api-python:8001
    deploy:
      replicas: 2
    depends_on:
      - memory-api-python
      - redis
      - postgres

volumes:
  postgres_data:
  qdrant_data:
  redis_data:
```

## Unified Memory Client
The agent-facing interface should be a single unified client that routes calls to the correct underlying service. This shields agent code from the internal topology and lets you swap out individual stores without touching agent logic.
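The routing idea is easiest to see in miniature. This sketch uses a hypothetical in-memory store standing in for the real services, purely to show why swapping a backing store never touches agent code:

```javascript
// Miniature of the facade pattern: agents depend only on the facade,
// and each store behind it can be replaced independently.
class InMemorySemanticStore {
  constructor() { this.facts = []; }
  // Trivial stand-in for vector search: substring match
  async retrieve(query) { return this.facts.filter((f) => f.includes(query)); }
  async write(fact) { this.facts.push(fact); }
}

class MemoryFacade {
  constructor({ semantic }) { this.semantic = semantic; }
  async remember(fact) { await this.semantic.write(fact); }
  async recall(query) { return this.semantic.retrieve(query); }
}

// Agent code never sees which store is plugged in
async function demo() {
  const memory = new MemoryFacade({ semantic: new InMemorySemanticStore() });
  await memory.remember('prefers metric units');
  return memory.recall('metric');
}
```

Replacing `InMemorySemanticStore` with a Qdrant-backed client changes one constructor argument and nothing else, which is exactly the property the full `AgentMemory` class below relies on.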
```javascript
// unified-memory-client.js
// Single interface for agents - wraps all memory types behind one API
import { SecureMemoryWriter } from './secure-memory-writer.js';
import { EpisodicMemoryClient } from './episodic-memory.js';
import { enqueueConsolidation } from './consolidation-enqueuer.js';
import { ProceduralMemoryHttpClient } from './procedural-http-client.js';
import { embed } from './embeddings.js'; // embedding helper used below (module path assumed from earlier parts)

const SEMANTIC_API = process.env.SEMANTIC_MEMORY_API_URL || 'http://localhost:8001';

export class AgentMemory {
  constructor({ tenantId, userId, agentId, agentType = 'general' }) {
    this.tenantId = tenantId;
    this.userId = userId;
    this.agentId = agentId;
    this.agentType = agentType;
    this.sessionId = null;
    this._writer = new SecureMemoryWriter({ tenantId, agentId, agentType, userId });
    this._episodic = new EpisodicMemoryClient({ tenantId, userId, agentId });
    this._procedural = new ProceduralMemoryHttpClient({ tenantId, agentId });
  }

  // --- Session lifecycle ---
  async startSession(metadata = {}) {
    this.sessionId = await this._episodic.startSession(metadata);
    return this.sessionId;
  }

  async endSession() {
    await this._episodic.endSession();
    // Enqueue consolidation to run in background
    if (this.sessionId) {
      await enqueueConsolidation(this.tenantId, this.userId, this.sessionId);
    }
    this.sessionId = null;
  }

  // --- Writes ---
  async writeConversationTurn(role, content, importance = 0.5) {
    return this._writer.writeEpisodicEvent({
      sessionId: this.sessionId,
      eventType: 'conversation_turn',
      content,
      role,
      importance,
      embeddingFn: embed,
    });
  }

  async writeObservation(content, importance = 0.8) {
    return this._writer.writeEpisodicEvent({
      sessionId: this.sessionId,
      eventType: 'observation',
      content,
      importance,
      embeddingFn: embed,
    });
  }

  async writeProcedure({ situationDescription, actionSequence, outcome, outcomeType, successScore, tags }) {
    return this._procedural.record({
      situationDescription, actionSequence, outcome,
      outcomeType, successScore, tags,
      sourceSessionId: this.sessionId,
    });
  }

  // --- Reads ---
  async buildContext(currentQuery) {
    const [episodic, semantic, procedural] = await Promise.all([
      this._episodic.retrieveForContext({ query: currentQuery }),
      this._retrieveSemantic(currentQuery),
      this._procedural.retrieveRelevant(currentQuery, 3),
    ]);
    return this._assembleContext({ episodic, semantic, procedural });
  }

  async _retrieveSemantic(query) {
    const res = await fetch(
      `${SEMANTIC_API}/retrieve?` + new URLSearchParams({
        tenant_id: this.tenantId,
        user_id: this.userId,
        query,
        top_k: 12,
      })
    );
    const data = await res.json();
    return data.results || [];
  }

  _assembleContext({ episodic, semantic, procedural }) {
    const parts = [];
    if (semantic.length) {
      parts.push('--- Known facts about this user and environment ---');
      for (const f of semantic) parts.push(`- ${f.fact}`);
      parts.push('');
    }
    if (procedural.length) {
      parts.push('--- Relevant past procedures ---');
      for (const p of procedural) {
        parts.push(`Situation: ${p.situationDescription}`);
        for (const step of p.actionSequence)
          parts.push(`  ${step.step}. ${step.tool}: ${step.outputSummary}`);
        parts.push(`Outcome: ${p.outcome}`);
        parts.push('');
      }
    }
    if (episodic.length) {
      parts.push('--- Recent conversation history ---');
      for (const e of episodic.slice(0, 10))
        parts.push(`[${e.role || e.event_type}]: ${e.content}`);
      parts.push('');
    }
    return parts.join('\n');
  }
}
```

## Monitoring: The Four Metrics That Matter
A memory system without observability is a black box. These four metrics tell you whether the system is healthy and delivering value.
```mermaid
flowchart LR
    subgraph Metrics["Key Production Metrics"]
        M1["Retrieval latency\np50 / p95 per memory type\nAlert if p95 > 300ms"]
        M2["Context hit rate\n% of sessions where memory\nchanged the agent response\nTarget: > 40%"]
        M3["Consolidation lag\nhours since last consolidation\nper active user\nAlert if > 24h"]
        M4["Semantic store growth\nfacts per user per week\nAlert if > 500/week (runaway extraction)"]
    end
    style Metrics fill:#1e3a5f,color:#fff
```
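Of the four, context hit rate is the only one you cannot read directly off the infrastructure: you have to record, per session, whether retrieved memory actually influenced the response (for example, whether any retrieved item was cited in the final answer). A sketch of the roll-up, assuming such per-session flags are being logged (field names here are illustrative):

```javascript
// Context hit rate: fraction of retrieval sessions where memory
// actually changed the agent's response. Sessions with no retrieval
// are excluded so cold starts don't drag the rate down.
function contextHitRate(sessions) {
  const withRetrieval = sessions.filter((s) => s.memoriesRetrieved > 0);
  if (withRetrieval.length === 0) return 0;
  const hits = withRetrieval.filter((s) => s.memoryInfluencedResponse);
  return hits.length / withRetrieval.length;
}

// Example: 3 of 4 retrieval sessions benefited -> 0.75, above the 40% target
const rate = contextHitRate([
  { memoriesRetrieved: 5, memoryInfluencedResponse: true },
  { memoriesRetrieved: 2, memoryInfluencedResponse: false },
  { memoriesRetrieved: 8, memoryInfluencedResponse: true },
  { memoriesRetrieved: 1, memoryInfluencedResponse: true },
  { memoriesRetrieved: 0, memoryInfluencedResponse: false }, // no retrieval, excluded
]);
```

A persistently low hit rate usually means retrieval is returning irrelevant memories, which is a relevance-tuning problem, not a latency one.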
```javascript
// memory-metrics.js - Prometheus metrics for the memory layer
import { Registry, Histogram, Gauge, Counter } from 'prom-client';

// Exported so the HTTP layer can serve register.metrics() at /metrics
export const register = new Registry();

export const retrievalLatency = new Histogram({
  name: 'agent_memory_retrieval_duration_seconds',
  help: 'Time to retrieve memories per type',
  labelNames: ['memory_type', 'tenant_id'],
  buckets: [0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0],
  registers: [register],
});

export const consolidationLag = new Gauge({
  name: 'agent_memory_consolidation_lag_hours',
  help: 'Hours since last consolidation per user',
  labelNames: ['tenant_id'],
  registers: [register],
});

export const memoryWriteTotal = new Counter({
  name: 'agent_memory_writes_total',
  help: 'Total memory writes by type and outcome',
  labelNames: ['memory_type', 'outcome', 'tenant_id'],
  registers: [register],
});

export const semanticStoreSize = new Gauge({
  name: 'agent_memory_semantic_facts_total',
  help: 'Total semantic facts per tenant',
  labelNames: ['tenant_id'],
  registers: [register],
});

export const piiRedactionsTotal = new Counter({
  name: 'agent_memory_pii_redactions_total',
  help: 'Total PII redactions in memory write path',
  labelNames: ['tenant_id', 'pii_type'],
  registers: [register],
});

// Wrap retrievals with timing
export async function timedRetrieval(memoryType, tenantId, fn) {
  const end = retrievalLatency.startTimer({ memory_type: memoryType, tenant_id: tenantId });
  try {
    return await fn();
  } finally {
    end(); // record duration on success and failure alike
  }
}
```

## Cost Model
Running this system at scale has three cost components: compute, storage, and LLM API calls. The LLM costs are the ones that surprise teams most because the memory layer makes LLM calls you do not see at the agent level.
| Cost component | Driver | Approximate cost (1,000 active users) | Optimisation lever |
|---|---|---|---|
| Embedding generation | Every episodic write + every retrieval query | ~$15/day at text-embedding-3-small pricing | Cache embeddings for repeated queries |
| Fact extraction (consolidation) | claude-haiku per 30-event batch, every 6h | ~$8/day | Increase batch size, reduce frequency for low-activity users |
| PII LLM scrubbing | claude-haiku on high-sensitivity content only | ~$3/day | Restrict to regulated tenants, rely on patterns for others |
| PostgreSQL | Storage + compute for episodic + procedural | ~$60/month (db.t3.medium RDS) | Aggressive pruning, archive to S3 Parquet after 90 days |
| Qdrant | Vector storage for semantic memories | ~$40/month (self-hosted, 2 CPU / 4GB) | Set collection quantisation, reduce vector dimensions for low-precision use cases |
| Redis | Cache and queue | ~$25/month (cache.t3.micro ElastiCache) | Low – Redis is small relative to other stores |
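Putting the table's rough figures together shows where the money actually goes. A back-of-envelope calculation using those approximations (a 30-day month assumed; your numbers will differ):

```javascript
// Back-of-envelope monthly cost for ~1,000 active users,
// using the approximate figures from the cost table above.
const llmPerDay = 15 + 8 + 3;       // embeddings + fact extraction + PII scrubbing
const infraPerMonth = 60 + 40 + 25; // PostgreSQL + Qdrant + Redis
const totalPerMonth = llmPerDay * 30 + infraPerMonth;
const perUserPerMonth = totalPerMonth / 1000;
```

The LLM line items come to roughly six times the infrastructure bill, which is why every optimisation lever in the table targets call volume (batching, caching, reduced frequency) rather than server sizing.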
## Decision Framework: Which Memory Type to Use
The most common implementation mistake is using the wrong memory type for a given piece of information. This framework maps information types to the correct store.
```mermaid
flowchart TD
    Q["New piece of information\nto store in agent memory"]
    Q --> A{"Is it tied to\na specific event\nor timestamp?"}
    A -- yes --> B{"Will you need\nit verbatim or\njust the gist?"}
    B -- verbatim --> EP["Episodic memory\nPostgreSQL + pgvector\nPart 2"]
    B -- gist --> SM["Semantic memory\nQdrant\nPart 3"]
    A -- no --> C{"Is it a fact\nabout a person,\nsystem, or domain?"}
    C -- yes --> SM
    C -- no --> D{"Is it a sequence\nof steps or\na learned pattern?"}
    D -- yes --> PR["Procedural memory\nPostgreSQL\nPart 4"]
    D -- no --> EP
    style EP fill:#1e3a5f,color:#fff
    style SM fill:#166534,color:#fff
    style PR fill:#713f12,color:#fff
```
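The decision tree is mechanical enough to encode directly, which is useful if you want write paths to enforce it rather than trust each caller. A sketch (the property names are illustrative, not part of the system's API):

```javascript
// Route a piece of information to the correct store,
// following the decision tree above.
function chooseMemoryStore({ tiedToEvent, needVerbatim, isFact, isSequence }) {
  if (tiedToEvent) {
    // Event-bound: keep verbatim records episodic, gists semantic
    return needVerbatim ? 'episodic' : 'semantic';
  }
  if (isFact) return 'semantic';       // facts about people, systems, domains
  if (isSequence) return 'procedural'; // learned step sequences
  return 'episodic';                   // default: keep the raw record
}
```

For example, "user renewed their contract on March 3rd" is event-bound and needed verbatim, so it routes to episodic; "user prefers concise answers" is a timeless fact and routes to semantic.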
## Production Readiness Checklist
| Area | Requirement | Covered in |
|---|---|---|
| Storage | PostgreSQL RLS enabled on all memory tables | Part 7 |
| Storage | HNSW indexes on all embedding columns | Parts 2, 4 |
| Storage | Qdrant payload indexes on tenant_id and user_id | Part 3 |
| Privacy | PII scrubbing in write path | Part 7 |
| Privacy | Right to erasure endpoint implemented and tested | Part 7 |
| Security | Access control grants table with deny-by-default | Part 7 |
| Security | Tamper-evident audit log with hash chain | Part 7 |
| Operations | Consolidation workers running with XAUTOCLAIM recovery | Part 5 |
| Operations | Nightly pruning job active | Parts 2, 4 |
| Operations | Audit chain verification job active | Part 7 |
| Monitoring | Retrieval latency p50/p95 dashboards | Part 8 |
| Monitoring | Consolidation lag alert configured | Part 8 |
| Resilience | Memory writes are async and non-blocking to agent response | Part 2 |
| Resilience | Consolidation jobs are durable via Redis Streams | Part 5 |
## Series Recap
Across eight parts, this series built a complete production agent memory system from first principles. Part 1 established why stateless agents fail in enterprise and introduced the three memory types. Parts 2 through 4 built each type in a different language: episodic memory in Node.js with PostgreSQL and pgvector, semantic memory in Python with Qdrant, and procedural memory in C# with PostgreSQL. Part 5 added the consolidation worker that compresses episodic history into semantic knowledge on a rolling schedule. Part 6 extended the architecture to multi-agent scenarios with Redis-backed shared memory and workspace coordination. Part 7 added the enterprise security layer: row-level tenant isolation, PII scrubbing, RBAC, and tamper-evident audit logging. This final part assembled everything into a deployable system with infrastructure configuration, unified client, monitoring, cost model, and operational guidance.
The result is an agent that genuinely remembers: not just the last few messages, but what it learned about a user over months, the approaches that worked on similar problems, and the constraints it must respect. That is the foundation of an agent that gets better over time rather than starting from zero at every session boundary.
## References
- Anthropic – “Claude Models Overview” (https://docs.anthropic.com/en/docs/about-claude/models/overview)
- pgvector – “Open-Source Vector Similarity Search for PostgreSQL” (https://github.com/pgvector/pgvector)
- Qdrant – “Vector Database Documentation” (https://qdrant.tech/documentation/)
- Redis – “Redis 8 Documentation” (https://redis.io/docs/latest/)
- arXiv – “Generative Agents: Interactive Simulacra of Human Behavior” (https://arxiv.org/abs/2304.03442)
