Parts 1 through 7 built each layer of the memory system individually: episodic storage in PostgreSQL with pgvector, semantic knowledge in Qdrant, procedural learning in C#, consolidation workers in Node.js, multi-agent shared memory with Redis, and a security layer with tenant isolation, PII scrubbing, and audit logging. This final part assembles all of those layers into a single coherent production system. It covers the full reference architecture, infrastructure configuration, monitoring strategy, cost model, and the decision framework for applying memory correctly to real agent problems.
## Complete Reference Architecture
The full system has four runtime tiers: the agent tier (LLM-backed agents making memory calls), the memory API tier (the Node.js and Python services from previous parts), the storage tier (PostgreSQL, Qdrant, Redis), and the background tier (consolidation workers and audit verification jobs).
```mermaid
flowchart TD
    subgraph Agents["Agent Tier"]
        AG1["Agent Instance 1"]
        AG2["Agent Instance 2"]
        AG3["Agent Instance N"]
    end
    subgraph MemAPI["Memory API Tier"]
        NJS["Node.js Memory Service\nEpisodic + Procedural + Security"]
        PYS["Python Memory Service\nSemantic + FastAPI"]
        CW["Consolidation Workers\nNode.js + Redis Streams"]
    end
    subgraph Storage["Storage Tier"]
        PG[("PostgreSQL 16\nEpisodic, Procedural\nAccess grants, Audit log")]
        QD[("Qdrant\nSemantic memories")]
        RD[("Redis 8\nEvent cache, Pub/Sub\nConsolidation queue")]
    end
    subgraph Background["Background Tier"]
        CRON["Consolidation scheduler\nevery 6h"]
        PRUNE["Pruning job\nnightly"]
        AUDIT["Audit chain verifier\nnightly"]
    end
    AG1 & AG2 & AG3 --> NJS & PYS
    NJS --> PG & RD
    PYS --> QD
    CW --> PG & PYS
    RD --> CW
    CRON & PRUNE & AUDIT --> PG & RD
    style Agents fill:#1e3a5f,color:#fff
    style MemAPI fill:#166534,color:#fff
    style Storage fill:#713f12,color:#fff
    style Background fill:#3b0764,color:#fff
```
## Infrastructure Configuration
The following Docker Compose configuration gives you a complete local development environment matching the production topology. Each service is sized for a small production workload and can be scaled independently.
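The compose file below reads its secrets from the environment. A minimal `.env` file next to it might look like this sketch (all values are placeholders, not real credentials; generate proper secrets for anything shared):

```shell
# .env - placeholder values only; docker compose loads this file
# automatically when it sits in the project directory
POSTGRES_PASSWORD=change-me-postgres
QDRANT_API_KEY=change-me-qdrant
REDIS_PASSWORD=change-me-redis
OPENAI_API_KEY=sk-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key
```

With that in place, `docker compose up -d` brings up the full stack, and the health checks gate the API services until the stores are ready.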
```yaml
# docker-compose.yml
version: "3.9"

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: agent_memory
      POSTGRES_USER: agent_memory_app
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./sql/init.sql:/docker-entrypoint-initdb.d/init.sql
    command: >
      postgres
      -c shared_buffers=256MB
      -c work_mem=32MB
      -c max_connections=100
      -c effective_cache_size=1GB
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U agent_memory_app -d agent_memory"]
      interval: 10s
      timeout: 5s
      retries: 5

  qdrant:
    image: qdrant/qdrant:v1.9.0
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      QDRANT__SERVICE__API_KEY: ${QDRANT_API_KEY}
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:8-alpine
    command: >
      redis-server
      --requirepass ${REDIS_PASSWORD}
      --maxmemory 512mb
      --maxmemory-policy allkeys-lru
      --appendonly yes
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  memory-api-node:
    build: ./services/memory-node
    environment:
      DATABASE_URL: postgres://agent_memory_app:${POSTGRES_PASSWORD}@postgres:5432/agent_memory
      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      SEMANTIC_MEMORY_API_URL: http://memory-api-python:8001
    ports:
      - "3001:3001"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  memory-api-python:
    build: ./services/memory-python
    environment:
      QDRANT_URL: http://qdrant:6333
      QDRANT_API_KEY: ${QDRANT_API_KEY}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
    ports:
      - "8001:8001"
    depends_on:
      qdrant:
        condition: service_healthy

  consolidation-worker:
    build: ./services/memory-node
    command: node memory-worker.js
    environment:
      DATABASE_URL: postgres://agent_memory_app:${POSTGRES_PASSWORD}@postgres:5432/agent_memory
      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      SEMANTIC_MEMORY_API_URL: http://memory-api-python:8001
    deploy:
      replicas: 2
    depends_on:
      - memory-api-python
      - redis
      - postgres

volumes:
  postgres_data:
  qdrant_data:
  redis_data:
```

## Unified Memory Client
The agent-facing interface should be a single unified client that routes calls to the correct underlying service. This shields agent code from the internal topology and lets you swap out individual stores without touching agent logic.
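The routing idea is easiest to see in miniature. This sketch uses a hypothetical in-memory store standing in for the real services, purely to show why swapping a backing store never touches agent code:

```javascript
// Miniature of the facade pattern: agents depend only on the facade,
// and each store behind it can be replaced independently.
class InMemorySemanticStore {
  constructor() { this.facts = []; }
  // Trivial stand-in for vector search: substring match
  async retrieve(query) { return this.facts.filter((f) => f.includes(query)); }
  async write(fact) { this.facts.push(fact); }
}

class MemoryFacade {
  constructor({ semantic }) { this.semantic = semantic; }
  async remember(fact) { await this.semantic.write(fact); }
  async recall(query) { return this.semantic.retrieve(query); }
}

// Agent code never sees which store is plugged in
async function demo() {
  const memory = new MemoryFacade({ semantic: new InMemorySemanticStore() });
  await memory.remember('prefers metric units');
  return memory.recall('metric');
}
```

Replacing `InMemorySemanticStore` with a Qdrant-backed client changes one constructor argument and nothing else, which is exactly the property the full `AgentMemory` class below relies on.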
```javascript
// unified-memory-client.js
// Single interface for agents - wraps all memory types behind one API
import { SecureMemoryWriter } from './secure-memory-writer.js';
import { EpisodicMemoryClient } from './episodic-memory.js';
import { enqueueConsolidation } from './consolidation-enqueuer.js';
import { ProceduralMemoryHttpClient } from './procedural-http-client.js';
import { embed } from './embeddings.js'; // embedding helper used below (module path assumed from earlier parts)

const SEMANTIC_API = process.env.SEMANTIC_MEMORY_API_URL || 'http://localhost:8001';

export class AgentMemory {
  constructor({ tenantId, userId, agentId, agentType = 'general' }) {
    this.tenantId = tenantId;
    this.userId = userId;
    this.agentId = agentId;
    this.agentType = agentType;
    this.sessionId = null;
    this._writer = new SecureMemoryWriter({ tenantId, agentId, agentType, userId });
    this._episodic = new EpisodicMemoryClient({ tenantId, userId, agentId });
    this._procedural = new ProceduralMemoryHttpClient({ tenantId, agentId });
  }

  // --- Session lifecycle ---
  async startSession(metadata = {}) {
    this.sessionId = await this._episodic.startSession(metadata);
    return this.sessionId;
  }

  async endSession() {
    await this._episodic.endSession();
    // Enqueue consolidation to run in background
    if (this.sessionId) {
      await enqueueConsolidation(this.tenantId, this.userId, this.sessionId);
    }
    this.sessionId = null;
  }

  // --- Writes ---
  async writeConversationTurn(role, content, importance = 0.5) {
    return this._writer.writeEpisodicEvent({
      sessionId: this.sessionId,
      eventType: 'conversation_turn',
      content,
      role,
      importance,
      embeddingFn: embed,
    });
  }

  async writeObservation(content, importance = 0.8) {
    return this._writer.writeEpisodicEvent({
      sessionId: this.sessionId,
      eventType: 'observation',
      content,
      importance,
      embeddingFn: embed,
    });
  }

  async writeProcedure({ situationDescription, actionSequence, outcome, outcomeType, successScore, tags }) {
    return this._procedural.record({
      situationDescription, actionSequence, outcome,
      outcomeType, successScore, tags,
      sourceSessionId: this.sessionId,
    });
  }

  // --- Reads ---
  async buildContext(currentQuery) {
    const [episodic, semantic, procedural] = await Promise.all([
      this._episodic.retrieveForContext({ query: currentQuery }),
      this._retrieveSemantic(currentQuery),
      this._procedural.retrieveRelevant(currentQuery, 3),
    ]);
    return this._assembleContext({ episodic, semantic, procedural });
  }

  async _retrieveSemantic(query) {
    const res = await fetch(
      `${SEMANTIC_API}/retrieve?` + new URLSearchParams({
        tenant_id: this.tenantId,
        user_id: this.userId,
        query,
        top_k: 12,
      })
    );
    const data = await res.json();
    return data.results || [];
  }

  _assembleContext({ episodic, semantic, procedural }) {
    const parts = [];
    if (semantic.length) {
      parts.push('--- Known facts about this user and environment ---');
      for (const f of semantic) parts.push(`- ${f.fact}`);
      parts.push('');
    }
    if (procedural.length) {
      parts.push('--- Relevant past procedures ---');
      for (const p of procedural) {
        parts.push(`Situation: ${p.situationDescription}`);
        for (const step of p.actionSequence)
          parts.push(`  ${step.step}. ${step.tool}: ${step.outputSummary}`);
        parts.push(`Outcome: ${p.outcome}`);
        parts.push('');
      }
    }
    if (episodic.length) {
      parts.push('--- Recent conversation history ---');
      for (const e of episodic.slice(0, 10))
        parts.push(`[${e.role || e.event_type}]: ${e.content}`);
      parts.push('');
    }
    return parts.join('\n');
  }
}
```

## Monitoring: The Four Metrics That Matter
A memory system without observability is a black box. These four metrics tell you whether the system is healthy and delivering value.
```mermaid
flowchart LR
    subgraph Metrics["Key Production Metrics"]
        M1["Retrieval latency\np50 / p95 per memory type\nAlert if p95 > 300ms"]
        M2["Context hit rate\n% of sessions where memory\nchanged the agent response\nTarget: > 40%"]
        M3["Consolidation lag\nhours since last consolidation\nper active user\nAlert if > 24h"]
        M4["Semantic store growth\nfacts per user per week\nAlert if > 500/week (runaway extraction)"]
    end
    style Metrics fill:#1e3a5f,color:#fff
```
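Of the four, context hit rate is the only one you cannot read directly off the infrastructure: you have to record, per session, whether retrieved memory actually influenced the response (for example, whether any retrieved item was cited in the final answer). A sketch of the roll-up, assuming such per-session flags are being logged (field names here are illustrative):

```javascript
// Context hit rate: fraction of retrieval sessions where memory
// actually changed the agent's response. Sessions with no retrieval
// are excluded so cold starts don't drag the rate down.
function contextHitRate(sessions) {
  const withRetrieval = sessions.filter((s) => s.memoriesRetrieved > 0);
  if (withRetrieval.length === 0) return 0;
  const hits = withRetrieval.filter((s) => s.memoryInfluencedResponse);
  return hits.length / withRetrieval.length;
}

// Example: 3 of 4 retrieval sessions benefited -> 0.75, above the 40% target
const rate = contextHitRate([
  { memoriesRetrieved: 5, memoryInfluencedResponse: true },
  { memoriesRetrieved: 2, memoryInfluencedResponse: false },
  { memoriesRetrieved: 8, memoryInfluencedResponse: true },
  { memoriesRetrieved: 1, memoryInfluencedResponse: true },
  { memoriesRetrieved: 0, memoryInfluencedResponse: false }, // no retrieval, excluded
]);
```

A persistently low hit rate usually means retrieval is returning irrelevant memories, which is a relevance-tuning problem, not a latency one.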
```javascript
// memory-metrics.js - Prometheus metrics for the memory layer
import { Registry, Histogram, Gauge, Counter } from 'prom-client';

// Exported so the HTTP layer can serve register.metrics() at /metrics
export const register = new Registry();

export const retrievalLatency = new Histogram({
  name: 'agent_memory_retrieval_duration_seconds',
  help: 'Time to retrieve memories per type',
  labelNames: ['memory_type', 'tenant_id'],
  buckets: [0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0],
  registers: [register],
});

export const consolidationLag = new Gauge({
  name: 'agent_memory_consolidation_lag_hours',
  help: 'Hours since last consolidation per user',
  labelNames: ['tenant_id'],
  registers: [register],
});

export const memoryWriteTotal = new Counter({
  name: 'agent_memory_writes_total',
  help: 'Total memory writes by type and outcome',
  labelNames: ['memory_type', 'outcome', 'tenant_id'],
  registers: [register],
});

export const semanticStoreSize = new Gauge({
  name: 'agent_memory_semantic_facts_total',
  help: 'Total semantic facts per tenant',
  labelNames: ['tenant_id'],
  registers: [register],
});

export const piiRedactionsTotal = new Counter({
  name: 'agent_memory_pii_redactions_total',
  help: 'Total PII redactions in memory write path',
  labelNames: ['tenant_id', 'pii_type'],
  registers: [register],
});

// Wrap retrievals with timing
export async function timedRetrieval(memoryType, tenantId, fn) {
  const end = retrievalLatency.startTimer({ memory_type: memoryType, tenant_id: tenantId });
  try {
    return await fn();
  } finally {
    end(); // record duration on success and failure alike
  }
}
```

## Cost Model
Running this system at scale has three cost components: compute, storage, and LLM API calls. The LLM costs are the ones that surprise teams most because the memory layer makes LLM calls you do not see at the agent level.
| Cost component | Driver | Approximate cost (1,000 active users) | Optimisation lever |
|---|---|---|---|
| Embedding generation | Every episodic write + every retrieval query | ~$15/day at text-embedding-3-small pricing | Cache embeddings for repeated queries |
| Fact extraction (consolidation) | claude-haiku per 30-event batch, every 6h | ~$8/day | Increase batch size, reduce frequency for low-activity users |
| PII LLM scrubbing | claude-haiku on high-sensitivity content only | ~$3/day | Restrict to regulated tenants, rely on patterns for others |
| PostgreSQL | Storage + compute for episodic + procedural | ~$60/month (db.t3.medium RDS) | Aggressive pruning, archive to S3 Parquet after 90 days |
| Qdrant | Vector storage for semantic memories | ~$40/month (self-hosted, 2 CPU / 4GB) | Set collection quantisation, reduce vector dimensions for low-precision use cases |
| Redis | Cache and queue | ~$25/month (cache.t3.micro ElastiCache) | Low – Redis is small relative to other stores |
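Putting the table's rough figures together shows where the money actually goes. A back-of-envelope calculation using those approximations (a 30-day month assumed; your numbers will differ):

```javascript
// Back-of-envelope monthly cost for ~1,000 active users,
// using the approximate figures from the cost table above.
const llmPerDay = 15 + 8 + 3;       // embeddings + fact extraction + PII scrubbing
const infraPerMonth = 60 + 40 + 25; // PostgreSQL + Qdrant + Redis
const totalPerMonth = llmPerDay * 30 + infraPerMonth;
const perUserPerMonth = totalPerMonth / 1000;
```

The LLM line items come to roughly six times the infrastructure bill, which is why every optimisation lever in the table targets call volume (batching, caching, reduced frequency) rather than server sizing.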
## Decision Framework: Which Memory Type to Use
The most common implementation mistake is using the wrong memory type for a given piece of information. This framework maps information types to the correct store.
```mermaid
flowchart TD
    Q["New piece of information\nto store in agent memory"]
    Q --> A{"Is it tied to\na specific event\nor timestamp?"}
    A -- yes --> B{"Will you need\nit verbatim or\njust the gist?"}
    B -- verbatim --> EP["Episodic memory\nPostgreSQL + pgvector\nPart 2"]
    B -- gist --> SM["Semantic memory\nQdrant\nPart 3"]
    A -- no --> C{"Is it a fact\nabout a person,\nsystem, or domain?"}
    C -- yes --> SM
    C -- no --> D{"Is it a sequence\nof steps or\na learned pattern?"}
    D -- yes --> PR["Procedural memory\nPostgreSQL\nPart 4"]
    D -- no --> EP
    style EP fill:#1e3a5f,color:#fff
    style SM fill:#166534,color:#fff
    style PR fill:#713f12,color:#fff
```
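The decision tree is mechanical enough to encode directly, which is useful if you want write paths to enforce it rather than trust each caller. A sketch (the property names are illustrative, not part of the system's API):

```javascript
// Route a piece of information to the correct store,
// following the decision tree above.
function chooseMemoryStore({ tiedToEvent, needVerbatim, isFact, isSequence }) {
  if (tiedToEvent) {
    // Event-bound: keep verbatim records episodic, gists semantic
    return needVerbatim ? 'episodic' : 'semantic';
  }
  if (isFact) return 'semantic';       // facts about people, systems, domains
  if (isSequence) return 'procedural'; // learned step sequences
  return 'episodic';                   // default: keep the raw record
}
```

For example, "user renewed their contract on March 3rd" is event-bound and needed verbatim, so it routes to episodic; "user prefers concise answers" is a timeless fact and routes to semantic.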
## Production Readiness Checklist
| Area | Requirement | Covered in |
|---|---|---|
| Storage | PostgreSQL RLS enabled on all memory tables | Part 7 |
| Storage | HNSW indexes on all embedding columns | Parts 2, 4 |
| Storage | Qdrant payload indexes on tenant_id and user_id | Part 3 |
| Privacy | PII scrubbing in write path | Part 7 |
| Privacy | Right to erasure endpoint implemented and tested | Part 7 |
| Security | Access control grants table with deny-by-default | Part 7 |
| Security | Tamper-evident audit log with hash chain | Part 7 |
| Operations | Consolidation workers running with XAUTOCLAIM recovery | Part 5 |
| Operations | Nightly pruning job active | Parts 2, 4 |
| Operations | Audit chain verification job active | Part 7 |
| Monitoring | Retrieval latency p50/p95 dashboards | Part 8 |
| Monitoring | Consolidation lag alert configured | Part 8 |
| Resilience | Memory writes are async and non-blocking to agent response | Part 2 |
| Resilience | Consolidation jobs are durable via Redis Streams | Part 5 |
## Series Recap
Across eight parts, this series built a complete production agent memory system from first principles. Part 1 established why stateless agents fail in enterprise and introduced the three memory types. Parts 2 through 4 built each type in a different language: episodic memory in Node.js with PostgreSQL and pgvector, semantic memory in Python with Qdrant, and procedural memory in C# with PostgreSQL. Part 5 added the consolidation worker that compresses episodic history into semantic knowledge on a rolling schedule. Part 6 extended the architecture to multi-agent scenarios with Redis-backed shared memory and workspace coordination. Part 7 added the enterprise security layer: row-level tenant isolation, PII scrubbing, RBAC, and tamper-evident audit logging. This final part assembled everything into a deployable system with infrastructure configuration, unified client, monitoring, cost model, and operational guidance.
The result is an agent that genuinely remembers: not just the last few messages, but what it learned about a user over months, the approaches that worked on similar problems, and the constraints it must respect. That is the foundation of an agent that gets better over time rather than starting from zero at every session boundary.
## References
- Anthropic – “Claude Models Overview” (https://docs.anthropic.com/en/docs/about-claude/models/overview)
- pgvector – “Open-Source Vector Similarity Search for PostgreSQL” (https://github.com/pgvector/pgvector)
- Qdrant – “Vector Database Documentation” (https://qdrant.tech/documentation/)
- Redis – “Redis 8 Documentation” (https://redis.io/docs/latest/)
- arXiv – “Generative Agents: Interactive Simulacra of Human Behavior” (https://arxiv.org/abs/2304.03442)
