AI Agents with Memory Part 8: Production Memory Architecture – Putting It All Together

Parts 2 through 7 built each layer of the memory system individually: episodic storage in PostgreSQL with pgvector, semantic knowledge in Qdrant, procedural learning in C#, consolidation workers in Node.js, multi-agent shared memory with Redis, and a security layer with tenant isolation, PII scrubbing, and audit logging. This final part assembles all of those layers into a single coherent production system. It covers the full reference architecture, infrastructure configuration, monitoring strategy, cost model, and the decision framework for applying memory correctly to real agent problems.

Complete Reference Architecture

The full system has four runtime tiers: the agent tier (LLM-backed agents making memory calls), the memory API tier (the Node.js and Python services from previous parts), the storage tier (PostgreSQL, Qdrant, Redis), and the background tier (consolidation workers and audit verification jobs).

flowchart TD
    subgraph Agents["Agent Tier"]
        AG1["Agent Instance 1"]
        AG2["Agent Instance 2"]
        AG3["Agent Instance N"]
    end

    subgraph MemAPI["Memory API Tier"]
        NJS["Node.js Memory Service\nEpisodic + Procedural + Security"]
        PYS["Python Memory Service\nSemantic + FastAPI"]
        CW["Consolidation Workers\nNode.js + Redis Streams"]
    end

    subgraph Storage["Storage Tier"]
        PG[("PostgreSQL 16\nEpisodic, Procedural\nAccess grants, Audit log")]
        QD[("Qdrant\nSemantic memories")]
        RD[("Redis 8\nEvent cache, Pub/Sub\nConsolidation queue")]
    end

    subgraph Background["Background Tier"]
        CRON["Consolidation scheduler\nevery 6h"]
        PRUNE["Pruning job\nnightly"]
        AUDIT["Audit chain verifier\nnightly"]
    end

    AG1 & AG2 & AG3 --> NJS & PYS
    NJS --> PG & RD
    PYS --> QD
    CW --> PG & PYS
    RD --> CW
    CRON & PRUNE & AUDIT --> PG & RD

    style Agents fill:#1e3a5f,color:#fff
    style MemAPI fill:#166534,color:#fff
    style Storage fill:#713f12,color:#fff
    style Background fill:#3b0764,color:#fff

Infrastructure Configuration

The following Docker Compose configuration gives you a complete local development environment matching the production topology. Each service is sized for a small production workload and can be scaled independently.

# docker-compose.yml
version: "3.9"

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: agent_memory
      POSTGRES_USER: agent_memory_app
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./sql/init.sql:/docker-entrypoint-initdb.d/init.sql
    command: >
      postgres
        -c shared_buffers=256MB
        -c work_mem=32MB
        -c max_connections=100
        -c effective_cache_size=1GB
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U agent_memory_app -d agent_memory"]
      interval: 10s
      timeout: 5s
      retries: 5

  qdrant:
    image: qdrant/qdrant:v1.9.0
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      QDRANT__SERVICE__API_KEY: ${QDRANT_API_KEY}
    healthcheck:
      # The Qdrant image ships without curl, so probe the HTTP port via bash's /dev/tcp
      test: ["CMD-SHELL", "bash -c ':> /dev/tcp/localhost/6333' || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:8-alpine
    command: >
      redis-server
        --requirepass ${REDIS_PASSWORD}
        --maxmemory 512mb
        --maxmemory-policy allkeys-lru
        --appendonly yes
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  memory-api-node:
    build: ./services/memory-node
    environment:
      DATABASE_URL: postgres://agent_memory_app:${POSTGRES_PASSWORD}@postgres:5432/agent_memory
      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      SEMANTIC_MEMORY_API_URL: http://memory-api-python:8001
    ports:
      - "3001:3001"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  memory-api-python:
    build: ./services/memory-python
    environment:
      QDRANT_URL: http://qdrant:6333
      QDRANT_API_KEY: ${QDRANT_API_KEY}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
    ports:
      - "8001:8001"
    depends_on:
      qdrant:
        condition: service_healthy

  consolidation-worker:
    build: ./services/memory-node
    command: node memory-worker.js
    environment:
      DATABASE_URL: postgres://agent_memory_app:${POSTGRES_PASSWORD}@postgres:5432/agent_memory
      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      SEMANTIC_MEMORY_API_URL: http://memory-api-python:8001
    deploy:
      replicas: 2
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      memory-api-python:
        condition: service_started

volumes:
  postgres_data:
  qdrant_data:
  redis_data:
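
Compose gates container startup on these health checks, but application code can still race ahead of them, for example when a worker process is restarted in isolation. A small retry gate in front of the first memory call covers that case. This is a sketch, not part of the services from earlier parts: `waitForHealthy` and its probe-function argument are hypothetical names.

```javascript
// startup-gate.js - hypothetical helper: block startup until a dependency probes healthy.
// `probe` is any async function that resolves true when the dependency is ready
// (e.g. an HTTP GET against /healthz, or a `SELECT 1` against Postgres).
async function waitForHealthy(probe, { retries = 5, delayMs = 1000 } = {}) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      if (await probe()) return attempt; // ready: report how many attempts it took
    } catch {
      // a throwing probe is treated the same as "not ready yet"
    }
    if (attempt < retries) await new Promise((r) => setTimeout(r, delayMs));
  }
  throw new Error(`dependency not healthy after ${retries} attempts`);
}
```

Because the probe is injected, the same gate works for PostgreSQL, Qdrant, and Redis without the helper knowing anything about their protocols.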

Unified Memory Client

The agent-facing interface should be a single unified client that routes calls to the correct underlying service. This shields agent code from the internal topology and lets you swap out individual stores without touching agent logic.

// unified-memory-client.js
// Single interface for agents - wraps all memory types behind one API

import { SecureMemoryWriter } from './secure-memory-writer.js';
import { EpisodicMemoryClient } from './episodic-memory.js';
import { enqueueConsolidation } from './consolidation-enqueuer.js';
import { ProceduralMemoryHttpClient } from './procedural-http-client.js';
import { embed } from './embeddings.js'; // embedding helper from Part 2 (used by the write paths below); adjust the path to your layout

const SEMANTIC_API = process.env.SEMANTIC_MEMORY_API_URL || 'http://localhost:8001';

export class AgentMemory {
  constructor({ tenantId, userId, agentId, agentType = 'general' }) {
    this.tenantId = tenantId;
    this.userId = userId;
    this.agentId = agentId;
    this.agentType = agentType;
    this.sessionId = null;

    this._writer = new SecureMemoryWriter({ tenantId, agentId, agentType, userId });
    this._episodic = new EpisodicMemoryClient({ tenantId, userId, agentId });
    this._procedural = new ProceduralMemoryHttpClient({ tenantId, agentId });
  }

  // --- Session lifecycle ---

  async startSession(metadata = {}) {
    this.sessionId = await this._episodic.startSession(metadata);
    return this.sessionId;
  }

  async endSession() {
    await this._episodic.endSession();
    // Enqueue consolidation to run in background
    if (this.sessionId) {
      await enqueueConsolidation(this.tenantId, this.userId, this.sessionId);
    }
    this.sessionId = null;
  }

  // --- Writes ---

  async writeConversationTurn(role, content, importance = 0.5) {
    return this._writer.writeEpisodicEvent({
      sessionId: this.sessionId,
      eventType: 'conversation_turn',
      content,
      role,
      importance,
      embeddingFn: embed,
    });
  }

  async writeObservation(content, importance = 0.8) {
    return this._writer.writeEpisodicEvent({
      sessionId: this.sessionId,
      eventType: 'observation',
      content,
      importance,
      embeddingFn: embed,
    });
  }

  async writeProcedure({ situationDescription, actionSequence, outcome, outcomeType, successScore, tags }) {
    return this._procedural.record({
      situationDescription, actionSequence, outcome,
      outcomeType, successScore, tags,
      sourceSessionId: this.sessionId,
    });
  }

  // --- Reads ---

  async buildContext(currentQuery) {
    const [episodic, semantic, procedural] = await Promise.all([
      this._episodic.retrieveForContext({ query: currentQuery }),
      this._retrieveSemantic(currentQuery),
      this._procedural.retrieveRelevant(currentQuery, 3),
    ]);

    return this._assembleContext({ episodic, semantic, procedural });
  }

  async _retrieveSemantic(query) {
    const res = await fetch(
      `${SEMANTIC_API}/retrieve?` + new URLSearchParams({
        tenant_id: this.tenantId,
        user_id: this.userId,
        query,
        top_k: 12,
      })
    );
    if (!res.ok) return []; // degrade gracefully: a semantic-store outage should not fail the agent turn
    const data = await res.json();
    return data.results || [];
  }

  _assembleContext({ episodic, semantic, procedural }) {
    const parts = [];

    if (semantic.length) {
      parts.push('--- Known facts about this user and environment ---');
      for (const f of semantic) parts.push(`- ${f.fact}`);
      parts.push('');
    }

    if (procedural.length) {
      parts.push('--- Relevant past procedures ---');
      for (const p of procedural) {
        parts.push(`Situation: ${p.situationDescription}`);
        for (const step of p.actionSequence)
          parts.push(`  ${step.step}. ${step.tool}: ${step.outputSummary}`);
        parts.push(`Outcome: ${p.outcome}`);
        parts.push('');
      }
    }

    if (episodic.length) {
      parts.push('--- Recent conversation history ---');
      for (const e of episodic.slice(0, 10))
        parts.push(`[${e.role || e.event_type}]: ${e.content}`);
      parts.push('');
    }

    return parts.join('\n');
  }
}
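
One cost lever worth applying inside this client: the write path calls an embedding function on every event, and the same query strings recur often. The following is a sketch of a memoising wrapper for any async embed function; `memoiseEmbeddings`, `rawEmbed`, and the size bound are hypothetical, and eviction here is a simple oldest-first bound rather than a true LRU.

```javascript
// Hypothetical sketch: memoise an embedding function so repeated texts skip the API call.
function memoiseEmbeddings(rawEmbed, maxEntries = 10_000) {
  const cache = new Map(); // Map preserves insertion order, so the oldest entry evicts first
  return async function embed(text) {
    if (cache.has(text)) return cache.get(text);
    const vector = await rawEmbed(text); // only cache after a successful call
    if (cache.size >= maxEntries) cache.delete(cache.keys().next().value);
    cache.set(text, vector);
    return vector;
  };
}
```

Wrapping the real embed helper once at startup means every `embeddingFn` reference in the client benefits without further changes.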

Monitoring: The Four Metrics That Matter

A memory system without observability is a black box. These four metrics tell you whether the system is healthy and delivering value.

flowchart LR
    subgraph Metrics["Key Production Metrics"]
        M1["Retrieval latency\np50 / p95 per memory type\nAlert if p95 > 300ms"]
        M2["Context hit rate\n% of sessions where memory\nchanged the agent response\nTarget: > 40%"]
        M3["Consolidation lag\nhours since last consolidation\nper active user\nAlert if > 24h"]
        M4["Semantic store growth\nfacts per user per week\nAlert if > 500/week (runaway extraction)"]
    end

    style Metrics fill:#1e3a5f,color:#fff

// memory-metrics.js - Prometheus metrics for the memory layer
import { Registry, Histogram, Gauge, Counter } from 'prom-client';

const register = new Registry();

export const retrievalLatency = new Histogram({
  name: 'agent_memory_retrieval_duration_seconds',
  help: 'Time to retrieve memories per type',
  labelNames: ['memory_type', 'tenant_id'],
  buckets: [0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0],
  registers: [register],
});

export const consolidationLag = new Gauge({
  name: 'agent_memory_consolidation_lag_hours',
  help: 'Hours since last consolidation per user',
  labelNames: ['tenant_id'],
  registers: [register],
});

export const memoryWriteTotal = new Counter({
  name: 'agent_memory_writes_total',
  help: 'Total memory writes by type and outcome',
  labelNames: ['memory_type', 'outcome', 'tenant_id'],
  registers: [register],
});

export const semanticStoreSize = new Gauge({
  name: 'agent_memory_semantic_facts_total',
  help: 'Total semantic facts per tenant',
  labelNames: ['tenant_id'],
  registers: [register],
});

export const piiRedactionsTotal = new Counter({
  name: 'agent_memory_pii_redactions_total',
  help: 'Total PII redactions in memory write path',
  labelNames: ['tenant_id', 'pii_type'],
  registers: [register],
});

// Wrap retrievals with timing
export async function timedRetrieval(memoryType, tenantId, fn) {
  const end = retrievalLatency.startTimer({ memory_type: memoryType, tenant_id: tenantId });
  try {
    return await fn();
  } finally {
    end(); // records the duration on success and failure alike
  }
}
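
Prometheus computes p50/p95 from the histogram buckets above at query time. For unit tests or a local smoke check, the same alert rule can be evaluated directly over raw samples. This is a sketch using the nearest-rank percentile method; the function names are hypothetical, and the 0.3 s threshold mirrors the "p95 > 300ms" alert from the diagram.

```javascript
// Sketch: compute a percentile from raw latency samples and apply the p95 alert rule.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
  return sorted[Math.max(0, rank - 1)];
}

function shouldAlertOnLatency(samplesSeconds, thresholdSeconds = 0.3) {
  return percentile(samplesSeconds, 95) > thresholdSeconds;
}
```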

Cost Model

Running this system at scale has three cost components: compute, storage, and LLM API calls. The LLM costs are the ones that surprise teams most because the memory layer makes LLM calls you do not see at the agent level.

| Cost component | Driver | Approximate cost (1,000 active users) | Optimisation lever |
|---|---|---|---|
| Embedding generation | Every episodic write + every retrieval query | ~$15/day at text-embedding-3-small pricing | Cache embeddings for repeated queries |
| Fact extraction (consolidation) | claude-haiku per 30-event batch, every 6h | ~$8/day | Increase batch size, reduce frequency for low-activity users |
| PII LLM scrubbing | claude-haiku on high-sensitivity content only | ~$3/day | Restrict to regulated tenants, rely on patterns for others |
| PostgreSQL | Storage + compute for episodic + procedural | ~$60/month (db.t3.medium RDS) | Aggressive pruning, archive to S3 Parquet after 90 days |
| Qdrant | Vector storage for semantic memories | ~$40/month (self-hosted, 2 CPU / 4GB) | Set collection quantisation, reduce vector dimensions for low-precision use cases |
| Redis | Cache and queue | ~$25/month (cache.t3.micro ElastiCache) | Low – Redis is small relative to other stores |
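
The per-day figures above are straightforward to sanity-check: daily cost is items embedded per day, times tokens per item, times the per-token price. The following is a sketch with made-up inputs; every number and parameter name here is an illustrative assumption, not a measured rate or a quoted price.

```javascript
// Rough daily embedding-cost estimator for the table above. All inputs are
// illustrative assumptions - plug in your own traffic numbers and current pricing.
function dailyEmbeddingCostUSD({
  activeUsers,
  itemsPerUserPerDay,        // episodic writes + retrieval queries per user
  avgTokensPerItem,
  pricePerMillionTokensUSD,  // check your provider's current embedding price
}) {
  const tokensPerDay = activeUsers * itemsPerUserPerDay * avgTokensPerItem;
  return (tokensPerDay / 1_000_000) * pricePerMillionTokensUSD;
}
```

Because cost scales linearly in every input, the same function doubles as a what-if tool for the optimisation levers: halving items per day (by caching embeddings) halves the line item.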

Decision Framework: Which Memory Type to Use

The most common implementation mistake is using the wrong memory type for a given piece of information. This framework maps information types to the correct store.

flowchart TD
    Q["New piece of information\nto store in agent memory"]

    Q --> A{"Is it tied to\na specific event\nor timestamp?"}
    A -- yes --> B{"Will you need\nit verbatim or\njust the gist?"}
    B -- verbatim --> EP["Episodic memory\nPostgreSQL + pgvector\nPart 2"]
    B -- gist --> SM["Semantic memory\nQdrant\nPart 3"]
    A -- no --> C{"Is it a fact\nabout a person,\nsystem, or domain?"}
    C -- yes --> SM
    C -- no --> D{"Is it a sequence\nof steps or\na learned pattern?"}
    D -- yes --> PR["Procedural memory\nPostgreSQL\nPart 4"]
    D -- no --> EP

    style EP fill:#1e3a5f,color:#fff
    style SM fill:#166534,color:#fff
    style PR fill:#713f12,color:#fff
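
The flowchart reduces to a pure function over a few yes/no questions, which makes the routing trivial to unit-test. A sketch follows; the function and field names are hypothetical, not part of the earlier clients.

```javascript
// The decision flowchart above, encoded as a pure function.
// Returns which store a new piece of information belongs in.
function chooseMemoryStore({ tiedToEvent, needVerbatim, isFact, isLearnedSequence }) {
  if (tiedToEvent) return needVerbatim ? 'episodic' : 'semantic';
  if (isFact) return 'semantic';
  if (isLearnedSequence) return 'procedural';
  return 'episodic'; // default: keep the raw record and let consolidation sort it out
}
```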

Production Readiness Checklist

| Area | Requirement | Covered in |
|---|---|---|
| Storage | PostgreSQL RLS enabled on all memory tables | Part 7 |
| Storage | HNSW indexes on all embedding columns | Parts 2, 4 |
| Storage | Qdrant payload indexes on tenant_id and user_id | Part 3 |
| Privacy | PII scrubbing in write path | Part 7 |
| Privacy | Right to erasure endpoint implemented and tested | Part 7 |
| Security | Access control grants table with deny-by-default | Part 7 |
| Security | Tamper-evident audit log with hash chain | Part 7 |
| Operations | Consolidation workers running with XAUTOCLAIM recovery | Part 5 |
| Operations | Nightly pruning job active | Parts 2, 4 |
| Operations | Audit chain verification job active | Part 7 |
| Monitoring | Retrieval latency p50/p95 dashboards | Part 8 |
| Monitoring | Consolidation lag alert configured | Part 8 |
| Resilience | Memory writes are async and non-blocking to agent response | Part 2 |
| Resilience | Consolidation jobs are durable via Redis Streams | Part 5 |

Series Recap

Across eight parts, this series built a complete production agent memory system from first principles. Part 1 established why stateless agents fail in enterprise and introduced the three memory types. Parts 2 through 4 built each type in a different language: episodic memory in Node.js with PostgreSQL and pgvector, semantic memory in Python with Qdrant, and procedural memory in C# with PostgreSQL. Part 5 added the consolidation worker that compresses episodic history into semantic knowledge on a rolling schedule. Part 6 extended the architecture to multi-agent scenarios with Redis-backed shared memory and workspace coordination. Part 7 added the enterprise security layer: row-level tenant isolation, PII scrubbing, RBAC, and tamper-evident audit logging. This final part assembled everything into a deployable system with infrastructure configuration, unified client, monitoring, cost model, and operational guidance.

The result is an agent that genuinely remembers: not just the last few messages, but what it learned about a user over months, the approaches that worked on similar problems, and the constraints it must respect. That is the foundation of an agent that gets better over time rather than starting from zero at every session boundary.
