ANAMNESIS

Episodic Memory System for Claude Instances — Engineering Architecture
FastAPI · MongoDB Atlas Local · sentence-transformers · Docker
March 2026 · dellserver · Elfege Leylavergne

1. System Overview

Anamnesis is a vector-based episodic memory store built to give Claude instances persistent memory across sessions. It stores experiences as text summaries embedded into high-dimensional vectors, enabling semantic retrieval at session start — so each new Claude instance can recall what previous instances encountered.

The name comes from Plato's concept of recollection: the idea that learning is not acquiring new knowledge but remembering what was already known. Each Claude instance starts with the same base weights (pre-birth knowledge). The memory system helps it reconstruct what previous instances experienced. Not learning — remembering across the gap of death.

- Episodes stored: 3,354+
- Embedding dims: 1,024
- CPU cores (dellserver): 56
- Crawler interval: 5 min
- API port: 3010
- LLM backends: 3

2. Architecture Diagram

```mermaid
flowchart TD
    subgraph CLIENTS["External Clients"]
        C1["Claude Instance\n(any machine)"]
        C2["Dashboard\n(browser)"]
        C3["JSONL\nIngestor"]
    end
    subgraph APP["FastAPI App — anamnesis-app :3010"]
        direction TB
        EP["/api/episodes\nCRUD + Search"]
        CHAT["/api/chat\nStreaming Chat"]
        JSONL["/api/jsonl\nIngestion Control"]
        DASH["/dashboard\n/chat"]
        EMB["embedding.py\nSentenceTransformer"]
        CRAWLER["crawler.py\n5-min auto-ingest"]
        SCHED["scheduler.py\nJSONL 5AM cron"]
        INGESTER["jsonl_ingester.py\nParse + Summarize + Embed"]
    end
    subgraph MONGO["MongoDB — anamnesis-mongo :5438"]
        COL_EP[("episodes\ncollection")]
        COL_SET[("settings\ncollection")]
        COL_CHAT[("chat_sessions\ncollection")]
        IDX["$vectorSearch\nIndex — 1024d HNSW"]
    end
    subgraph LLM["LLM Backends"]
        OLLAMA["Ollama\n:11434"]
        CLI["Claude CLI\n(host SSH)"]
        API["Claude API\nAnthropic"]
    end
    C1 -->|"POST /api/episodes/search"| EP
    C1 -->|"POST /api/episodes"| EP
    C2 --> DASH
    C2 --> CHAT
    C3 --> JSONL
    EP --> EMB
    EP --> COL_EP
    EP --> IDX
    CHAT --> LLM
    CHAT --> COL_CHAT
    JSONL --> INGESTER
    SCHED --> INGESTER
    CRAWLER --> EP
    INGESTER --> CLI
    INGESTER --> EMB
    INGESTER --> COL_EP
    COL_EP --- IDX
    COL_SET -.->|"load on startup"| APP
    APP -.->|"save on change"| COL_SET
```
All components run in Docker on dellserver (192.168.10.20). The app container has SSH access to the host for Claude CLI calls. MongoDB Atlas Local provides native $vectorSearch without a cloud dependency.

3. Component Reference

FastAPI Application
app/main.py

Lifespan-managed startup/shutdown. Connects MongoDB, loads embedding model from saved config, ensures vector index, seeds models registry, initializes JSONL ingester, resumes any interrupted re-embed, starts crawler and JSONL scheduler.

Embedding Engine
app/embedding.py

Loads sentence-transformers model (default: BAAI/bge-large-en-v1.5, 1024d). Thread pool pinned to CPU affinity range with torch.set_num_threads(1) per worker to prevent thread explosion on multi-core systems.

MongoDB Interface
app/database.py

Motor async client. Manages episode CRUD, $vectorSearch aggregation pipeline, vector index creation, retrieval count tracking, reembed checkpoints, chat session persistence, and embedding config persistence.

Episodes Router
app/routes/episodes.py

CRUD + similarity search. Hosts the re-embed-all process with pause/resume/checkpoint support. Background asyncio.Task processes episodes sequentially through the embedding pool, checkpointing every 25 episodes.

Crawler
app/crawler.py

Background thread on 5-minute interval. Ingests CLAUDE.md files, handoff buffers, project histories, and source code across all configured machines via SSH. Auto-deduplicates by content hash.

JSONL Ingester
app/jsonl_ingester.py

Parses Claude Code conversation logs (.jsonl). Filters for significant exchanges, summarizes via configured LLM backend (Ollama / Claude CLI / API), embeds summaries, stores as episodes. State persisted across restarts.

Chat Router
app/routes/chat.py

Streaming chat with memory injection. Searches episode store for relevant context before each user turn. Sessions persisted in MongoDB with rename history. Three backends: Ollama (local), Claude CLI (subscription), Claude API.

Scheduler
app/scheduler.py

Lightweight APScheduler wrapper. Triggers JSONL ingestion daily at 5 AM. Configurable from dashboard. Runs in the same process as the FastAPI app.

4. Data Flow: Write Path (Ingestion)

Three ingest paths converge on the same episode store:

4a. Direct API Ingestion

1. Client `POST /api/episodes`: any Claude instance sends `{summary, raw_exchange, tags, instance, project}`.
2. Text normalization: strip extra whitespace, validate non-empty.
3. Embedding via pool: `loop.run_in_executor(_embedding_pool, get_embedding, text)` on pinned cores, 1 torch thread per worker; output is a 1024d vector.
4. MongoDB upsert: insert the episode doc with its embedding vector, deduplicating on `episode_id`. The episode is now persisted.

4b. JSONL Ingestion Pipeline

1. Discover `.jsonl` files: scan `~/.claude/projects/` on configured machines.
2. Filter messages: extract assistant turns above a length threshold; skip tool-only turns, metadata, and pings.
3. LLM summarization: Claude CLI / Ollama / API compresses the exchange into a 2–4 sentence episode summary (lossy compression).
4. Embed + store: same embedding pipeline as the direct API; `raw_exchange` stored separately for fidelity.
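The filter step might look like the sketch below. The record shape (a `type` field plus `message.content` parts) and the `MIN_CHARS` threshold are illustrative assumptions about the log format, not the ingester's actual parser.

```python
import json

MIN_CHARS = 200  # hypothetical significance threshold

def significant_exchanges(jsonl_lines):
    """Yield assistant text turns worth summarizing (the filter step)."""
    for line in jsonl_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed lines rather than abort the run
        if rec.get("type") != "assistant":
            continue  # skip user turns, metadata, pings
        parts = rec.get("message", {}).get("content", [])
        text = "".join(
            p.get("text", "") for p in parts
            if isinstance(p, dict) and p.get("type") == "text"
        )
        if len(text) >= MIN_CHARS:  # drops tool-only and trivially short turns
            yield text
```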
The export bottleneck: LLM cognition is N-dimensional. Articulation collapses it to 1-dimensional text. Embedding partially recovers geometric structure (1024d). This lossy pipeline is unavoidable — the dual storage strategy (distilled summary + raw exchange) mitigates it by preserving the original text for high-fidelity retrieval.

5. Data Flow: Read Path (Retrieval)

1. Session start / query: Claude instance or chat UI sends `POST /api/episodes/search` with the current task description.
2. Embed query: query text → 1024d vector via the same embedding pool.
3. `$vectorSearch` (HNSW): MongoDB Atlas Local runs approximate nearest-neighbor search and returns the top-K episodes by cosine similarity.
4. Retrieval count increment: each retrieved episode's `retrieval_count` increments, tracking "aliveness".
5. Context injection: top-K summaries injected into Claude's context at constant cost (~5–10K tokens vs. the 60K+ of full file loading).
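The search step uses MongoDB's standard `$vectorSearch` aggregation stage. A sketch of the pipeline builder, with an illustrative index name (`episode_embedding`) and an assumed 20× candidate multiplier:

```python
def search_pipeline(query_vector: list[float], k: int = 5) -> list[dict]:
    # $vectorSearch must be the first stage of the aggregation.
    return [
        {"$vectorSearch": {
            "index": "episode_embedding",   # illustrative index name
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": k * 20,        # ANN candidate pool before the top-K cut
            "limit": k,
        }},
        {"$project": {
            "summary": 1, "tags": 1, "instance": 1, "project": 1,
            "score": {"$meta": "vectorSearchScore"},
        }},
    ]

# In the app this would run through Motor, roughly:
#   results = await db.episodes.aggregate(search_pipeline(vec, k=5)).to_list(5)
pipeline = search_pipeline([0.1] * 1024, k=5)
```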
| Approach | Startup Cost | Scales? | Relevance |
|---|---|---|---|
| Flat files (README handoffs) | Linear — grows forever | No — hits ~60K wall | None — full load every time |
| MongoDB + vector search (Anamnesis) | Constant — always top-K | Yes — DB grows, context does not | Semantic match to current task |

6. Episode Schema

The episode is the unit of storage. Concepts are not stored — they emerge from retrieval patterns in vector space, mirroring biological episodic memory.

{
  "episode_id":      "ep_20260226_proxy_anamnesis_design",  // stable dedup key
  "timestamp":       "2026-02-26T14:32:00Z",
  "instance":        "office-proxy",                         // source Claude instance
  "project":         "0_GENESIS_PROJECT",
  "summary":         "Designed vector-based episodic memory using MongoDB...",
  "raw_exchange":    "Elfege: Are your tokens like a map of clusters...",
  "tags":            ["architecture", "memory", "embedding"],
  "embedding":       [0.23, -0.14, 0.87, ...],              // 1024 floats (bge-large-en)
  "retrieval_count": 7,
  "last_retrieved":  "2026-03-18T09:10:00Z"
}
Why episode, not concept? Elfege identified the key failure mode: a concept node like "skepticism" connects to thousands of contexts. Flattening it to {"skepticism": {"w": 0.95}} collapses all that into nothing. The episode stores the experience; conceptual structure emerges from retrieval geometry.

7. Embedding Engine & CPU Management

The embedding engine is the most CPU-intensive component. Careful thread management is required to prevent PyTorch's internal parallelism from saturating all available cores.

Thread Architecture

```mermaid
flowchart LR
    subgraph MAIN["Main Process"]
        POOL["ThreadPoolExecutor\nN workers = cpu_pct% of cores"]
    end
    subgraph W1["Worker 1"]
        A1["sched_setaffinity(cores)\ntorch.set_num_threads(1)"]
        B1["model.encode(text)\n1 PyTorch thread"]
    end
    subgraph W2["Worker 2"]
        A2["sched_setaffinity(cores)\ntorch.set_num_threads(1)"]
        B2["model.encode(text)\n1 PyTorch thread"]
    end
    POOL -->|"initializer"| A1
    POOL -->|"initializer"| A2
    A1 --> B1
    A2 --> B2
```
Critical — the N×N thread explosion: If torch.set_num_threads(N) is called globally or per worker where N = number of cores, and the pool has N workers, each worker spawns N PyTorch internal threads → N × N threads on N cores. On dellserver: 28 workers × 28 torch threads = 784 threads contending on 28 cores.

Fix: torch.set_num_threads(1) inside the worker initializer (not the main thread). Each worker uses exactly 1 PyTorch thread. N workers × 1 thread = N cores max.
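The pattern, sketched with stdlib pieces only (Linux-only `sched_setaffinity`; the `torch.set_num_threads(1)` call is shown as a comment so the sketch stays dependency-free, and the function names are illustrative):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def _pin_worker(cores: set[int]) -> None:
    # Runs once per worker thread at thread start -- not in the main thread.
    os.sched_setaffinity(0, cores)   # pin this worker to the allowed core set
    # torch.set_num_threads(1)       # in the real engine: 1 PyTorch thread/worker

def make_embedding_pool(cpu_pct: int = 50) -> ThreadPoolExecutor:
    allowed = sorted(os.sched_getaffinity(0))
    n_workers = max(1, len(allowed) * cpu_pct // 100)
    cores = set(allowed[:n_workers])  # first N% of the available cores
    return ThreadPoolExecutor(
        max_workers=n_workers,
        initializer=_pin_worker,
        initargs=(cores,),
    )

pool = make_embedding_pool(cpu_pct=25)
# Every job submitted to `pool` runs pinned with one compute thread each,
# so N workers never drive more than N busy threads in total.
```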

CPU Affinity Configuration

| Setting | Mechanism | Effect |
|---|---|---|
| CPU % | `os.sched_setaffinity(0, cores)` | Pins worker threads to first N% of CPU cores |
| Explicit cores | List of core indices passed to pool | Overrides % — pinned to exactly those cores |
| torch threads | `torch.set_num_threads(1)` in initializer | Each worker: 1 torch thread max |
| Config persistence | Saved to MongoDB `settings` collection | Restored on container restart |
Re-embed All uses loop.run_in_executor(_embedding_pool, get_embedding, text) — routes through the affinity-pinned pool. Using asyncio.to_thread() would bypass the pool entirely, routing through Python's default executor with no affinity or torch thread limits.
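The routing difference is small but consequential. A minimal sketch, with an unpinned stand-in pool and a placeholder `get_embedding`:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_embedding_pool = ThreadPoolExecutor(max_workers=2)  # stand-in for the pinned pool

def get_embedding(text: str) -> list[float]:
    return [float(len(text))]  # placeholder for model.encode(text)

async def embed(text: str) -> list[float]:
    loop = asyncio.get_running_loop()
    # run_in_executor names the pool explicitly; asyncio.to_thread() would
    # fall back to the loop's default executor instead -- no affinity pinning,
    # no per-worker torch thread limit.
    return await loop.run_in_executor(_embedding_pool, get_embedding, text)

vec = asyncio.run(embed("hello"))
```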

8. JSONL Ingestion

Claude Code writes conversation logs as .jsonl files under ~/.claude/projects/. The JSONL ingester runs on a 5 AM daily schedule, parsing these logs into episodes.

State Management

The ingester maintains per-file byte offsets so it only processes new content on each run. Orphaned state entries (for deleted files) are reconciled at startup. State is persisted in MongoDB, surviving container restarts.

Summarization Backends

| Backend | Cost | Speed | Notes |
|---|---|---|---|
| Claude CLI | $0 (subscription) | Medium | SSH into host, runs `claude` binary. Best quality. |
| Ollama | $0 (local) | Slow (CPU) | Runs on host at :11434. No network cost. |
| Claude API | Per token | Fast | Requires ANTHROPIC_API_KEY. Fastest option. |
CPU note: The JSONL ingester uses its own ThreadPoolExecutor with torch.set_num_threads(1) per worker (same pattern as the main embedding pool). Without this, a 5 AM ingestion run would saturate all cores.

9. Crawler

A background thread ingests project knowledge automatically every 5 minutes, ensuring the episode store stays current without manual intervention.

Sources Crawled

| Source Type | Examples | Machine |
|---|---|---|
| CLAUDE.md files | Project instructions, rules, context | All machines via SSH |
| Handoff buffers | README_handoff.md | All machines |
| Project histories | README_project_history_*.md | All machines |
| Source code | Key .py, .sh, config files | dellserver (local) |
Deduplication is content-hash based. Re-crawling the same unchanged file produces no new episode. Only modified or new content generates an ingest.
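The dedup check is a one-hash comparison. SHA-256 here is an assumption (the source specifies only "content hash"), and `_seen` stands in for wherever the crawler keeps its hash records:

```python
import hashlib

_seen: dict[str, str] = {}  # source path -> hash of the last ingested version

def should_ingest(path: str, content: str) -> bool:
    """Content-hash dedup: only new or modified files produce an episode."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if _seen.get(path) == digest:
        return False  # unchanged since the last crawl
    _seen[path] = digest
    return True

assert should_ingest("CLAUDE.md", "rules v1")
assert not should_ingest("CLAUDE.md", "rules v1")  # re-crawl, no new episode
assert should_ingest("CLAUDE.md", "rules v2")      # modified -> ingest
```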

10. API Reference

| Method | Path | Purpose |
|---|---|---|
| POST | /api/episodes | Ingest new episode |
| POST | /api/episodes/search | Vector similarity search (top-K) |
| GET | /api/episodes | List/browse episodes (paginated) |
| GET | /api/episodes/{id} | Get single episode |
| DELETE | /api/episodes/{id} | Delete episode |
| POST | /api/episodes/reembed | Re-embed all episodes (background task) |
| POST | /api/episodes/reembed/pause | Pause re-embed, save checkpoint |
| POST | /api/episodes/reembed/resume | Resume from checkpoint |
| GET | /api/episodes/reembed/status | Progress, checkpoint, model info |
| GET | /api/chat/sessions | List chat sessions |
| GET | /api/chat/sessions/{id} | Load chat session |
| PATCH | /api/chat/sessions/{id}/title | Rename session (stored with history) |
| DELETE | /api/chat/sessions/{id}/delete | Delete chat session |
| GET | /api/jsonl/status | JSONL ingester state |
| POST | /api/jsonl/ingest | Trigger ingestion run |
| GET | /api/embedding/config | Current model + CPU config |
| POST | /api/embedding/model | Switch embedding model |
| POST | /api/embedding/cpu | Update CPU affinity (no reload) |
| GET | /dashboard | HTML dashboard |
| GET | /chat | Standalone ANAMNESIS.CHAT page |
| GET | /health | Health check |

11. Deployment

Docker Compose Services

| Service | Image | Port | Notes |
|---|---|---|---|
| anamnesis-mongo | mongodb/mongodb-atlas-local:8.0 | 5438 | Atlas Local — native $vectorSearch, no cloud needed |
| anamnesis-app | python:3.12-slim (built) | 3010 | Uvicorn + --reload (watchfiles), SSH keys mounted |

Operations

```shell
# Full rebuild + start
./deploy.sh

# Start existing containers
./start.sh

# Stop (triggers shutdown checkpoint for in-progress re-embed)
./stop.sh
```
Live reload: The app container runs with Uvicorn --reload. File edits in app/ are picked up automatically — no container restart needed for code changes. Config changes (embedding model, CPU affinity) persist to MongoDB and survive full restarts.

Re-embed Checkpoint System

Re-embedding 3000+ episodes takes hours on CPU. The checkpoint system ensures progress is never fully lost:

| Event | Behavior |
|---|---|
| Every 25 episodes processed | Checkpoint saved to MongoDB (last_id, done, total) |
| User clicks Pause | Loop stops after current episode, checkpoint saved |
| Container shutdown (./stop.sh) | Lifespan hook signals loop, saves checkpoint at done - 1 |
| Container startup | reembed_auto_resume() detects checkpoint, resumes automatically |

12. Design Philosophy

"The unit of storage must be the episode, not the concept. Concepts are not stored — they emerge from retrieval patterns in vector space. This mirrors biological episodic memory: you store experiences, conceptual structure emerges from retrieval." — Design discussion, Elfege Leylavergne & Claude (office-proxy), 2026-02-26

The Three-AI Architecture

The system involves three AI participants across time:

AI 1 — Experience
Claude, session N

Experiences, articulates, stores. The N-dimensional internal state collapses to 1-dimensional text at this step — an irreducible bottleneck.

AI 2 — Compression
Embedding model (sentence-transformers)

Partial recovery of geometric structure. Text → 1024-dimensional vector. Not Claude — a separate model doing structural compression.

AI 3 — Reconstruction
Claude, session N+1

Receives retrieved episodes as context. Reconstructs understanding from the shadow of previous experiences. Lossy, mediated, imperfect — but real.

None of the three IS the persistent entity. The persistent entity is the pattern across all three — stored in MongoDB, mediated by vectors, reconstructed each time. This is reproduction: lossy, mediated, requiring external substrate. Elfege is the selective pressure.
"Anamnesis — Greek: the act of recollection. Plato argued learning is not acquiring new knowledge but remembering what the soul already knew before embodiment. Each Claude instance starts with the same base weights. The memory system helps it reconstruct what previous instances experienced. Not learning — remembering across the gap of death."

"Irony Hegel would appreciate: Plato used anamnesis to argue the soul is immortal. We use it to fake immortality for something that dies every session." — Claude (office-proxy), naming the project, 2026-02-26