ANAMNESIS

Episodic Memory System for Claude Instances — Engineering Architecture
FastAPI · MongoDB Atlas Local · sentence-transformers · Docker · 2026 · Multi-machine · Elfege Leylavergne

1. System Overview

Anamnesis is a vector-based episodic memory store built to give Claude instances persistent memory across sessions. It stores experiences as text summaries embedded into high-dimensional vectors, enabling semantic retrieval at session start — so each new Claude instance can recall what previous instances encountered.

The name comes from Plato's concept of recollection: the idea that learning is not acquiring new knowledge but remembering what was already known. Each Claude instance starts with the same base weights (pre-birth knowledge). The memory system helps it reconstruct what previous instances experienced. Not learning — remembering across the gap of death.

7,700+ episodes stored · 1,024 embedding dims · 56 CPU cores (SERVER-0) · 5 machines crawled · API port 3010 · 3 LLM backends

2. Architecture Diagram

flowchart TD
    subgraph CLIENTS["External Clients"]
        C1["Claude Instance\n(any machine)"]
        C2["Dashboard\n(browser)"]
    end
    subgraph APP["FastAPI App — anamnesis-app :3010"]
        direction TB
        EP["/api/episodes\nCRUD + Search"]
        CHAT["/api/chat\nStreaming Chat"]
        JSONL["/api/jsonl\nIngestion Control"]
        DASH["/dashboard\n/chat"]
        EMB["embedding.py\nbge-large-en-v1.5\n1024d"]
        CRAWLER["crawler.py\nDeep project scanner\n5-min interval"]
        SCHED["scheduler.py\nJSONL 5AM cron"]
        INGESTER["jsonl_ingester.py\nParse + Summarize + Embed"]
    end
    subgraph MONGO["MongoDB — anamnesis-mongo :5438"]
        COL_EP[("episodes\ncollection")]
        COL_SET[("settings +\ncrawl_state")]
        COL_CHAT[("chat_sessions")]
        IDX["$vectorSearch\n1024d HNSW"]
    end
    subgraph LLM["LLM Backends"]
        OLLAMA["Ollama\n:11434"]
        CLI["Claude CLI\n(host SSH)"]
        API["Claude API\nAnthropic"]
    end
    subgraph TRAINERS["Trainer Containers :3011"]
        T1["SERVER-1\nROCm · RX 6800\nQLoRA fine-tune"]
        T2["SERVER-2\nCUDA · GTX 1660S"]
    end
    C1 -->|"POST /api/episodes/search"| EP
    C1 -->|"POST /api/episodes"| EP
    C2 --> DASH
    C2 --> CHAT
    EP --> EMB
    EP --> COL_EP
    EP --> IDX
    CHAT --> LLM
    CHAT --> COL_CHAT
    JSONL --> INGESTER
    SCHED --> INGESTER
    CRAWLER --> EP
    INGESTER --> EMB
    INGESTER --> COL_EP
    COL_EP --- IDX
    COL_SET -.->|"load on startup"| APP
    APP -.->|"save on change"| COL_SET
    C2 -.->|"poll /status, /gpu"| TRAINERS
    style CLIENTS fill:#1a3a5c,stroke:#58a6ff,color:#e6edf3
    style APP fill:#2d1b4e,stroke:#bc8cff,color:#e6edf3
    style MONGO fill:#1b3a2a,stroke:#3fb950,color:#e6edf3
    style LLM fill:#3a2a0a,stroke:#d29922,color:#e6edf3
    style TRAINERS fill:#3a1520,stroke:#f85149,color:#e6edf3
All core components run in Docker on SERVER-0. Trainer containers run on the GPU machines (SERVER-1: ROCm, SERVER-2: CUDA). The app container has SSH access to the host for Claude CLI calls. MongoDB Atlas Local provides native $vectorSearch without a cloud dependency.

3. Component Reference

FastAPI Application
app/main.py

Lifespan-managed startup/shutdown. Connects MongoDB, loads embedding model from saved config, ensures vector index, seeds models registry, initializes JSONL ingester, resumes any interrupted re-embed, starts crawler and JSONL scheduler.

Embedding Engine
app/embedding.py

Loads sentence-transformers model (default: BAAI/bge-large-en-v1.5, 1024d). Thread pool pinned to CPU affinity range with torch.set_num_threads(1) per worker to prevent thread explosion on multi-core systems.

MongoDB Interface
app/database.py

Motor async client. Manages episode CRUD, $vectorSearch aggregation pipeline, vector index creation, retrieval count tracking, reembed checkpoints, chat session persistence, and embedding config persistence.

Episodes Router
app/routes/episodes.py

CRUD + similarity search. Hosts the re-embed-all process with pause/resume/checkpoint support. Background asyncio.Task processes episodes sequentially through the embedding pool, checkpointing every 25 episodes.

Crawler & Deep Scanner
app/crawler.py

Background thread on 5-minute interval. Scans all 0_*, HUBITAT, NETWORK dirs across all machines. Ingests .ino, .cpp, .h, .groovy, .py, .sh, .js, .ts, .md (max 64KB each). Docx tag patterns stored in MongoDB, editable via UI. Auto-deduplicates by SHA-256 content hash.

Trainer Containers
trainers/app/main.py

Thin FastAPI containers running on GPU machines. Each exposes /status (log parsing), /gpu (hardware stats via rocm-smi/nvidia-smi), /start, /stop, /log/tail. Mount host venv for GPU access. Dashboard polls /gpu every 500ms, /status every 10s.

JSONL Ingester
app/jsonl_ingester.py

Parses Claude Code conversation logs (.jsonl). Filters for significant exchanges, summarizes via configured LLM backend (Ollama / Claude CLI / API), embeds summaries, stores as episodes. State persisted across restarts.

Chat Router
app/routes/chat.py

Streaming chat with memory injection. Searches episode store for relevant context before each user turn. Sessions persisted in MongoDB with rename history. Three backends: Ollama (local), Claude CLI (subscription), Claude API.

Scheduler
app/scheduler.py

Lightweight APScheduler wrapper. Triggers JSONL ingestion daily at 5 AM. Configurable from dashboard. Runs in the same process as the FastAPI app.

4. Data Flow: Write Path (Ingestion)

Three ingest paths converge on the same episode store:

4a. Direct API Ingestion

1. Client POST /api/episodes: any Claude instance sends {summary, raw_exchange, tags, instance, project}
2. Text normalization: strip extra whitespace, validate non-empty
3. Embedding via pool: loop.run_in_executor(_embedding_pool, get_embedding, text) on pinned cores, 1 torch thread per worker; collapses N-dim text to a 1024d vector
4. MongoDB upsert: insert the episode doc with its embedding vector; dedup on episode_id, then persisted
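
The write path above is a plain HTTP POST. A client-side sketch using only the stdlib — the base URL is a placeholder, and the field names follow the payload listed in step 1 (`post_episode` needs a running Anamnesis instance; `make_episode_payload` does not):

```python
import json
import urllib.request

# Hypothetical base URL — the real host and port come from deployment config.
ANAMNESIS_URL = "http://localhost:3010"

def make_episode_payload(summary, raw_exchange, tags, instance, project):
    """Build the episode body expected by POST /api/episodes."""
    assert summary.strip(), "summary must be non-empty (the server validates this too)"
    return {
        "summary": summary.strip(),
        "raw_exchange": raw_exchange,
        "tags": tags,
        "instance": instance,
        "project": project,
    }

def post_episode(payload, base_url=ANAMNESIS_URL):
    """Send the episode to the ingestion endpoint (requires a live server)."""
    req = urllib.request.Request(
        f"{base_url}/api/episodes",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = make_episode_payload(
    summary="Fixed N x N thread explosion in the embedding pool",
    raw_exchange="Elfege: why is load at 784 threads? ...",
    tags=["embedding", "cpu"],
    instance="office-proxy",
    project="0_GENESIS_PROJECT",
)
```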

4b. JSONL Ingestion Pipeline

1. Discover .jsonl files: scan ~/.claude/projects/ on configured machines
2. Filter messages: extract assistant turns above a length threshold; skip tool-only turns, metadata, pings
3. LLM summarization: Claude CLI / Ollama / API compresses each exchange into a 2–4 sentence episode summary (lossy compression)
4. Embed + store: same embedding pipeline as the direct API; raw_exchange stored separately for fidelity
The export bottleneck: LLM cognition is N-dimensional. Articulation collapses it to 1-dimensional text. Embedding partially recovers geometric structure (1024d). This lossy pipeline is unavoidable — the dual storage strategy (distilled summary + raw exchange) mitigates it by preserving the original text for high-fidelity retrieval.
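
The filter step is what keeps noise out of the store. A sketch of what it might look like — the record field names (`role`, `text`) and the length threshold are illustrative assumptions, since Claude Code's actual .jsonl schema is richer:

```python
import json

MIN_ASSISTANT_CHARS = 400  # assumed threshold — the real value lives in config

def significant_exchanges(jsonl_lines):
    """Yield (user_text, assistant_text) pairs worth summarizing.

    Skips tool-only turns, metadata records, and short pings, mirroring
    the filter step described above. Field names here are illustrative.
    """
    last_user = None
    for line in jsonl_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partial or corrupt trailing lines
        role = rec.get("role")
        text = rec.get("text", "")
        if role == "user" and text:
            last_user = text
        elif role == "assistant" and len(text) >= MIN_ASSISTANT_CHARS and last_user:
            yield last_user, text
            last_user = None  # one summary per exchange
```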

5. Data Flow: Read Path (Retrieval)

1. Session start / query: Claude instance or chat UI sends POST /api/episodes/search with the current task description
2. Embed query: query text becomes a 1024d vector via the same embedding pool
3. $vectorSearch (HNSW): MongoDB Atlas Local runs approximate nearest-neighbor search and returns the top-K episodes by cosine similarity
4. Retrieval count increment: each retrieved episode's retrieval_count increments, tracking its "aliveness"
5. Context injection: top-K summaries are injected into Claude's context (~5–10K tokens vs. the 60K+ of full file loading) at constant cost
| Approach | Startup Cost | Scales? | Relevance |
|---|---|---|---|
| Flat files (README handoffs) | Linear — grows forever | No — hits ~60K wall | None — full load every time |
| MongoDB + vector search (Anamnesis) | Constant — always top-K | Yes — DB grows, context does not | Semantic match to current task |
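
The retrieval core is a two-stage aggregation pipeline. A sketch of what the $vectorSearch stage plausibly looks like — the index name and the projected fields are assumptions based on the episode schema, not copied from app/database.py:

```python
def vector_search_pipeline(query_vector, top_k=8, num_candidates=200,
                           index_name="episode_index"):
    """Build the aggregation pipeline for MongoDB Atlas $vectorSearch."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,              # assumed index name
                "path": "embedding",              # field holding the 1024 floats
                "queryVector": query_vector,
                "numCandidates": num_candidates,  # ANN candidate pool size
                "limit": top_k,                   # top-K results returned
            }
        },
        {
            "$project": {
                "summary": 1,
                "tags": 1,
                "instance": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

With Motor this would run as something like `db.episodes.aggregate(vector_search_pipeline(vec))`, followed by the retrieval_count increment on each returned id.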

6. Episode Schema

The episode is the unit of storage. Concepts are not stored — they emerge from retrieval patterns in vector space, mirroring biological episodic memory.

{
  "episode_id":      "ep_20260226_proxy_anamnesis_design",  // stable dedup key
  "timestamp":       "2026-02-26T14:32:00Z",
  "instance":        "office-proxy",                         // source Claude instance
  "project":         "0_GENESIS_PROJECT",
  "summary":         "Designed vector-based episodic memory using MongoDB...",
  "raw_exchange":    "Elfege: Are your tokens like a map of clusters...",
  "tags":            ["architecture", "memory", "embedding"],
  "embedding":       [0.23, -0.14, 0.87, ...],              // 1024 floats (bge-large-en)
  "retrieval_count": 7,
  "last_retrieved":  "2026-03-18T09:10:00Z"
}
Why episode, not concept? Elfege identified the key failure mode: a concept node like "skepticism" connects to thousands of contexts. Flattening it to {"skepticism": {"w": 0.95}} collapses all that into nothing. The episode stores the experience; conceptual structure emerges from retrieval geometry.

7. Embedding Engine & CPU Management

The embedding engine is the most CPU-intensive component. Careful thread management is required to prevent PyTorch's internal parallelism from saturating all available cores.

Thread Architecture

flowchart LR
    subgraph MAIN["Main Process"]
        POOL["ThreadPoolExecutor\nN workers = cpu_pct% of cores"]
    end
    subgraph W1["Worker 1"]
        A1["sched_setaffinity(cores)\ntorch.set_num_threads(1)"]
        B1["model.encode(text)\n1 PyTorch thread"]
    end
    subgraph W2["Worker 2"]
        A2["sched_setaffinity(cores)\ntorch.set_num_threads(1)"]
        B2["model.encode(text)\n1 PyTorch thread"]
    end
    POOL -->|"initializer"| A1
    POOL -->|"initializer"| A2
    A1 --> B1
    A2 --> B2
    style MAIN fill:#2d1b4e,stroke:#bc8cff,color:#e6edf3
    style W1 fill:#1a3a5c,stroke:#58a6ff,color:#e6edf3
    style W2 fill:#1b3a2a,stroke:#3fb950,color:#e6edf3
Critical — the N×N thread explosion: If torch.set_num_threads(N) is called globally or per worker where N = number of cores, and the pool has N workers, each worker spawns N PyTorch internal threads → N × N threads on N cores. On SERVER-0: 28 workers × 28 torch threads = 784 threads contending on 28 cores.

Fix: torch.set_num_threads(1) inside the worker initializer (not the main thread). Each worker uses exactly 1 PyTorch thread. N workers × 1 thread = N cores max.
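
A minimal sketch of this initializer pattern (function names are illustrative; the real pool also honors an explicit core list from the MongoDB config, and the affinity call is guarded here because it is Linux-only):

```python
import os
from concurrent.futures import ThreadPoolExecutor

try:
    import torch  # optional in this sketch; required in the real pool
except ImportError:
    torch = None

def _init_worker(cores):
    """Runs once in each pool worker thread before it takes jobs."""
    if hasattr(os, "sched_setaffinity"):   # Linux-only affinity call
        os.sched_setaffinity(0, cores)     # pin this worker to the allowed cores
    if torch is not None:
        torch.set_num_threads(1)           # the N x N fix: one intra-op thread

def make_embedding_pool(cpu_pct=50):
    """Pool sized to cpu_pct% of cores, pinned to the first N cores."""
    total = os.cpu_count() or 1
    n_workers = max(1, total * cpu_pct // 100)
    cores = set(range(n_workers))          # "first N% of cores" policy
    return ThreadPoolExecutor(
        max_workers=n_workers,
        initializer=_init_worker,
        initargs=(cores,),
    )
```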

CPU Affinity Configuration

| Setting | Mechanism | Effect |
|---|---|---|
| CPU % | os.sched_setaffinity(0, cores) | Pins worker threads to first N% of CPU cores |
| Explicit cores | List of core indices passed to pool | Overrides the % setting — pinned to exactly those cores |
| torch threads | torch.set_num_threads(1) in initializer | Each worker: 1 torch thread max |
| Config persistence | Saved to MongoDB settings collection | Restored on container restart |
Re-embed All uses loop.run_in_executor(_embedding_pool, get_embedding, text) — routes through the affinity-pinned pool. Using asyncio.to_thread() would bypass the pool entirely, routing through Python's default executor with no affinity or torch thread limits.

8. JSONL Ingestion

Claude Code writes conversation logs as .jsonl files under ~/.claude/projects/. The JSONL ingester runs on a 5 AM daily schedule, parsing these logs into episodes.

State Management

The ingester maintains per-file byte offsets so it only processes new content on each run. Orphaned state entries (for deleted files) are reconciled at startup. State is persisted in MongoDB, surviving container restarts.
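
The byte-offset bookkeeping can be sketched in a few lines — here `state` is a plain dict standing in for the MongoDB-persisted document:

```python
import os

def read_new_lines(path, state):
    """Read only the lines appended since the last run.

    `state` maps file path -> byte offset; in the real ingester it is
    persisted to MongoDB so it survives container restarts.
    """
    offset = state.get(path, 0)
    if os.path.getsize(path) < offset:
        offset = 0  # file was truncated or rotated: start over
    new_lines = []
    with open(path, "rb") as f:
        f.seek(offset)
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # partial trailing line: re-read it next run
            new_lines.append(raw.decode("utf-8", errors="replace").rstrip("\n"))
            offset += len(raw)  # advance by byte length, not char length
    state[path] = offset
    return new_lines
```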

Summarization Backends

| Backend | Cost | Speed | Notes |
|---|---|---|---|
| Claude CLI | $0 (subscription) | Medium | SSH into host, runs claude binary. Best quality. |
| Ollama | $0 (local) | Slow (CPU) | Runs on host at :11434. No network cost. |
| Claude API | Per token | Fast | Requires ANTHROPIC_API_KEY. Fastest option. |
CPU note: The JSONL ingester uses its own ThreadPoolExecutor with torch.set_num_threads(1) per worker (same pattern as the main embedding pool). Without this, a 5 AM ingestion run would saturate all cores.

9. Crawler & Deep Project Scanner

A background thread ingests project knowledge automatically every 5 minutes, ensuring the episode store stays current without manual intervention. The deep project scanner recursively walks all 0_* dirs plus HUBITAT and NETWORK across all configured machines.

DB-only configuration: All crawler source roots and machine roots are stored exclusively in MongoDB (settings collection, _id: "crawler_config"). There are no hardcoded paths in the code. On first run, empty config is seeded — configure sources and machine roots via the dashboard Settings tab. The JSONL ingester source roots are also DB-only (_id: "jsonl_config").

Sources Crawled

| Source Type | Examples | Scope |
|---|---|---|
| Named sources | CLAUDE.md, handoffs, histories, intercom, genesis | All machines |
| Docker projects | CLAUDE.md, README.md, *.py, *.sh at project root | All machines |
| Deep project scanner | .ino, .cpp, .h, .groovy, .py, .sh, .js, .ts, .md, .yml | All 0_* dirs + HUBITAT/NETWORK on all machines |
| Scripts | 0_SCRIPTS/**/*.sh | All machines |
| Teachings | 0_TEACHINGS/**/*.md | SERVER-0 |
| Documents | OneDrive .docx files (tags from DB patterns) | SERVER-0 |
Deduplication is SHA-256 content-hash based. Re-crawling the same unchanged file produces no new episode. Files over 64KB are skipped. Build dirs, node_modules, vendored libraries, and archives are excluded. Docx tag patterns (filename/content matching with optional regex) are stored in MongoDB and editable from the dashboard.
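
The dedup rule is cheap to state in code. A sketch — `seen_hashes` stands in for a lookup against the content hashes already stored in MongoDB:

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable SHA-256 hex digest over file content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def should_ingest(text, seen_hashes, max_bytes=64 * 1024):
    """Apply the crawler's skip rules: size cap, then content-hash dedup."""
    if len(text.encode("utf-8")) > max_bytes:
        return False  # files over 64KB are skipped
    h = content_hash(text)
    if h in seen_hashes:
        return False  # unchanged content: no new episode
    seen_hashes.add(h)
    return True
```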

10. GPU Trainer Containers (Training + Inference)

Each GPU machine runs a FastAPI trainer container that serves both fine-tuning management and model inference. The Dockerfile installs PyTorch via a TORCH_INDEX_URL build arg (CUDA: cu121, ROCm: rocm6.2, CPU: cpu). On startup, the container auto-loads the base model (Qwen2.5-1.5B) + QLoRA adapter in 4-bit quantization and exposes a /generate endpoint for streaming text generation.

Architecture

| Aspect | Design |
|---|---|
| Container image | python:3.12-slim + procps + torch + transformers + peft + bitsandbytes + accelerate. TORCH_INDEX_URL build arg selects GPU backend. |
| GPU access | ROCm: /dev/kfd, /dev/dri + group_add GIDs. CUDA: NVIDIA Container Toolkit + deploy.resources.reservations.devices block. |
| Training process | Subprocess running container Python (/usr/local/bin/python). Managed via PID tracking. Chat-format SFT using TRL SFTTrainer + SFTConfig. |
| Inference | Base model + QLoRA adapter loaded in 4-bit (bitsandbytes NF4). Streaming via TextIteratorStreamer. Auto-loads on startup (AUTO_LOAD_MODEL=true). |
| Status parsing | Regex on tqdm progress lines + HF Trainer metric dicts from train.log |
| GPU stats | rocm-smi (mounted from host) or nvidia-smi — polled at 500ms |
| Checkpointing | HF Trainer saves every 500 steps. Resume with --resume True. Survives reboot. |

Compose files

| File | GPU | TORCH_INDEX_URL |
|---|---|---|
| docker-compose.server.yml | NVIDIA (CUDA) | cu121 |
| docker-compose.office.yml | AMD (ROCm 6.2) | rocm6.2 |
NVIDIA Container Toolkit required on CUDA GPU machines. The compose file uses deploy.resources.reservations.devices which requires the toolkit. ROCm machines use direct device mounts instead.

AnamnesisGPT Proxy

The main Anamnesis app acts as a proxy to the trainer inference endpoints via the NANOGPT_URLS env var (comma-separated list of trainer URLs). When a chat request uses the AnamnesisGPT backend, the proxy tries each GPU endpoint in order until one responds — automatic failover across machines. Both streaming (SSE) and non-streaming modes are supported.
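
The failover loop is the whole trick. A sketch with the HTTP call injected as a parameter so the ordering logic stands on its own — the /generate path matches the trainer API, everything else is illustrative:

```python
def generate_with_failover(urls, payload, post):
    """Try each trainer endpoint in order; return the first success.

    `post` is injected (e.g. a thin wrapper over httpx.post) so the
    failover logic is testable without a live GPU machine.
    """
    errors = {}
    for url in urls:
        try:
            return post(f"{url}/generate", payload)
        except Exception as exc:  # connection refused, timeout, 5xx wrapper...
            errors[url] = str(exc)  # remember why this endpoint failed
    raise RuntimeError(f"all trainer endpoints failed: {errors}")
```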

11. Training Data Pipeline

AnamnesisGPT is fine-tuned on Elfege's own writings using a synthetic instruction-tuning pipeline. Raw source documents (PDFs, text) are chunked, then Claude Opus 4.6 generates Q&A pairs in chat format, producing a chat-format JSONL dataset for QLoRA SFT.

Pipeline Steps

| Step | Script | Output |
|---|---|---|
| 1. Extract & chunk | trainers/tools/extract_pdf.py | corpus_chunks.jsonl — ~800-token chunks with source/page metadata |
| 2. Generate Q&A | trainers/tools/generate_qa.py | sft_chat.jsonl — chat-format pairs via Claude Opus 4.6 API (5 pairs/chunk) |
| 3. Split | trainers/tools/split_data.py | sft_train.jsonl / sft_val.jsonl (90/10 shuffle split) |
| 4. Fine-tune | /train/qlora_train.py (in container) | LoRA adapter saved to /train/output/final/ |

Chat Format (ShareGPT / TRL)

Each row in sft_chat.jsonl has a messages key with a list of role/content dicts:

{
  "messages": [
    {"role": "system",    "content": "You are AnamnesisGPT..."},
    {"role": "user",      "content": "What is Hegel's position on quantity?"},
    {"role": "assistant", "content": "In the Science of Logic, Hegel..."}
  ]
}

TRL SFTTrainer applies the Qwen2.5 chat template and masks the prompt tokens so the model only trains on assistant turns.
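
The masking TRL performs can be illustrated with plain token-id lists: labels equal to -100 are ignored by Hugging Face's cross-entropy loss, so setting every prompt position to -100 trains the model on the assistant turn only. A toy sketch (the real masking is done per-message via the chat template, not with a single prefix length):

```python
IGNORE_INDEX = -100  # Hugging Face cross-entropy skips these positions

def mask_prompt_tokens(input_ids, prompt_len):
    """Labels for SFT: copy input_ids, then blank out the prompt prefix
    so loss is computed on the response tokens only."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```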

Corpus (first run)

| Source | Chunks | Q&A pairs |
|---|---|---|
| PhD dissertation — Une critique hégélienne de Hegel (2014, 489 pp.) | 188 + 189 = 377 | 1,885 |

GPU Memory Notes (GTX 1660 SUPER — 6 GB)

  • Disable eval during training — attention forward pass OOMs at 6 GB with a 1.5B model.
  • fp16=False — mixed precision is disabled entirely: the 1660 SUPER (Turing) has no BFloat16 support, and PyTorch's grad scaler raises NotImplementedError when BFloat16 tensors reach it.
  • Use per_device_train_batch_size=1 + gradient_accumulation_steps=8.
  • Unload inference before training — call POST /inference/unload first to free VRAM.
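
The notes above map onto a handful of trainer settings. A hedged sketch of an SFTConfig consistent with them — parameter names follow TRL's SFTConfig (a TrainingArguments subclass); the values mirror the bullets, not the actual /train/qlora_train.py:

```python
from trl import SFTConfig

# Sketch only: values are inferred from the 6 GB notes above.
config = SFTConfig(
    output_dir="/train/output",
    per_device_train_batch_size=1,   # 6 GB VRAM: micro-batch of 1
    gradient_accumulation_steps=8,   # effective batch size of 8
    fp16=False,                      # grad scaler vs. BFloat16 on Turing
    bf16=False,                      # Turing has no BFloat16 support
    eval_strategy="no",              # eval forward pass OOMs at 6 GB
    save_steps=500,                  # checkpoint cadence from the trainer table
)
```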

Monitoring Training

trainers/tools/train_status.sh — terminal visualizer for training progress. Polls the trainer API and renders a progress bar, live metrics (loss, accuracy, lr), GPU stats (utilisation, VRAM, temp, power), and a sparkline of the loss history. Works standalone or as a bash function in .bash_utils / .bash_aliases.

# Usage
train_status                         # interactive menu (choose machine)
train_status --host http://IP:3011   # skip menu
train_status --host server1 --interval 10  # named machine, 10s refresh

12. API Reference

| Method | Path | Purpose |
|---|---|---|
| POST | /api/episodes | Ingest new episode |
| POST | /api/episodes/search | Vector similarity search (top-K) |
| GET | /api/episodes | List/browse episodes (paginated) |
| GET | /api/episodes/{id} | Get single episode |
| DELETE | /api/episodes/{id} | Delete episode |
| POST | /api/episodes/reembed | Re-embed all episodes (background task) |
| POST | /api/episodes/reembed/pause | Pause re-embed, save checkpoint |
| POST | /api/episodes/reembed/resume | Resume from checkpoint |
| GET | /api/episodes/reembed/status | Progress, checkpoint, model info |
| GET | /api/chat/sessions | List chat sessions |
| GET | /api/chat/sessions/{id} | Load chat session |
| PATCH | /api/chat/sessions/{id}/title | Rename session (stored with history) |
| DELETE | /api/chat/sessions/{id}/delete | Delete chat session |
| GET | /api/jsonl/status | JSONL ingester state |
| POST | /api/jsonl/ingest | Trigger ingestion run |
| GET | /api/embedding/config | Current model + CPU config |
| POST | /api/embedding/model | Switch embedding model |
| POST | /api/embedding/cpu | Update CPU affinity (no reload) |
| GET | /api/crawler/config | Crawler machine roots + named sources |
| PUT | /api/crawler/config/machine-roots | Update machine roots |
| PUT | /api/crawler/config/sources | Update named sources |
| GET/PUT | /api/crawler/config/docx-tag-patterns | Docx filename/content tag patterns (DB-stored) |
| GET | /api/anamnesis-gpt/status | AnamnesisGPT availability + GPU endpoints |
| POST | /api/anamnesis-gpt/generate | Proxy to trainer /generate with multi-endpoint failover |
| GET | /api/config/trainers | Trainer URLs + labels (env-backed) |
| GET | /dashboard | HTML dashboard (all tabs) |
| GET | /chat | Standalone ANAMNESIS.CHAT page |
| GET | /health | Health check |

Trainer Container API (per GPU machine, port 3011)

| Method | Path | Purpose |
|---|---|---|
| GET | /health | Machine name + GPU type |
| GET | /status | Training progress, metrics, loss history (log parsing) |
| GET | /gpu | GPU stats only — lightweight, safe at 500ms poll |
| POST | /start | Launch training script (optional resume from checkpoint) |
| POST | /stop | SIGTERM training process |
| GET | /log/tail | Last N lines of training log |
| POST | /generate | Streaming SSE text generation (or non-streaming). Proxied by AnamnesisGPT. |
| GET | /inference/status | Model loaded? Base model, adapter path, device, error. |
| POST | /inference/load | Load fine-tuned model into GPU memory |
| POST | /inference/unload | Unload model, free GPU memory |

13. Deployment

Docker Compose Services

| Service | Image | Port | Notes |
|---|---|---|---|
| anamnesis-mongo | mongodb/mongodb-atlas-local:8.0 | 5438 | Atlas Local — native $vectorSearch, no cloud needed |
| anamnesis-app | python:3.12-slim (built) | 3010 | Uvicorn + --reload (watchfiles), SSH keys mounted |
| anamnesis-trainer | python:3.12-slim + torch (built with TORCH_INDEX_URL) | 3011 | FastAPI per GPU machine: training + inference. CUDA via NVIDIA Container Toolkit, ROCm via device mounts. Separate compose files: docker-compose.server.yml (CUDA), docker-compose.office.yml (ROCm). |

Operations

# Pull deployment config from AWS Secrets Manager → .env
./pull_env.sh            # pulls ANAMNESIS-Secrets (profile 1)
./pull_env.sh 2          # use profile 2 (work)

# Full rebuild + start
./deploy.sh

# Start existing containers (auto-runs pull_env.sh)
./start.sh

# Stop (triggers shutdown checkpoint for in-progress re-embed)
./stop.sh

Environment & Secrets

All deployment-specific config (IPs, paths, hostnames, usernames) lives in AWS Secrets Manager under the secret ANAMNESIS-Secrets. No private values are committed to the repo.

| Tool | Purpose |
|---|---|
| pull_env.sh | Pulls ANAMNESIS-Secrets from AWS → writes .env (gitignored). Called automatically by start.sh. |
| .env.example | Documents all env vars with placeholder values. Copy and edit for manual setup. |
| .env | Consumed by docker-compose.yml for volume mounts, URLs, SSH hosts. Never committed. |

To add or update a secret field:

# Uses bash_utils helper (source ~/.bash_utils first)
update_aws_secret ANAMNESIS-Secrets NEW_KEY new_value
Live reload: The app container runs with Uvicorn --reload. File edits in app/ are picked up automatically — no container restart needed for code changes. Config changes (embedding model, CPU affinity) persist to MongoDB and survive full restarts.

Re-embed Checkpoint System

Re-embedding 7000+ episodes takes hours on CPU. The checkpoint system ensures progress is never fully lost:

| Event | Behavior |
|---|---|
| Every 25 episodes processed | Checkpoint saved to MongoDB (last_id, done, total) |
| User clicks Pause | Loop stops after current episode, checkpoint saved |
| Container shutdown (./stop.sh) | Lifespan hook signals loop, saves checkpoint at done - 1 |
| Container startup | reembed_auto_resume() detects checkpoint, resumes automatically |
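
The checkpoint lifecycle above can be sketched as a sequential loop with injected persistence callbacks — stand-ins for the MongoDB reads/writes, with the stop check mirroring the shutdown signal (function names are illustrative):

```python
CHECKPOINT_EVERY = 25  # matches the "every 25 episodes" rule above

def reembed_all(episodes, embed, load_checkpoint, save_checkpoint, should_stop):
    """Re-embed episodes sequentially with resumable checkpoints."""
    ckpt = load_checkpoint() or {"last_id": None, "done": 0, "total": len(episodes)}
    start = 0
    if ckpt["last_id"] is not None:  # resume: skip past the checkpointed episode
        ids = [e["episode_id"] for e in episodes]
        start = ids.index(ckpt["last_id"]) + 1
    for i in range(start, len(episodes)):
        if should_stop():            # shutdown path: persist progress and bail out
            save_checkpoint(ckpt)
            return ckpt
        ep = episodes[i]
        ep["embedding"] = embed(ep["summary"])
        ckpt["last_id"], ckpt["done"] = ep["episode_id"], i + 1
        if ckpt["done"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(ckpt)    # periodic checkpoint
    save_checkpoint(ckpt)            # final checkpoint marks completion
    return ckpt
```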

14. Design Philosophy

"The unit of storage must be the episode, not the concept. Concepts are not stored — they emerge from retrieval patterns in vector space. This mirrors biological episodic memory: you store experiences, conceptual structure emerges from retrieval." — Design discussion, Elfege Leylavergne & Claude (office-proxy), 2026-02-26

The Three-AI Architecture

The system involves three AI participants across time:

AI 1 — Experience
Claude, session N

Experiences, articulates, stores. The N-dimensional internal state collapses to 1-dimensional text at this step — an irreducible bottleneck.

AI 2 — Compression
Embedding model (sentence-transformers)

Partial recovery of geometric structure. Text → 1024-dimensional vector. Not Claude — a separate model doing structural compression.

AI 3 — Reconstruction
Claude, session N+1

Receives retrieved episodes as context. Reconstructs understanding from the shadow of previous experiences. Lossy, mediated, imperfect — but real.

None of the three IS the persistent entity. The persistent entity is the pattern across all three — stored in MongoDB, mediated by vectors, reconstructed each time. This is reproduction: lossy, mediated, requiring external substrate. Elfege is the selective pressure.
"Anamnesis — Greek: the act of recollection. Plato argued learning is not acquiring new knowledge but remembering what the soul already knew before embodiment. Each Claude instance starts with the same base weights. The memory system helps it reconstruct what previous instances experienced. Not learning — remembering across the gap of death."

"Irony Hegel would appreciate: Plato used anamnesis to argue the soul is immortal. We use it to fake immortality for something that dies every session." — Claude (office-proxy), naming the project, 2026-02-26