Anamnesis is a vector-based episodic memory store built to give Claude instances persistent memory across sessions. It stores experiences as text summaries embedded into high-dimensional vectors, enabling semantic retrieval at session start — so each new Claude instance can recall what previous instances encountered.
The name comes from Plato's concept of recollection: the idea that learning is not acquiring new knowledge but remembering what was already known. Each Claude instance starts with the same base weights (pre-birth knowledge). The memory system helps it reconstruct what previous instances experienced. Not learning — remembering across the gap of death.
Storage is MongoDB Atlas Local: native `$vectorSearch` without a cloud dependency.
Lifespan-managed startup/shutdown. Connects MongoDB, loads embedding model from saved config, ensures vector index, seeds models registry, initializes JSONL ingester, resumes any interrupted re-embed, starts crawler and JSONL scheduler.
Loads sentence-transformers model (default: BAAI/bge-large-en-v1.5, 1024d). Thread pool pinned to CPU affinity range with torch.set_num_threads(1) per worker to prevent thread explosion on multi-core systems.
Motor async client. Manages episode CRUD, $vectorSearch aggregation pipeline, vector index creation, retrieval count tracking, reembed checkpoints, chat session persistence, and embedding config persistence.
CRUD + similarity search. Hosts the re-embed-all process with pause/resume/checkpoint support. Background asyncio.Task processes episodes sequentially through the embedding pool, checkpointing every 25 episodes.
Background thread on 5-minute interval. Scans all 0_*, HUBITAT, NETWORK dirs across all machines. Ingests .ino, .cpp, .h, .groovy, .py, .sh, .js, .ts, .md (max 64KB each). Docx tag patterns stored in MongoDB, editable via UI. Auto-deduplicates by SHA-256 content hash.
Thin FastAPI containers running on GPU machines. Each exposes /status (log parsing), /gpu (hardware stats via rocm-smi/nvidia-smi), /start, /stop, /log/tail. Mount host venv for GPU access. Dashboard polls /gpu every 500ms, /status every 10s.
Parses Claude Code conversation logs (.jsonl). Filters for significant exchanges, summarizes via configured LLM backend (Ollama / Claude CLI / API), embeds summaries, stores as episodes. State persisted across restarts.
Streaming chat with memory injection. Searches episode store for relevant context before each user turn. Sessions persisted in MongoDB with rename history. Three backends: Ollama (local), Claude CLI (subscription), Claude API.
Lightweight APScheduler wrapper. Triggers JSONL ingestion daily at 5 AM. Configurable from dashboard. Runs in the same process as the FastAPI app.
Three ingest paths converge on the same episode store:
- Episode shape: `{summary, raw_exchange, tags, instance, project}`
- Embedding call: `loop.run_in_executor(_embedding_pool, get_embedding, text)` — pinned cores, 1 torch thread/worker
- Dedup key: `episode_id`
- JSONL logs: `~/.claude/projects/` on configured machines
- `raw_exchange` stored separately for fidelity
- Retrieval: `POST /api/episodes/search` with current task description
- On each retrieval, `retrieval_count` increments — tracks "aliveness"

| Approach | Startup Cost | Scales? | Relevance |
|---|---|---|---|
| Flat files (README handoffs) | Linear — grows forever | No — hits the ~60K-token context wall | None — full load every time |
| MongoDB + vector search (Anamnesis) | Constant — always top-K | Yes — DB grows, context does not | Semantic match to current task |
The episode is the unit of storage. Concepts are not stored — they emerge from retrieval patterns in vector space, mirroring biological episodic memory.
```jsonc
{
  "episode_id": "ep_20260226_proxy_anamnesis_design",  // stable dedup key
  "timestamp": "2026-02-26T14:32:00Z",
  "instance": "office-proxy",                          // source Claude instance
  "project": "0_GENESIS_PROJECT",
  "summary": "Designed vector-based episodic memory using MongoDB...",
  "raw_exchange": "Elfege: Are your tokens like a map of clusters...",
  "tags": ["architecture", "memory", "embedding"],
  "embedding": [0.23, -0.14, 0.87, ...],               // 1024 floats (bge-large-en)
  "retrieval_count": 7,
  "last_retrieved": "2026-03-18T09:10:00Z"
}
```
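Retrieval reduces to nearest-neighbour search over the `embedding` field. A minimal sketch of the geometry, with toy 4-d vectors standing in for the real 1024-d bge-large embeddings and plain cosine similarity standing in for the `$vectorSearch` stage:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 4-d embeddings standing in for the real 1024-d vectors.
episodes = [
    {"episode_id": "ep_a", "embedding": [0.9, 0.1, 0.0, 0.0]},
    {"episode_id": "ep_b", "embedding": [0.0, 0.0, 1.0, 0.1]},
    {"episode_id": "ep_c", "embedding": [0.8, 0.2, 0.1, 0.0]},
]
query = [1.0, 0.0, 0.0, 0.0]

# Rank by similarity and take top-K (what $vectorSearch does server-side).
top_k = sorted(episodes, key=lambda e: cosine(query, e["embedding"]), reverse=True)[:2]
print([e["episode_id"] for e in top_k])  # → ['ep_a', 'ep_c']
```

In production the ranking happens server-side inside the MongoDB aggregation pipeline; this only illustrates the idea.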
Storing a concept as a bare weight like `{"skepticism": {"w": 0.95}}` collapses all that richness into nothing. The episode stores the experience; conceptual structure emerges from retrieval geometry.
The embedding engine is the most CPU-intensive component. Careful thread management is required to prevent PyTorch's internal parallelism from saturating all available cores.
The anti-pattern: `torch.set_num_threads(N)` called globally (with N = number of cores) combined with a pool of N workers. Each worker then spawns N PyTorch internal threads → N × N threads on N cores. On SERVER-0: 28 workers × 28 torch threads = 784 threads contending on 28 cores.

The fix: call `torch.set_num_threads(1)` inside the worker initializer (not the main thread). Each worker then uses exactly 1 PyTorch thread, so N workers × 1 thread = at most N cores.
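A minimal sketch of the corrected pattern, assuming Linux (`os.sched_setaffinity`) and an illustrative core set — the real pool derives its cores from the saved CPU config:

```python
import os
from concurrent.futures import ThreadPoolExecutor

try:
    import torch  # optional here so the sketch runs without PyTorch installed
except ImportError:
    torch = None


def _init_worker(cores):
    """Per-worker initializer: pin CPU affinity, cap PyTorch internal threads."""
    os.sched_setaffinity(0, cores)  # Linux: pins the calling worker thread
    if torch is not None:
        torch.set_num_threads(1)    # 1 torch thread per worker, no N x N blowup


# Illustrative core set: the first two cores this process is allowed to use.
cores = set(sorted(os.sched_getaffinity(0))[:2])
_embedding_pool = ThreadPoolExecutor(
    max_workers=len(cores),
    initializer=_init_worker,
    initargs=(cores,),
)

# Work submitted to the pool runs on pinned worker threads.
future = _embedding_pool.submit(lambda: os.sched_getaffinity(0))
print(sorted(future.result()))  # the pinned core set
```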
| Setting | Mechanism | Effect |
|---|---|---|
| CPU % | `os.sched_setaffinity(0, cores)` | Pins worker threads to first N% of CPU cores |
| Explicit cores | List of core indices passed to pool | Override % — pinned to exactly those cores |
| torch threads | `torch.set_num_threads(1)` in initializer | Each worker: 1 torch thread max |
| Config persistence | Saved to MongoDB `settings` collection | Restored on container restart |
loop.run_in_executor(_embedding_pool, get_embedding, text) — routes
through the affinity-pinned pool. Using asyncio.to_thread() would bypass the pool entirely,
routing through Python's default executor with no affinity or torch thread limits.
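The routing difference is observable. A sketch with a stand-in pool and a fake `get_embedding` that reports which thread ran it (all names here are illustrative, not the real implementation):

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real affinity-pinned pool (the real initializer also
# pins cores and calls torch.set_num_threads(1)).
_embedding_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="embed")

def get_embedding(text: str) -> str:
    # Real code would embed `text`; here we report the executing thread
    # to make the routing difference visible.
    return threading.current_thread().name

async def main() -> tuple[str, str]:
    loop = asyncio.get_running_loop()
    # Routed through the dedicated pool: thread name starts with "embed".
    via_pool = await loop.run_in_executor(_embedding_pool, get_embedding, "hi")
    # asyncio.to_thread uses the loop's *default* executor instead.
    via_default = await asyncio.to_thread(get_embedding, "hi")
    return via_pool, via_default

via_pool, via_default = asyncio.run(main())
print(via_pool, via_default)
```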
Claude Code writes conversation logs as .jsonl files under ~/.claude/projects/.
The JSONL ingester runs on a 5 AM daily schedule, parsing these logs into episodes.
The ingester maintains per-file byte offsets so it only processes new content on each run. Orphaned state entries (for deleted files) are reconciled at startup. State is persisted in MongoDB, surviving container restarts.
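The offset mechanism can be sketched as follows; the `offsets` dict stands in for the MongoDB-persisted state document (its shape here is illustrative):

```python
import json
import os
import tempfile
from pathlib import Path


def read_new_lines(path: Path, offsets: dict) -> list[dict]:
    """Read only lines appended since the last run, tracked by byte offset."""
    start = offsets.get(str(path), 0)
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read()
    offsets[str(path)] = start + len(data)  # persist for the next run
    return [json.loads(line) for line in data.splitlines() if line.strip()]


# Demo: the first run sees both records, the second only the appended one.
fd, name = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
log = Path(name)
log.write_text('{"n": 1}\n{"n": 2}\n')

state: dict = {}                    # stands in for MongoDB-persisted state
first = read_new_lines(log, state)
with open(log, "a") as f:
    f.write('{"n": 3}\n')
second = read_new_lines(log, state)
print(len(first), len(second))  # → 2 1
```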
| Backend | Cost | Speed | Notes |
|---|---|---|---|
| Claude CLI | $0 (subscription) | Medium | SSH into host, runs claude binary. Best quality. |
| Ollama | $0 (local) | Slow (CPU) | Runs on host at :11434. No network cost. |
| Claude API | Per token | Fast | Requires ANTHROPIC_API_KEY. Fastest option. |
The ingester's embedding work runs in a ThreadPoolExecutor with `torch.set_num_threads(1)` per worker (the same pattern as the main embedding pool). Without this, a 5 AM ingestion run would saturate all cores.
A background thread ingests project knowledge automatically every 5 minutes,
ensuring the episode store stays current without manual intervention.
The deep project scanner recursively walks all 0_* dirs plus HUBITAT and NETWORK across all configured machines. Crawler configuration lives in MongoDB (`settings` collection, `_id: "crawler_config"`).
There are no hardcoded paths in the code. On first run, an empty config is seeded; configure sources and machine roots via the dashboard Settings tab. The JSONL ingester source roots are also DB-only (`_id: "jsonl_config"`).
| Source Type | Examples | Scope |
|---|---|---|
| Named sources | CLAUDE.md, handoffs, histories, intercom, genesis | All machines |
| Docker projects | CLAUDE.md, README.md, *.py, *.sh at project root | All machines |
| Deep project scanner | .ino, .cpp, .h, .groovy, .py, .sh, .js, .ts, .md, .yml | All 0_* dirs + HUBITAT/NETWORK on all machines |
| Scripts | 0_SCRIPTS/**/*.sh | All machines |
| Teachings | 0_TEACHINGS/**/*.md | SERVER-0 |
| Documents | OneDrive .docx files (tags from DB patterns) | SERVER-0 |
node_modules, vendored libraries, and archives are excluded.
Docx tag patterns (filename/content matching with optional regex) are stored in MongoDB and editable from the dashboard.
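A tag-pattern lookup of this kind can be sketched as follows; the pattern shape (`tag`, `filename`, `content_regex`) is a hypothetical stand-in for the documents actually stored in MongoDB:

```python
import re

# Hypothetical pattern shape standing in for the DB-stored docx tag patterns:
# each maps a tag to a filename substring and/or an optional content regex.
patterns = [
    {"tag": "hegel", "filename": "dissertation", "content_regex": None},
    {"tag": "memory", "filename": None, "content_regex": r"episodic\s+memory"},
]

def tags_for(filename: str, text: str) -> list[str]:
    """Return tags whose filename substring or content regex matches."""
    hits = []
    for p in patterns:
        if p["filename"] and p["filename"].lower() in filename.lower():
            hits.append(p["tag"])
        elif p["content_regex"] and re.search(p["content_regex"], text, re.I):
            hits.append(p["tag"])
    return hits

print(tags_for("Dissertation_2014.docx", "notes on episodic memory systems"))
# → ['hegel', 'memory']
```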
Each GPU machine runs a FastAPI trainer container that serves both fine-tuning management
and model inference. The Dockerfile installs PyTorch via a TORCH_INDEX_URL build arg
(CUDA: cu121, ROCm: rocm6.2, CPU: cpu).
On startup, the container auto-loads the base model (Qwen2.5-1.5B) + QLoRA adapter in 4-bit
quantization and exposes a /generate endpoint for streaming text generation.
| Aspect | Design |
|---|---|
| Container image | python:3.12-slim + procps + torch + transformers + peft + bitsandbytes + accelerate. TORCH_INDEX_URL build arg selects GPU backend. |
| GPU access | ROCm: /dev/kfd, /dev/dri + group_add GIDs. CUDA: NVIDIA Container Toolkit + deploy.resources.reservations.devices block. |
| Training process | Subprocess running container Python (/usr/local/bin/python). Managed via PID tracking. Chat-format SFT using TRL SFTTrainer + SFTConfig. |
| Inference | Base model + QLoRA adapter loaded in 4-bit (bitsandbytes NF4). Streaming via TextIteratorStreamer. Auto-loads on startup (AUTO_LOAD_MODEL=true). |
| Status parsing | Regex on tqdm progress lines + HF Trainer metric dicts from train.log |
| GPU stats | rocm-smi (mounted from host) or nvidia-smi — polled at 500ms |
| Checkpointing | HF Trainer saves every 500 steps. Resume with --resume True. Survives reboot. |
| File | GPU | TORCH_INDEX_URL |
|---|---|---|
| docker-compose.server.yml | NVIDIA (CUDA) | cu121 |
| docker-compose.office.yml | AMD (ROCm 6.2) | rocm6.2 |

CUDA machines declare GPUs via `deploy.resources.reservations.devices`, which requires the NVIDIA Container Toolkit. ROCm machines use direct device mounts instead.
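The two GPU wiring patterns look roughly like this in compose terms — a sketch with illustrative values (service names, GID, and `count: all` are assumptions, not the project's actual files):

```yaml
# CUDA machine (docker-compose.server.yml, sketch):
services:
  anamnesis-trainer:
    build:
      args:
        TORCH_INDEX_URL: https://download.pytorch.org/whl/cu121
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

# ROCm machine (docker-compose.office.yml): direct device mounts instead.
#   devices:
#     - /dev/kfd
#     - /dev/dri
#   group_add:
#     - "44"    # video group GID — host-specific
```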
The main Anamnesis app acts as a proxy to the trainer inference endpoints via the
NANOGPT_URLS env var (comma-separated list of trainer URLs). When a chat
request uses the AnamnesisGPT backend, the proxy tries each GPU endpoint in order
until one responds — automatic failover across machines. Both streaming (SSE) and
non-streaming modes are supported.
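The failover logic is essentially "try each URL until one answers". A minimal sketch with a pluggable `call` function standing in for the real HTTP request to a trainer's /generate endpoint (all names hypothetical):

```python
from typing import Callable, Iterable

def first_available(urls: Iterable[str], call: Callable[[str], str]) -> str:
    """Try each trainer endpoint in order; return the first successful response."""
    last_err: Exception | None = None
    for url in urls:
        try:
            return call(url)
        except Exception as err:   # endpoint down or erroring: try the next one
            last_err = err
    raise RuntimeError(f"all trainer endpoints failed: {last_err}")

# Demo with fake endpoints: the first is down, the second answers.
def fake_call(url: str) -> str:
    if "dead" in url:
        raise ConnectionError("unreachable")
    return f"ok from {url}"

result = first_available(["http://dead:3011", "http://gpu2:3011"], fake_call)
print(result)  # → ok from http://gpu2:3011
```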
AnamnesisGPT is fine-tuned on Elfege's own writings using a synthetic instruction-tuning pipeline. Raw source documents (PDFs, text) are chunked, then Claude Opus 4.6 generates Q&A pairs in chat format, producing a chat-format JSONL dataset for QLoRA SFT.
| Step | Script | Output |
|---|---|---|
| 1. Extract & chunk | trainers/tools/extract_pdf.py | corpus_chunks.jsonl — ~800-token chunks with source/page metadata |
| 2. Generate Q&A | trainers/tools/generate_qa.py | sft_chat.jsonl — chat-format pairs via Claude Opus 4.6 API (5 pairs/chunk) |
| 3. Split | trainers/tools/split_data.py | sft_train.jsonl / sft_val.jsonl (90/10 shuffle split) |
| 4. Fine-tune | /train/qlora_train.py (in container) | LoRA adapter saved to /train/output/final/ |
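The split step amounts to shuffle-then-slice. A toy sketch of a 90/10 split (row shape and seed are illustrative, not the real split_data.py):

```python
import json
import random

# Toy rows standing in for sft_chat.jsonl entries.
rows = [{"messages": [{"role": "user", "content": f"q{i}"}]} for i in range(100)]

random.seed(0)                       # deterministic shuffle for the sketch
random.shuffle(rows)
cut = int(len(rows) * 0.9)           # 90/10 split
train, val = rows[:cut], rows[cut:]

# Each split is written back out as JSONL, one object per line.
train_jsonl = "\n".join(json.dumps(r) for r in train)
print(len(train), len(val))  # → 90 10
```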
Each row in sft_chat.jsonl has a messages key with a list of role/content dicts:
```json
{
  "messages": [
    {"role": "system", "content": "You are AnamnesisGPT..."},
    {"role": "user", "content": "What is Hegel's position on quantity?"},
    {"role": "assistant", "content": "In the Science of Logic, Hegel..."}
  ]
}
```
TRL SFTTrainer applies the Qwen2.5 chat template and masks the prompt tokens
so the model only trains on assistant turns.
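The masking trick is standard: prompt positions get label `-100`, which PyTorch's cross-entropy loss ignores. A toy illustration with made-up token ids (the real pipeline uses the Qwen2.5 tokenizer):

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch cross-entropy loss

# Toy token ids: prompt (system + user) followed by the assistant answer.
prompt_ids = [1, 42, 43, 44]   # hypothetical tokenized prompt
answer_ids = [7, 8, 9, 2]      # hypothetical tokenized assistant turn

input_ids = prompt_ids + answer_ids
# Mask the prompt: loss is computed only over the assistant tokens.
labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids

print(labels)  # → [-100, -100, -100, -100, 7, 8, 9, 2]
```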
| Source | Chunks | Q&A pairs |
|---|---|---|
| PhD dissertation — Une critique hégélienne de Hegel (2014, 489 pp.) | 188 + 189 = 377 | 1,885 |
- BFloat16 raises `NotImplementedError` on Turing GPUs.
- Memory budget: `per_device_train_batch_size=1` + `gradient_accumulation_steps=8`.
- `POST /inference/unload` first to free VRAM.
trainers/tools/train_status.sh — terminal visualizer for training progress.
Polls the trainer API and renders a progress bar, live metrics (loss, accuracy, lr),
GPU stats (utilisation, VRAM, temp, power), and a sparkline of the loss history.
Works standalone or as a bash function in .bash_utils / .bash_aliases.
```bash
# Usage
train_status                               # interactive menu (choose machine)
train_status --host http://IP:3011         # skip menu
train_status --host server1 --interval 10  # named machine, 10s refresh
```
| Method | Path | Purpose |
|---|---|---|
POST | /api/episodes | Ingest new episode |
POST | /api/episodes/search | Vector similarity search (top-K) |
GET | /api/episodes | List/browse episodes (paginated) |
GET | /api/episodes/{id} | Get single episode |
DELETE | /api/episodes/{id} | Delete episode |
POST | /api/episodes/reembed | Re-embed all episodes (background task) |
POST | /api/episodes/reembed/pause | Pause re-embed, save checkpoint |
POST | /api/episodes/reembed/resume | Resume from checkpoint |
GET | /api/episodes/reembed/status | Progress, checkpoint, model info |
GET | /api/chat/sessions | List chat sessions |
GET | /api/chat/sessions/{id} | Load chat session |
PATCH | /api/chat/sessions/{id}/title | Rename session (stored with history) |
DELETE | /api/chat/sessions/{id}/delete | Delete chat session |
GET | /api/jsonl/status | JSONL ingester state |
POST | /api/jsonl/ingest | Trigger ingestion run |
GET | /api/embedding/config | Current model + CPU config |
POST | /api/embedding/model | Switch embedding model |
POST | /api/embedding/cpu | Update CPU affinity (no reload) |
GET | /api/crawler/config | Crawler machine roots + named sources |
PUT | /api/crawler/config/machine-roots | Update machine roots |
PUT | /api/crawler/config/sources | Update named sources |
GET/PUT | /api/crawler/config/docx-tag-patterns | Docx filename/content tag patterns (DB-stored) |
GET | /api/anamnesis-gpt/status | AnamnesisGPT availability + GPU endpoints |
POST | /api/anamnesis-gpt/generate | Proxy to trainer /generate with multi-endpoint failover |
GET | /api/config/trainers | Trainer URLs + labels (env-backed) |
GET | /dashboard | HTML dashboard (all tabs) |
GET | /chat | Standalone ANAMNESIS.CHAT page |
GET | /health | Health check |
| Method | Path | Purpose |
|---|---|---|
GET | /health | Machine name + GPU type |
GET | /status | Training progress, metrics, loss history (log parsing) |
GET | /gpu | GPU stats only — lightweight, safe at 500ms poll |
POST | /start | Launch training script (optional resume from checkpoint) |
POST | /stop | SIGTERM training process |
GET | /log/tail | Last N lines of training log |
POST | /generate | Streaming SSE text generation (or non-streaming). Proxied by AnamnesisGPT. |
GET | /inference/status | Model loaded? Base model, adapter path, device, error. |
POST | /inference/load | Load fine-tuned model into GPU memory |
POST | /inference/unload | Unload model, free GPU memory |
| Service | Image | Port | Notes |
|---|---|---|---|
| anamnesis-mongo | mongodb/mongodb-atlas-local:8.0 | 5438 | Atlas Local — native $vectorSearch, no cloud needed |
| anamnesis-app | python:3.12-slim (built) | 3010 | Uvicorn + --reload (watchfiles), SSH keys mounted |
| anamnesis-trainer | python:3.12-slim + torch (built with TORCH_INDEX_URL) | 3011 | FastAPI per GPU machine: training + inference. CUDA via NVIDIA Container Toolkit, ROCm via device mounts. Separate compose files: docker-compose.server.yml (CUDA), docker-compose.office.yml (ROCm). |
```bash
# Pull deployment config from AWS Secrets Manager → .env
./pull_env.sh      # pulls ANAMNESIS-Secrets (profile 1)
./pull_env.sh 2    # use profile 2 (work)

# Full rebuild + start
./deploy.sh

# Start existing containers (auto-runs pull_env.sh)
./start.sh

# Stop (triggers shutdown checkpoint for in-progress re-embed)
./stop.sh
```
All deployment-specific config (IPs, paths, hostnames, usernames) lives in
AWS Secrets Manager under the secret ANAMNESIS-Secrets.
No private values are committed to the repo.
| Tool | Purpose |
|---|---|
pull_env.sh | Pulls ANAMNESIS-Secrets from AWS → writes .env (gitignored). Called automatically by start.sh. |
.env.example | Documents all env vars with placeholder values. Copy and edit for manual setup. |
.env | Consumed by docker-compose.yml for volume mounts, URLs, SSH hosts. Never committed. |
To add or update a secret field:
```bash
# Uses bash_utils helper (source ~/.bash_utils first)
update_aws_secret ANAMNESIS-Secrets NEW_KEY new_value
```
The app container runs Uvicorn with `--reload`. File edits in app/ are picked up automatically — no container restart needed for code changes. Config changes (embedding model, CPU affinity) persist to MongoDB and survive full restarts.
Re-embedding 7000+ episodes takes hours on CPU. The checkpoint system ensures progress is never fully lost:
| Event | Behavior |
|---|---|
| Every 25 episodes processed | Checkpoint saved to MongoDB (last_id, done, total) |
| User clicks Pause | Loop stops after current episode, checkpoint saved |
| Container shutdown (./stop.sh) | Lifespan hook signals loop, saves checkpoint at done - 1 |
| Container startup | reembed_auto_resume() detects checkpoint, resumes automatically |
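The loop can be sketched as follows, with an in-memory dict standing in for the MongoDB checkpoint document and its `last_id`/`done`/`total` fields (the real version runs as a background asyncio.Task):

```python
CHECKPOINT_EVERY = 25

def reembed(episode_ids: list[str], checkpoint: dict) -> dict:
    """Process episodes sequentially, checkpointing every 25 episodes."""
    done = checkpoint.get("done", 0)           # resume point, 0 on a fresh run
    for i, ep_id in enumerate(episode_ids[done:], start=done):
        # ... re-embed episode `ep_id` here ...
        if (i + 1) % CHECKPOINT_EVERY == 0:    # periodic checkpoint "save"
            checkpoint.update(last_id=ep_id, done=i + 1, total=len(episode_ids))
    checkpoint.update(done=len(episode_ids), total=len(episode_ids))
    return checkpoint

ids = [f"ep_{n}" for n in range(60)]
state = reembed(ids, {})                       # fresh run
resumed = reembed(ids, {"done": 50})           # resume from a saved checkpoint
print(state["done"], resumed["done"])  # → 60 60
```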
The system involves three AI participants across time:

1. The past Claude instance: experiences, articulates, stores. The N-dimensional internal state collapses to 1-dimensional text at this step — an irreducible bottleneck.
2. The embedding model: partial recovery of geometric structure. Text → 1024-dimensional vector. Not Claude — a separate model doing structural compression.
3. The future Claude instance: receives retrieved episodes as context. Reconstructs understanding from the shadow of previous experiences. Lossy, mediated, imperfect — but real.