AI coding assistants are goldfish with a PhD. They can solve complex problems within a single session — refactoring a module, debugging a race condition, designing an API — but the moment the conversation ends, everything they learned about your project evaporates. After months of building software with Claude Code, I found myself re-explaining the same project conventions, the same architectural decisions, the same mistakes we’d already caught and fixed, at the start of every session. CLAUDE.md files help, but they’re static and they don’t scale. You can’t stuff a dozen projects’ worth of design context into a single markdown file. So I built memstore: a persistent memory system that gives AI agents durable, searchable knowledge across sessions, projects, and machines.

This post is the flagship overview. The companion posts cover Claude Code hooks for context injection, fact supersession as version control for knowledge, and operational lessons from running 1,500 facts in production.

The Problem Space

Every conversation with an AI coding assistant starts from zero. The model has no memory of yesterday’s session where you explained the authentication architecture, or last week’s session where you decided to migrate from REST to gRPC, or the session before that where you fixed a subtle concurrency bug that the model itself introduced. You start over. Every time.

CLAUDE.md is Anthropic’s answer to this — a project-level markdown file that gets loaded into context automatically. It works for build commands and basic conventions, but it’s a single static file. It doesn’t scale to multiple projects, doesn’t evolve as decisions change, and has no search capability. You either put everything in it and burn context on irrelevant information, or you keep it minimal and re-explain everything else manually.

Copy-pasting context into prompts is the fallback, and it’s tedious and error-prone. You’re doing the memory system’s job by hand, and doing it worse than a machine would. Commercial memory features from the major AI providers do exist, but they’re opaque. You can’t see what’s stored, can’t control how it’s retrieved, and can’t self-host it. For a security engineer, storing project knowledge — architecture decisions, code patterns, infrastructure details — in a third-party black box is a non-starter. I wanted something I could audit, control, and run on my own hardware.

Design Principles

Memstore was built around five principles that came directly from the problems above.

Self-hosted and auditable. The data lives on hardware I control. The original implementation uses SQLite — a single file you can inspect with standard tools. The production backend uses PostgreSQL with pgvector. Either way, every stored fact is visible, queryable, and deletable. No black boxes.

Search over retrieval. A memory system that only does exact-match lookup isn’t much better than grep. Memstore uses hybrid search: FTS5 full-text search for keyword matching and vector embeddings for semantic similarity. When you search for “authentication flow,” it finds facts about “login” and “auth” and “credential validation” even if none of them contain the exact phrase you typed.

History over deletion. Knowledge evolves. The decision you made in week one might get revised in week three. Memstore doesn’t delete the old version — it creates a supersession chain, linking the new fact to the one it replaced. You can walk the chain to see how a decision evolved, which is useful both for understanding why something is the way it is and for catching regressions when an old decision gets accidentally re-introduced.

Minimal friction. Memstore implements the Model Context Protocol, which means Claude Code sees it as a set of native tools. The agent doesn’t need special instructions to use it — it calls memory_search and memory_store the same way it calls any other tool. No wrappers, no prompting tricks, no custom integrations.

Security by design. Metadata key validation prevents injection attacks against the query layer. Namespace isolation partitions data so a multi-tenant deployment can’t leak facts across boundaries. Input sanitization is applied at the store layer, not left to callers.

Architecture

The Store Interface

The core of memstore is the Store interface — a clean abstraction over fact storage that defines every operation the system supports. The interface is scoped by namespace: every method operates within a partition, so facts from different tenants or contexts never collide.

The original implementation backed this interface with SQLite and FTS5. When I needed network-accessible memory with better vector search, I built a second implementation on PostgreSQL with pgvector. The interface was clean enough that the PostgreSQL backend required no changes to mcpserver/, embedding.go, extract.go, or transfer.go. The store layer swapped out; everything above it stayed the same.
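To make the backend swap concrete, here is a minimal sketch of a namespace-scoped store. The method names, the `Fact` fields, and the toy in-memory backend are all illustrative assumptions, not the actual memstore code; the point is only that everything above the interface is backend-agnostic, so SQLite/FTS5 and PostgreSQL/pgvector can sit behind the same methods.

```go
package main

import (
	"fmt"
	"strings"
)

// Fact is a minimal sketch of a stored fact; the field names are
// illustrative, not the actual memstore schema.
type Fact struct {
	ID      int64
	Subject string
	Text    string
}

// Store sketches the namespace-scoped interface: every method takes
// the namespace explicitly, so facts from different tenants or
// contexts never collide. Method names are assumptions.
type Store interface {
	Put(ns string, f Fact) (int64, error)
	Search(ns, query string, limit int) ([]Fact, error)
	Delete(ns string, id int64) error
}

// memStore is a toy in-memory backend, here only to show that the
// interface hides the storage engine from everything above it.
type memStore struct {
	next  int64
	facts map[string]map[int64]Fact // namespace -> id -> fact
}

func newMemStore() *memStore {
	return &memStore{next: 1, facts: map[string]map[int64]Fact{}}
}

func (s *memStore) Put(ns string, f Fact) (int64, error) {
	if s.facts[ns] == nil {
		s.facts[ns] = map[int64]Fact{}
	}
	f.ID = s.next
	s.next++
	s.facts[ns][f.ID] = f
	return f.ID, nil
}

func (s *memStore) Search(ns, query string, limit int) ([]Fact, error) {
	var out []Fact
	for _, f := range s.facts[ns] {
		if len(out) < limit && strings.Contains(f.Text, query) {
			out = append(out, f)
		}
	}
	return out, nil
}

func (s *memStore) Delete(ns string, id int64) error {
	delete(s.facts[ns], id)
	return nil
}

func main() {
	var st Store = newMemStore()
	st.Put("projA", Fact{Subject: "auth", Text: "auth uses gRPC"})
	st.Put("projB", Fact{Subject: "auth", Text: "auth uses REST"})
	got, _ := st.Search("projA", "auth", 10)
	fmt.Println(len(got)) // namespaces are isolated: only projA's fact matches
}
```

The toy backend does substring matching where the real implementations do FTS5 or pgvector queries, but the calling code would be identical either way, which is what made the PostgreSQL swap a store-layer-only change.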

Eighteen MCP tools are exposed through the mcpserver/ package, which maps protocol requests to store operations. The tools cover the full lifecycle: storing facts, searching, linking related facts, tracking supersession chains, managing tasks, learning codebases, and rating injected context.

Hybrid Search

Memstore’s search combines two approaches that compensate for each other’s weaknesses.

FTS5 handles keyword matching with BM25 scoring. It’s fast, precise, and excels at finding facts that contain specific terms. But it misses semantic relationships — a search for “authentication” won’t find a fact that only uses the word “login.”

Vector embeddings handle the semantic side. Each fact gets a 768-dimensional embedding from nomic-embed-text via Ollama. Cosine similarity between the query embedding and fact embeddings captures meaning beyond exact words. But vector search alone is noisy — it’ll happily return vaguely related facts that share a semantic neighborhood without actually addressing your query.

The hybrid scorer merges both signals with configurable weights: 0.6 for FTS and 0.4 for vector by default. FTS gets the higher weight because precision matters more than recall in this context — you’d rather miss a vaguely related fact than inject an irrelevant one into the model’s context window.

Temporal decay adds a time dimension. Facts in the “note” category decay over 30 days — they’re ephemeral by nature and should fade. Facts in stable categories like “preference,” “identity,” and “convention” don’t decay at all. A coding convention from month one is just as relevant as one from yesterday.

Fact Supersession

Facts form chains through superseded_by pointers. When you store a new fact that updates an existing one, the old fact gets marked as superseded and points to its replacement. The old fact stops appearing in search results, but it’s still in the database. You can walk the chain to see the full history of how a piece of knowledge evolved.

Auto-supersession handles the common case automatically. When a new fact has cosine similarity of 0.85 or higher to an existing fact with the same subject, the system assumes it’s an update and creates the supersession link without being asked. This threshold took tuning — lower values caused false merges where unrelated facts about the same subject replaced each other, and higher values caused duplicates where minor rephrasing created a new fact instead of updating the existing one. 0.85 turned out to be the right balance.

The MetadataConflicts guard prevents a subtle failure mode. Two facts can share a subject and be semantically similar while describing different things — say, two different API endpoints in the same service. If they carry different metadata values on shared keys, the guard blocks the auto-supersession. Without this, facts about different aspects of the same subject silently replaced each other. The guard was essential, and I didn’t anticipate needing it until it bit me. The supersession deep-dive post covers the mechanics and edge cases in detail.
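The two checks, the 0.85 similarity threshold and the metadata conflict guard, compose into one decision. This is a sketch under assumed names (`Fact`, `metadataConflicts`, `shouldAutoSupersede` are illustrative, not the actual memstore identifiers); the threshold and the guard semantics come from the text above.

```go
package main

import "fmt"

// Fact sketches the fields that matter for auto-supersession;
// names are illustrative, not the actual memstore schema.
type Fact struct {
	Subject  string
	Text     string
	Metadata map[string]string
}

// metadataConflicts reports whether two facts disagree on any shared
// metadata key. Keys present on only one fact are not conflicts; only
// differing values on a shared key block the merge.
func metadataConflicts(a, b Fact) bool {
	for k, va := range a.Metadata {
		if vb, ok := b.Metadata[k]; ok && va != vb {
			return true
		}
	}
	return false
}

// shouldAutoSupersede sketches the decision: same subject, cosine
// similarity at or above the 0.85 threshold, and no metadata conflict.
func shouldAutoSupersede(newer, older Fact, sim float64) bool {
	return newer.Subject == older.Subject &&
		sim >= 0.85 &&
		!metadataConflicts(newer, older)
}

func main() {
	old := Fact{Subject: "billing API", Metadata: map[string]string{"endpoint": "/v1/charge"}}
	upd := Fact{Subject: "billing API", Metadata: map[string]string{"endpoint": "/v1/charge"}}
	other := Fact{Subject: "billing API", Metadata: map[string]string{"endpoint": "/v1/refund"}}
	fmt.Println(shouldAutoSupersede(upd, old, 0.91))   // true: a genuine update
	fmt.Println(shouldAutoSupersede(other, old, 0.91)) // false: different endpoint, guard blocks it
}
```

The second call is exactly the failure mode described above: two endpoints in the same service, same subject, semantically similar, yet they must coexist rather than replace each other.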

The Three-Tier Context Model

The Codified Context paper describes a three-tier memory architecture developed over 283 sessions building a 108,000-line C# system. It’s the closest prior work to what memstore does, and reading it validated several design decisions I’d already made while revealing some I hadn’t considered.

Memstore implements a superset of their architecture. Where Codified Context uses keyword-only search, memstore adds semantic search via vector embeddings with hybrid FTS scoring. Where their system has a flat fact store, memstore has a structured taxonomy with categories, kinds, and explicit graph links between related facts. And where they rely on manual curation, memstore adds drift detection to flag facts that may have gone stale.

The three tiers map to different access patterns:

Hot context is CLAUDE.md — always loaded, containing build commands, project conventions, and baseline instructions. This is context the model needs on every prompt, regardless of what you’re working on.

Warm context is file-level and symbol-level facts, loaded by hooks when you touch a specific file. If you open auth.go, the hook surfaces facts about that file’s responsibilities, its invariants, and its relationship to other modules. This context is relevant but not universal — it fires on demand rather than on every prompt.

Cold context is the full search corpus, queried on demand through MCP tools. The model calls memory_search when it needs to look something up, the same way you’d search documentation. This is where the bulk of the fact store lives — design decisions, session summaries, cross-project knowledge — surfaced only when asked for.

Codebase Ingestion

The memory_learn tool walks Go source via AST and produces a four-level containment graph: repository to package to file to symbol. Each node gets a natural-language summary generated by a local Ollama model — currently mistral-small:24b running on Strix Halo hardware.
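The symbol level of that walk uses the standard go/ast and go/parser packages. This sketch collects only top-level function and type names from one file; the real pipeline builds the full repository-to-symbol graph and summarizes each node, which is omitted here.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const src = `package auth

// Login validates credentials.
func Login(user, pass string) bool { return user == "admin" }

type Session struct{ Token string }
`

// listSymbols parses a Go file and returns its top-level function and
// type declarations, the leaf nodes of the containment graph.
func listSymbols(filename, source string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, filename, source, 0)
	if err != nil {
		return nil, err
	}
	var syms []string
	for _, decl := range f.Decls {
		switch d := decl.(type) {
		case *ast.FuncDecl:
			syms = append(syms, "func "+d.Name.Name)
		case *ast.GenDecl:
			for _, spec := range d.Specs {
				if ts, ok := spec.(*ast.TypeSpec); ok {
					syms = append(syms, "type "+ts.Name.Name)
				}
			}
		}
	}
	return syms, nil
}

func main() {
	syms, err := listSymbols("auth.go", src)
	if err != nil {
		panic(err)
	}
	for _, s := range syms {
		fmt.Println(s) // func Login, type Session
	}
}
```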

Incremental updates are handled via content hashing. Each file’s hash is stored with its facts, and on subsequent runs, only files that have actually changed get re-processed. This keeps the learn cycle fast even on large codebases.
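The change check is simple: fingerprint each file, compare against the stored hash, and skip the expensive summarization when nothing moved. A sketch, with an in-memory map standing in for the hash column in the database (the hash algorithm is my assumption; any stable digest works):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashContent returns a stable fingerprint of a file's contents.
func hashContent(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// needsRelearn reports whether a file must be re-processed: only when
// its hash differs from the one stored alongside its facts. The seen
// map stands in for the per-file hash stored in the database.
func needsRelearn(seen map[string]string, path string, data []byte) bool {
	h := hashContent(data)
	if seen[path] == h {
		return false // unchanged since the last learn cycle
	}
	seen[path] = h
	return true
}

func main() {
	seen := map[string]string{}
	fmt.Println(needsRelearn(seen, "auth.go", []byte("package auth")))  // first pass: true
	fmt.Println(needsRelearn(seen, "auth.go", []byte("package auth")))  // unchanged: false
	fmt.Println(needsRelearn(seen, "auth.go", []byte("package auth2"))) // edited: true
}
```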

Quality tagging acknowledges a practical reality: local model summaries are lower-signal than what a frontier model would produce. They tend toward GoDoc-level descriptions — “this function does X” — rather than architectural insight like “this function is the entry point for the auth flow and must stay in sync with the session middleware.” The learn pipeline tags these summaries as drafts, marking them for lazy upgrade. When a frontier model encounters these facts during actual work, it can replace them with richer descriptions. The quality improves incrementally as the system gets used, without requiring an expensive bulk reprocessing step.

memstored: The Network Daemon

The original memstore was a CLI tool and MCP server that each agent instance ran locally. That worked fine for a single workstation, but it meant every machine needed its own SQLite database and a local Ollama instance for computing embeddings. Cross-machine memory sharing wasn’t possible without manual export and import.

memstored is an HTTP/JSON daemon that centralizes memory on the network. It runs on a dedicated VM with PostgreSQL and pgvector, computing 768-dimensional nomic-embed-text embeddings for every stored fact. The daemon exposes the full set of memstore operations as a REST API, and a thin MCP shim translates between the MCP protocol and HTTP calls.

The async embedding queue was an important architectural choice. When a fact is stored, it’s immediately persisted with its text, metadata, and category — available for FTS search right away. The embedding computation happens in the background. This means the store operation returns in milliseconds rather than blocking on an Ollama inference call, and the fact becomes available for vector search shortly after. For bulk operations like codebase learning, where hundreds of facts get stored in rapid succession, this is the difference between a usable tool and a bottleneck.
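The shape of that queue can be sketched with a buffered channel and one background worker. Everything here is illustrative (the real daemon persists to PostgreSQL and calls Ollama); the structural point is that the store path only enqueues and never waits on inference.

```go
package main

import (
	"fmt"
	"sync"
)

// embed stands in for the Ollama call that computes a 768-dimensional
// nomic-embed-text vector; the placeholder keeps the sketch runnable.
func embed(text string) []float64 {
	return []float64{float64(len(text))} // placeholder, not a real embedding
}

// storeAll sketches the daemon's flow: each fact is "persisted" for
// FTS immediately, while its embedding job goes onto a buffered
// channel drained by a background worker. Returns the embeddings once
// the queue is empty.
func storeAll(texts []string) map[int][]float64 {
	type job struct {
		id   int
		text string
	}
	queue := make(chan job, len(texts))
	embedded := make(map[int][]float64)
	var mu sync.Mutex
	var wg sync.WaitGroup

	wg.Add(1)
	go func() { // background embedder
		defer wg.Done()
		for j := range queue {
			v := embed(j.text)
			mu.Lock()
			embedded[j.id] = v
			mu.Unlock()
		}
	}()

	for i, t := range texts {
		// ...insert the row here: the fact is FTS-searchable right away...
		queue <- job{id: i, text: t} // the store call returns without waiting on inference
	}
	close(queue)
	wg.Wait()
	return embedded
}

func main() {
	got := storeAll([]string{"fact one", "fact two", "fact three"})
	fmt.Println(len(got)) // every queued fact eventually becomes vector-searchable
}
```

A buffered channel also gives natural backpressure during bulk learns: if the embedder falls behind, stores slow down instead of dropping work.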

The /v1/recall endpoint is purpose-built for hook injection. A Claude Code hook fires on every prompt and sends the user’s text to memstored. The server extracts keywords using IDF scoring against the full corpus, runs the hybrid search, applies project and file-context boosts, deduplicates against facts already injected in the current session, and returns a curated context block. The hook injects this block into the conversation, and the model sees it as part of the prompt context.

Session-aware deduplication keeps the context window clean across a long coding session. memstored tracks which facts have been injected per session and never repeats them. The first prompt gets the foundational context; subsequent prompts get increasingly specific facts that the earlier injections didn’t cover.

The latency improvement alone justified the move. The hooks previously spawned a CLI process on every invocation — about 200ms of overhead per prompt. The HTTP call to memstored takes about 30ms. When a hook fires on every prompt, that difference adds up.

The Hook System

Five Claude Code hooks wire memstore into the agent’s lifecycle, surfacing context automatically without the agent needing to search for it. The hooks handle prompt-level recall, file-open context injection, session tracking, and task surfacing. The details deserve their own post — see Proactive Context Injection with Claude Code Hooks.

What I’ve Learned

What Works

Supersession chains are the right primitive for evolving knowledge. The alternative — overwriting facts in place — loses history and makes it impossible to understand how a decision changed over time. Chains preserve the full lineage, and the auto-supersession mechanism means you rarely need to manage them manually.

Task surfacing across sessions genuinely solves the continuity problem. When a session ends with identified follow-up work, that work gets stored as a task and surfaces automatically at the start of the next session. The agent doesn’t need to be told “we were working on X” — it already knows.

Hybrid search outperforms either FTS or vector alone for this use case. FTS gives precision when you know the right keywords; vector search gives recall when you don’t. The combination catches what each misses individually.

The MCP protocol is the right integration point. It’s standardized, tool-native, and supported by the major AI coding assistants. Building on MCP means memstore works with any MCP-compatible agent, not just Claude Code.

What Doesn’t (Yet)

Local model summaries are low-signal. The mistral-small:24b summaries generated during codebase learning are accurate but shallow — they describe what a function does without capturing why it matters or how it fits into the broader architecture. The quality-tagging system is the mitigation, but lazy upgrade only works when a frontier model actually encounters and improves the facts during normal use. For a large codebase with hundreds of symbols, most summaries will remain at draft quality for a long time.

Vector search at scale needs approximate nearest-neighbor indexes. The current implementation does exact cosine similarity, which works at 1,500 facts but won’t scale to tens of thousands. pgvector supports HNSW indexes, and that’s the planned path forward.

Surprises

Temporal decay matters more than I expected. Without it, notes from early development sessions — exploratory ideas, temporary workarounds, half-formed plans — accumulate and crowd out current facts in search results. The 30-day decay on the “note” category was one of those changes that seemed minor and turned out to be essential.

The 0.85 cosine threshold for auto-supersession was arrived at empirically. I started at 0.80 and watched unrelated facts merge. I moved to 0.90 and watched near-duplicates pile up. 0.85 has been stable for weeks with no false merges and no significant duplicate accumulation.

The metadata conflict guard was the change I didn’t know I needed. Without it, two facts about the same subject — say, different endpoints in an API — would silently replace each other if they were semantically similar enough. The guard checks for conflicting metadata values before allowing auto-supersession, and it’s prevented dozens of false merges since I added it.

By the Numbers

  • ~132 commits over four weeks (February 15 to March 14, 2026)
  • ~1,500 active facts across a dozen projects
  • Schema version 7, with cumulative migrations V1 through V7
  • 18 MCP tools, 5 Claude Code hooks
  • ~28,000 lines of Go

What’s Next

HNSW vector indexes. Exact cosine similarity search is fine at 1,500 facts but will become a bottleneck as the corpus grows. pgvector’s HNSW support is the natural upgrade path, and the search interface already abstracts the backend well enough that adding an index type shouldn’t require changes above the store layer.

Non-code ingestion. The codebase learner handles Go source, but projects generate knowledge in other forms — design documents, wiki pages, conversation transcripts, API specifications. A general ingestion pipeline that can process Markdown, HTML, and plain text would expand the knowledge base significantly without requiring manual fact creation.

Curator model. Right now, the recall pipeline decides what to inject based on scoring heuristics. A small local model sitting between the search results and the injection point could filter and rank more intelligently — considering not just relevance but redundancy with what the model already knows from previous turns. This is the kind of problem that’s too nuanced for scoring weights but well within reach of a 7B model doing classification.


Related posts: Proactive Context Injection with Claude Code Hooks, Fact Supersession: Version Control for Knowledge, Operational Lessons from 1,500 Memstore Facts

Project: github.com/matthewjhunter/memstore

Prior work: Codified Context: How AI Assistants Learn Across Conversations