Building a persistent memory system for AI agents is one problem. Operating it at scale is a different one. After two months of daily use, memstore holds around 1,500 active facts across a dozen projects — project architecture, coding conventions, design decisions, cross-cutting invariants, and roughly a thousand symbol-level descriptions generated by running an LLM over every Go function in every codebase I work on. The system works. The recall pipeline surfaces relevant context on every prompt without manual search. But getting from “works” to “works well” required a series of scoring adjustments, noise filters, and feedback mechanisms that the original design didn’t anticipate.

This is the tuning story.

The Noise Problem

The first version of the recall pipeline was simple: extract keywords from the user’s prompt, search the fact store, return the top results. It worked in early testing when the database held a hundred facts, most of them hand-written project descriptions and conventions. Every fact was high-signal because a human had written it.

Then I ran memstore learn across my codebases.

The codebase learner walks Go source via AST, extracts every package, file, and exported symbol, and feeds each through a local Ollama model for summarization. The result is a four-level containment graph — repo, package, file, symbol — with a natural-language description at every node. This is useful context when you’re editing a specific file: the hook can surface “this function does X and must stay in sync with Y” before Claude touches it.

The problem is that these summaries outnumber everything else ten to one. A search for “authentication” returns twelve facts about auth-related functions and buries the design decision explaining why we chose that auth approach. The GoDoc-level descriptions are individually correct and collectively useless for recall — they’re reference material masquerading as context.

Demoting Symbols

The fix was a scoring adjustment in the recall pipeline, not a change to what gets stored. Symbol facts carry surface: "symbol" in their metadata. During recall scoring, any fact with that metadata gets a 0.2x multiplier — an 80% penalty. They still exist in the database. They still surface through the file-specific hooks when you’re reading or editing the file they describe. But they no longer crowd out architectural facts in prompt-level recall.
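In sketch form, the demotion is a single metadata check during scoring. The Fact shape and field names here are illustrative, not memstore’s actual types; only the 0.2 constant comes from the description above:

```go
// Illustrative fact shape; not memstore's actual types.
type Fact struct {
	ID       string
	Kind     string            // "decision", "convention", "invariant", "pattern", ...
	Metadata map[string]string // learner output carries surface: "symbol"
	Score    float64           // running recall score for this query
}

const symbolDemotion = 0.2 // 80% penalty for symbol-surface facts

// applySymbolDemotion keeps symbol facts in the database but pushes them
// down in prompt-level recall.
func applySymbolDemotion(f *Fact) {
	if f.Metadata["surface"] == "symbol" {
		f.Score *= symbolDemotion
	}
}
```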

This was the single highest-impact change to recall quality. The second-highest was related.

Kind-Based Scoring

Not all facts are equal, and the system should know it. Facts in memstore carry a kind field — decision, convention, invariant, pattern, trigger, among others. The original scoring treated them all the same. But when you’re asking about how something works, a design decision is worth more than a code pattern, and a cross-cutting invariant is worth more than either.

The scoring now boosts decision facts by 1.5x and convention and invariant facts by 1.3x. Decisions outrank everything because they capture why — the reasoning that doesn’t survive in code. Conventions and invariants outrank patterns because they represent constraints that apply across files.
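Building on the Fact sketch above, the kind boosts amount to a multiplier table. The constants are the ones just described; the function and variable names are made up:

```go
// kindBoost maps a fact's kind to its recall multiplier. Decisions outrank
// conventions and invariants, which outrank everything else.
var kindBoost = map[string]float64{
	"decision":   1.5,
	"convention": 1.3,
	"invariant":  1.3,
}

func applyKindBoost(f *Fact) {
	if boost, ok := kindBoost[f.Kind]; ok {
		f.Score *= boost
	}
}
```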

Combined with symbol demotion, the effect is dramatic. A search that used to return ten function signatures now returns the design decision, two conventions, and a couple of genuinely relevant patterns, with function signatures pushed below the score threshold.

The Smart Project Boost

Working in a project should bias recall toward that project’s facts — if I’m in the memstore repo, memstore facts should outrank Herald facts for the same query. The original implementation was a blanket 2.5x boost for facts matching the current project.

This interacted badly with the symbol problem. The 2.5x project boost on a symbol fact from the current project made it score higher than a decision fact from the same project. The boost intended to surface project-relevant context was actually amplifying the noise.

The fix: the project boost no longer applies to symbol-surface facts. Symbols already get surfaced through the file-level hooks, where they belong. At the prompt level, the project boost applies only to the facts that benefit from it — decisions, conventions, invariants, and patterns.
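Sketched the same way, the boost becomes conditional. How the project and surface are recorded in metadata is an assumption here; the 2.5x constant and the allowlist of kinds follow the description above:

```go
// projectBoostKinds lists the kinds that still receive the project boost.
var projectBoostKinds = map[string]bool{
	"decision":   true,
	"convention": true,
	"invariant":  true,
	"pattern":    true,
}

// applyProjectBoost biases recall toward the current project's facts, but
// skips symbol-surface facts so the boost can't amplify learner noise.
// The metadata keys are illustrative.
func applyProjectBoost(f *Fact, currentProject string) {
	if f.Metadata["project"] != currentProject {
		return
	}
	if f.Metadata["surface"] == "symbol" || !projectBoostKinds[f.Kind] {
		return // symbols and reference-level facts surface via file hooks instead
	}
	f.Score *= 2.5
}
```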

IDF Keyword Extraction

The recall pipeline doesn’t use the raw prompt as a search query. It extracts keywords using IDF (inverse document frequency) scoring, which measures how distinctive each word is across the entire fact corpus.

The word “function” appears in hundreds of facts and has low IDF — searching for it returns noise. The word “supersession” appears in a handful of facts and has high IDF — it’s a strong signal for what the user actually cares about. The pipeline extracts up to five keywords, filtered by a dynamic IDF floor set at 15% of log(N) where N is the corpus size, with an absolute minimum of 0.5.
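Roughly, the extraction step looks like the sketch below. The tokenization, the document-frequency map, and the choice of natural log are simplifications on my part; the five-keyword cap and the floor formula are as described above:

```go
import (
	"math"
	"sort"
	"strings"
)

// extractKeywords picks up to five of the most distinctive words from a
// prompt. docFreq maps each term to the number of facts containing it;
// corpusSize is N, the total number of facts.
func extractKeywords(prompt string, docFreq map[string]int, corpusSize int) []string {
	// Dynamic floor: 15% of log(N), never below 0.5. Natural log assumed here.
	floor := math.Max(0.15*math.Log(float64(corpusSize)), 0.5)

	type scored struct {
		word string
		idf  float64
	}
	seen := make(map[string]bool)
	var candidates []scored
	for _, w := range strings.Fields(strings.ToLower(prompt)) {
		if seen[w] {
			continue
		}
		seen[w] = true
		df := docFreq[w]
		if df == 0 {
			continue // a word appearing in no facts can't match anything
		}
		idf := math.Log(float64(corpusSize) / float64(df))
		if idf >= floor {
			candidates = append(candidates, scored{w, idf})
		}
	}

	// Keep the five most distinctive terms.
	sort.Slice(candidates, func(i, j int) bool { return candidates[i].idf > candidates[j].idf })
	keywords := make([]string, 0, 5)
	for _, c := range candidates {
		if len(keywords) == 5 {
			break
		}
		keywords = append(keywords, c.word)
	}
	return keywords
}
```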

This matters more than it sounds. Early recall used the full prompt as a search query. A prompt like “fix the authentication bug in the login handler” would match every fact mentioning “the” and “in” along with the actually relevant terms. IDF extraction picks “authentication,” “login,” and “handler” — the words that discriminate.

When multiple extracted keywords hit the same fact, there’s an additional boost: 1.0 + 0.2 per additional keyword hit. A fact that matches three of your five keywords scores 1.4x compared to one that matches only one. This is a simple but effective proxy for relevance: the more of your distinctive terms a fact contains, the more likely it addresses your actual question.
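As a sketch, assuming the per-fact hit count has already been computed against the extracted keywords:

```go
// applyKeywordOverlapBoost rewards facts that match more of the extracted
// keywords: 1.0x for one hit, 1.2x for two, 1.4x for three, and so on.
func applyKeywordOverlapBoost(f *Fact, hits int) {
	if hits > 1 {
		f.Score *= 1.0 + 0.2*float64(hits-1)
	}
}
```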

Session Deduplication

A long coding session might span forty or fifty prompts. Without deduplication, the recall pipeline would inject the same high-scoring project description on every prompt where the keywords overlap. The model already knows this context — it was injected three prompts ago — but the pipeline doesn’t know that unless it tracks state.

The memstored daemon maintains per-session injection tracking. When a fact is included in a recall response, its ID is recorded against the session. Subsequent prompts in the same session skip facts that have already been injected. The context window stays clean, and each prompt gets genuinely new information rather than a rehash of what the model already has.

This was straightforward to implement — a session-keyed set of fact IDs — but the behavioral difference is significant. Without it, a ten-prompt session about authentication would inject the same three auth facts on every prompt, burning context budget on repetition. With it, the first prompt gets the foundational context and subsequent prompts get increasingly specific facts that the earlier injections didn’t cover.
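A minimal sketch of that tracker, with illustrative names; the real daemon also has to worry about session expiry and persistence:

```go
import "sync"

// injectionTracker remembers which facts have already been injected into
// each session, so later prompts only receive new context.
type injectionTracker struct {
	mu   sync.Mutex
	seen map[string]map[string]bool // sessionID -> set of injected fact IDs
}

func newInjectionTracker() *injectionTracker {
	return &injectionTracker{seen: make(map[string]map[string]bool)}
}

// filterNew returns the facts not yet injected into this session and marks
// them as injected.
func (t *injectionTracker) filterNew(sessionID string, facts []*Fact) []*Fact {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.seen[sessionID] == nil {
		t.seen[sessionID] = make(map[string]bool)
	}
	var fresh []*Fact
	for _, f := range facts {
		if !t.seen[sessionID][f.ID] {
			t.seen[sessionID][f.ID] = true
			fresh = append(fresh, f)
		}
	}
	return fresh
}
```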

Closing the Feedback Loop

The recall pipeline makes scoring decisions — which facts to boost, which to demote, which to inject. But until recently, there was no signal flowing back from the model about whether those decisions were right. The system generated context; the model consumed it; nobody kept score.

The memory_rate_context tool lets the model rate injected facts as helpful or not. The auto-rating system processes these ratings after sessions end. But for the first several weeks, this feedback was write-only analytics. The ratings were stored but never read back during recall scoring.

Closing that loop required three pieces: a FeedbackScorer interface that queries aggregate feedback scores for a batch of facts, server-side recording of which facts were injected in each session, and a feedback multiplier applied during recall scoring. The multiplier is 1.0 + 0.3 * avg, where avg is the mean feedback score ranging from -1.0 to +1.0. A fact that consistently gets positive feedback scores up to 1.3x; one that consistently gets negative feedback drops to 0.7x.
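Roughly, and assuming a method shape for the FeedbackScorer interface that may differ from the real one:

```go
// FeedbackScorer returns the mean feedback score for each fact in a batch,
// in the range [-1.0, +1.0]. The method name and signature are illustrative.
type FeedbackScorer interface {
	AverageScores(factIDs []string) (map[string]float64, error)
}

// applyFeedbackBoost folds aggregate feedback into recall scoring:
// consistently helpful facts rise toward 1.3x, unhelpful ones sink toward 0.7x.
func applyFeedbackBoost(f *Fact, avg float64) {
	f.Score *= 1.0 + 0.3*avg
}
```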

The effect is gradual and self-correcting. A fact that the model finds unhelpful slowly sinks in the rankings. A fact that proves useful slowly rises. The system learns not just what it knows, but what’s worth surfacing.

The Status Output Incident

A minor war story about operating at scale. The memory_status MCP tool returns a summary of the fact database — counts by category, kind, and subject. When the database held a few hundred facts, this was a compact, useful overview. At 1,488 facts, the output ballooned to 12,900 tokens. The status response was consuming more context than the actual work.

The fix was server-side summarization: categories and kinds sorted by count, synthetic subjects (auto-generated by the learn pipeline) collapsed into a single count, and semantic subjects capped at twenty entries. The output dropped from 12.9k tokens to about thirty lines. This isn’t a deep architectural lesson, but it’s a reminder that anything injected into an LLM context window is a resource with a cost, and what fits at a hundred records won’t fit at a thousand.
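A sketch of the subject-level half of that summarization. How the learn pipeline tags its synthetic subjects isn’t covered here, so that check is left abstract:

```go
import "sort"

type subjectCount struct {
	Subject string
	Count   int
}

// summarizeSubjects collapses synthetic (learner-generated) subjects into a
// single total and keeps only the top twenty semantic subjects by count.
// isSynthetic stands in for however the real system tags learner output.
func summarizeSubjects(counts map[string]int, isSynthetic func(string) bool) ([]subjectCount, int) {
	var top []subjectCount
	syntheticTotal := 0
	for subject, n := range counts {
		if isSynthetic(subject) {
			syntheticTotal += n
			continue
		}
		top = append(top, subjectCount{subject, n})
	}
	sort.Slice(top, func(i, j int) bool { return top[i].Count > top[j].Count })
	if len(top) > 20 {
		top = top[:20]
	}
	return top, syntheticTotal
}
```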

The Relevance Floor

Not every fact that matches a keyword is worth injecting. The recall pipeline applies a minimum score ratio of 0.3 — any fact scoring below 30% of the top-scoring result is dropped. This is a simple but necessary filter. Without it, a five-keyword prompt would return five facts matched on a single low-IDF keyword alongside the two or three facts that are actually relevant. The floor ensures that only facts in the same relevance ballpark as the best result make it through.

The overfetch strategy supports this: the pipeline fetches 2x the requested limit from the search index, applies all the scoring boosts and demotions, and then takes the top N after re-ranking. This means the final results are the best of a broader candidate pool, not just the raw search engine output.
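Put together as a sketch, with the 2x overfetch happening in the caller before this final re-ranking step:

```go
import "sort"

// rankAndFilter re-ranks an overfetched candidate pool after all boosts and
// demotions have been applied, drops anything below 30% of the top score,
// and returns at most limit facts.
func rankAndFilter(candidates []*Fact, limit int) []*Fact {
	if len(candidates) == 0 {
		return nil
	}
	sort.Slice(candidates, func(i, j int) bool { return candidates[i].Score > candidates[j].Score })
	floor := candidates[0].Score * 0.3

	var kept []*Fact
	for _, f := range candidates {
		if f.Score < floor {
			break // sorted descending, so nothing after this passes either
		}
		kept = append(kept, f)
	}
	if len(kept) > limit {
		kept = kept[:limit]
	}
	return kept
}
```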

Writing This Post with Memstore

There’s a circularity to writing about memstore’s operational lessons using memstore itself. When I started this article, I asked Claude to analyze the draft posts in this blog against the current memstore codebase. Claude searched memstore and pulled back session summaries describing the recall relevance fixes, the feedback loop implementation, the transcript upload race condition, and the status output incident. The specific scoring constants — the 0.2x symbol demotion, the 1.5x decision boost, the 0.3 feedback weight — came from reading the source, but the narrative of what happened and why came from memstore’s own records of the development sessions where those changes were made.

This is the value proposition in miniature. Two weeks after implementing the recall relevance fixes, I could reconstruct not just what changed but the reasoning: that symbol facts were drowning out decisions, that the project boost was amplifying noise, that IDF extraction was necessary because raw prompt search matched too broadly. The code shows the current state. The git log shows what changed. Memstore shows why, in the language of the conversations where the decisions were made.

It’s also an honest demonstration of the limits. The session summaries that memstore stored were generated by a local model — and they read like it. They’re accurate but generic, heavy on enumeration and light on narrative. The facts I used as source material for this article needed substantial rewriting to become prose worth reading. Memstore solved the recall problem — finding the right context at the right time — but it doesn’t solve the quality problem. That still requires a better model or a human editor.

For now, it requires both.


Related posts: Building Persistent Memory for AI Agents (flagship post), Proactive Context Injection with Claude Code Hooks, Fact Supersession: Version Control for Knowledge

Project: github.com/matthewjhunter/memstore