Self-Improving Recall: A Feedback Loop for AI Memory
A memory system that ranks facts the same way forever is dead weight. The signal that actually matters — did this fact help, or did it waste context — only exists during a real conversation. Memstore’s feedback loop captures that signal in-session and feeds it back into recall ranking, so the system gets better at surfacing useful knowledge the more it’s used.
The problem: static recall is stale recall
Memstore’s baseline ranking uses static signals — IDF, project boosts, semantic similarity, recency, surface-aware multipliers for project-level facts. All of them are derived from the fact itself, the query, or the static metadata around them. None answers the real question: when memstore injected this fact last time, did it help the agent or did it just crowd out something better?
Without usage feedback, low-signal facts (GoDoc-level symbol summaries, outdated design notes) keep surfacing because they match keywords well, and good facts that don’t match keywords perfectly stay buried. The ranking gets stuck on what looks relevant rather than what was relevant the last hundred times it came up.
The injection record
Every fact surfaced by /v1/recall is written to a context_injections table with session_id, ref_id, ref_type, and the candidate-list rank. The same table powers server-side dedup, so a fact injected earlier in a session won't be re-injected. The record is the foundation: you can't rate what you didn't track. The dedup is the immediate benefit; the ranking signal is the longer payoff.
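To make the record concrete, here's a minimal sketch assuming SQLite via Go's database/sql; the DDL, the primary key, and the alreadyInjected helper are illustrative shapes, not memstore's actual schema or API.

package memstore

import "database/sql"

// Illustrative DDL: the post names session_id, ref_id, ref_type, and
// rank; the types and primary key here are assumptions.
const createContextInjections = `
CREATE TABLE IF NOT EXISTS context_injections (
    session_id TEXT NOT NULL,  -- session the fact was surfaced in
    ref_id     TEXT NOT NULL,  -- which fact was injected
    ref_type   TEXT NOT NULL,  -- kind of ref (fact, symbol summary, ...)
    rank       INTEGER,        -- position in the recall candidate list
    PRIMARY KEY (session_id, ref_id)
)`

// alreadyInjected is the server-side dedup check: a fact surfaced
// earlier in the same session is skipped on later recalls.
func alreadyInjected(db *sql.DB, sessionID, refID string) (bool, error) {
    var n int
    err := db.QueryRow(
        `SELECT COUNT(*) FROM context_injections
         WHERE session_id = ? AND ref_id = ?`,
        sessionID, refID,
    ).Scan(&n)
    return n > 0, err
}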
Auto-rating at session end
When the SessionEnd hook fires, memstore queues an extraction job that reads the session transcript alongside the list of facts injected during that session and asks a local model to score each fact on whether it actually helped. The model is currently gemma4; we’ve cycled through a few, and the bar is “long enough context to hold the transcript plus the candidate list, smart enough to tell informed-the-answer from injected-and-ignored”. The output is a per-fact rating in [-1, +1] written to a context_feedback row.
The job runs asynchronously, and that asynchrony is exactly why it's batched: the hook hands off one unit of work to memstore's extract queue and returns immediately, so the next session doesn't wait for the rating pass to complete. Handing off N per-fact tasks instead would force the queue to serialize through them before the backend could do anything else. The principle is consistent across memstore: anything that touches a model runs on the queue, and the queue is sized so the user never blocks on it.
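A sketch of that handoff shape, assuming a simple channel-backed queue; RateSessionJob, ExtractQueue, and the rate callback are hypothetical names for illustration, not memstore's actual API.

package memstore

// RateSessionJob is one unit of work: the whole session's rating
// pass, not one task per fact, so a single enqueue covers everything.
type RateSessionJob struct {
    SessionID string
}

// ExtractQueue serializes all model-touching work.
type ExtractQueue struct {
    jobs chan RateSessionJob
}

// OnSessionEnd is the hook side: enqueue one job and return
// immediately, so the next session never waits on the rating pass.
func (q *ExtractQueue) OnSessionEnd(sessionID string) {
    q.jobs <- RateSessionJob{SessionID: sessionID}
}

// run drains the queue in the background. The rate callback reads the
// transcript plus the session's injection records, asks the local
// model for per-fact scores in [-1, +1], and writes context_feedback
// rows.
func (q *ExtractQueue) run(rate func(RateSessionJob) error) {
    for job := range q.jobs {
        // Errors are non-fatal: backfill (described below) will
        // re-enqueue any session whose injections never got rated.
        _ = rate(job)
    }
}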
Why auto rather than manual: the agent is the right judge because it actually saw the injection and either used it or didn’t, and manual rating doesn’t happen in practice — even motivated users won’t sit through “rate these twelve facts” prompts every session. The local-model auto-rater is a worse judge than the user would be on the median fact, but it’s a judge that exists across hundreds of sessions, which is what the ranking signal needs.
Confidence-weighted ranking
A single positive or negative rating shouldn’t swing a well-established fact. The recall path applies feedback as a multiplier whose strength depends on how many sessions have rated the fact:
conf = min(count, 5) / 5
exponent = avg * (0.4 + 0.6 * conf)
multiplier = 2.0 ^ exponent
At one rating, confidence is 0.2 and a +1 average produces a ×1.43 boost (a -1 produces ×0.70): gentle nudges. At five or more ratings, confidence is full and a ±1 average produces the full ×2.0 or ×0.5. The confidence ramp is the point: a fact that's consistently helpful across many sessions earns a real boost, but a single bad session can't bury a well-established one.
The aggregate itself is just AVG(score) and COUNT(*) over the context_feedback table grouped by ref_id — no temporal decay on the feedback, every historical rating counts equally. That’s a deliberate simplification. Feedback volume per fact is low enough that down-weighting older ratings would mostly just discard signal that’s already sparse.
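Putting the aggregate and the multiplier together, a minimal sketch assuming context_feedback has (ref_id, score) columns as described; feedbackMultiplier is an illustrative name.

package memstore

import (
    "database/sql"
    "math"
)

// feedbackMultiplier computes the confidence-weighted multiplier for
// one fact: AVG and COUNT over all historical ratings, no decay.
func feedbackMultiplier(db *sql.DB, refID string) (float64, error) {
    var avg sql.NullFloat64
    var count int
    err := db.QueryRow(
        `SELECT AVG(score), COUNT(*) FROM context_feedback WHERE ref_id = ?`,
        refID,
    ).Scan(&avg, &count)
    if err != nil {
        return 1.0, err
    }
    if count == 0 {
        return 1.0, nil // cold start: static scoring rides alone
    }
    conf := math.Min(float64(count), 5) / 5    // full confidence at 5 ratings
    exponent := avg.Float64 * (0.4 + 0.6*conf) // damped toward 0 at low confidence
    return math.Pow(2.0, exponent), nil        // ×0.5 demotion .. ×2.0 boost
}

At count = 1 and avg = +1 this returns 2^0.52 ≈ 1.43, matching the numbers above.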
Backfill and bootstrapping
A backfill-feedback command re-runs the auto-rating pipeline over historical sessions that have injection records but no corresponding feedback rows. It runs on demand and on daemon startup, so any session that landed while the daemon was down gets picked up the next time it comes up. Without backfill, the feedback signal starts empty and takes weeks of organic session activity to become useful; with it, the signal becomes useful as soon as you ingest existing transcripts.
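The query shape is roughly a left anti-join between the two tables. A sketch, under the added assumption that context_feedback rows also record the session_id they came from; the function and parameter names are illustrative.

package memstore

import "database/sql"

// backfillFeedback finds sessions that have injection records but no
// feedback rows and re-enqueues them for the auto-rating pass. It runs
// on demand and at daemon startup.
func backfillFeedback(db *sql.DB, enqueue func(sessionID string)) error {
    rows, err := db.Query(`
        SELECT DISTINCT ci.session_id
        FROM context_injections ci
        LEFT JOIN context_feedback cf ON cf.session_id = ci.session_id
        WHERE cf.session_id IS NULL`)
    if err != nil {
        return err
    }
    defer rows.Close()
    for rows.Next() {
        var sessionID string
        if err := rows.Scan(&sessionID); err != nil {
            return err
        }
        enqueue(sessionID) // same rating job as a live SessionEnd
    }
    return rows.Err()
}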
What it catches
- Stale facts: a design decision that’s been superseded but still ranks high for keyword match — gets negatively rated across sessions, demoted
- Cross-project crosstalk: facts from project A leaking into project B sessions — rated as unused, demoted for that project
- Low-signal symbol facts: auto-generated summaries that technically match but don’t help — rated down, eventually fade
What it doesn’t
- Cold start: a new fact has no feedback signal, so it rides on static scoring until it’s been injected enough times to accumulate ratings
- Judgment quality: the local rating model has to be good enough to tell “this fact informed the agent’s answer” from “this fact was injected and ignored”
- Gaming: if the rating model and the generating model share failure modes, feedback can reinforce bad patterns rather than correct them
Connection to the bigger picture
Supersession handles knowledge that changed — the fact itself was revised, and the chain records the history. Feedback handles knowledge that failed — the fact was accurate but unhelpful, and the multiplier records the verdict. Together they let the memory system treat knowledge as a living thing: edit old facts when they’re wrong, deprioritize ones that don’t earn their place in context, preserve history for audit.
Links
- GitHub: github.com/matthewjhunter/memstore
- Flagship post: Building Persistent Memory for AI Agents
- Hooks post: Proactive Context Injection with Claude Code Hooks
- Supersession post: Fact Supersession: Version Control for Knowledge

