The Layers That Didn't Hold
A few weeks ago I wrote that defense in depth for AI agents means layers, not walls: screen untrusted content before the model acts on it, sanitize what comes back out, and never trust the data flowing through. Clean theory. Then I went back and read the code in Herald that was supposed to implement those layers.
Several of them didn’t hold.
Herald is my feed reader. It pulls RSS and Atom from across the internet, runs each article through a local security model before anything else touches it, scores the survivors for relevance, and announces the interesting ones. Every feed item is untrusted content aimed at a model. That’s the whole premise of the defense-in-depth piece, and it’s exactly the threat I built Herald to study. What follows is the v0.2.0 hardening pass – the bugs the theory missed, and a couple of ideas that worked.
A word on where these bugs came from, because they share a cause. Herald runs entirely on local models – no frontier APIs – and between 0.1.0 and 0.2.0 I churned that code hard: a lot of models tested across several machines, each one handled a little differently by the security screen, the grouping pass, and the summarization pass. Every model fails its own way, and chasing those differences reorganized all three pathways more than once. The standing tension was whether to switch models per task and process articles in parallel to go faster, or run one model at a time in focused stages to hold memory down and dodge the GPU contention that forces a hot reload between requests. That churn landed squarely on the code paths where the worst bugs in this post turned up. Churn breeds mistakes.
A fence the feed could climb over
Start with the layer that was most obviously wrong, because it’s the one most people get wrong.
To screen feed text, you wrap it in a delimiter and hand the model a prompt: “here is some untrusted article content, between these markers, tell me if it’s trying to manipulate you.” Herald’s marker was a static <article> tag, and the content was interpolated verbatim.
A static delimiter is not a fence. It’s a suggestion. A hostile feed embeds a literal </article> in its own body, closes the fence from the inside, and everything after it lands in the trusted region of the prompt – the part the model reads as instructions rather than data. The title wasn’t fenced at all.
The fix is a per-call random nonce. internal/ai/fence.go generates 16 bytes from crypto/rand for every single prompt invocation and builds the delimiter as <untrusted-{nonce}>. The content can’t close a tag it can’t predict.
That alone would mostly do it. But “mostly” is not the standard I wanted, so there’s a second layer that doesn’t depend on the nonce staying secret:
var fenceTagRe = regexp.MustCompile(`(?i)</?(?:untrusted|article)(?:-[0-9a-f]+)?\s*>`)
func neutralizeFence(s string) string {
return fenceTagRe.ReplaceAllString(s, "[tag removed]")
}
Any delimiter-shaped sequence is stripped out of the untrusted text before it goes anywhere near the template. A leaked nonce can’t help the attacker, and a legacy custom prompt still using the old <article> fence is covered too.
The interesting part is what the regex doesn’t match. It deliberately ignores tags carrying attributes, so <article class="post"> survives untouched. That’s the point: I want the model to see the real HTML markup in a feed so it can judge whether the content is hostile. Stripping all tags would blind the security check to exactly the structure it’s there to evaluate. The neutralizer removes the two bare tags that could break the fence and leaves everything else for the model to inspect.
And it applies to all six prompts that embed feed-derived text – security, curation, summarization, group summary, newsletter, related groups – including the title that had no fence at all. A defense you apply to three of six paths is a defense you don’t have.
The security layer was a comment
This one is worse, and it’s the kind of bug that only shows up when you read the code instead of the diagram.
The architecture was right on paper. Security check first; if it blocks or flags the article, nothing downstream runs. There was even a comment, on the shared content-length constant, asserting that the screen gated the models behind it.
The comment was true. The code was not.
processArticlesForUser ran the security check and the summarizer concurrently, in the same errgroup. So the summarizer saw unscreened content regardless of the verdict, and happily generated and cached a summary for articles the screen was, at that same moment, deciding to block. The gate was drawn in the comment and nowhere in the control flow. The model that was supposed to be protected by the screen never waited for it.
The fix pulled the per-article logic into its own function, screenAndScoreArticle, behind narrow interfaces so the ordering is actually testable. Security runs first and alone. A failure, a hard block, or a medium flag returns before any summarizer or curator call – no summary generated, none cached. Only an article that clears the full threshold gets summarized and curated, and those two run concurrently with each other, because by then they’re both operating on screened content.
The tests now assert the thing the comment used to assert: blocked, flagged, and errored articles never reach SummarizeArticle or CurateArticle. If you have a security layer, the order of operations is the security property. Concurrency that ignores the order doesn’t have the layer – it has a comment about one.
Sanitize at the source, not at the view
Herald speaks the Fever API
, so phone clients like Reeder can sync against it. The web view ran feed HTML through a bluemonday
sanitizer policy. The Fever API emitted the same article content into its html field with nothing but the ingest-time null-byte strip.
Fever clients render that field as HTML. So a hostile feed’s stored <script> or onerror= markup sailed straight through Herald to the client – a stored XSS, delivered by the feed reader that was supposed to be the thing protecting the reader. The sanitizer existed; it just sat in front of one of the two doors.
The fix is the obvious one once you see it: sanitization belongs at a single package-level chokepoint, sanitizeHTML(), that every output path calls – web view, newsletter, and the Fever html field. Not a policy lazily initialized inside one HTTP handler.
That lazy init() had a second problem hiding behind the first. The Fever path never called it, so the sanitizer policy was nil for any Fever-only client that hit the server before a browser ever loaded a page. The same change that closed the XSS closed a latent nil-pointer panic, because moving the policy to a package-level value made “is it initialized” a question that couldn’t be answered wrong. Sanitize where the data is produced, once, not at each place it happens to be rendered.
Telling a transient failure from a terminal one
The other half of v0.2.0 wasn’t injection. It was a category error the pipeline kept making about its own failures, and it cost real data.
When an article fails AI processing, there are two completely different reasons, and they demand opposite responses. Either the content can’t be processed – too short to embed, no body, genuinely malformed – and you should stop trying, forever. Or the backend hiccuped – Ollama timed out, the circuit breaker was open, a transient 500 – and you should absolutely try again later. Treat the first like the second and you hot-loop forever on garbage. Treat the second like the first and you permanently discard good articles because the model server blinked.
Herald did the second one, in two places, and I have the numbers.
The scoring path tracked an ai_retries counter and dropped an article from the queue after three failures. Sensible – except it incremented on any error, including a circuit-breaker-open rejection that never even reached the model. A one-hour Ollama outage failed every in-flight article fast, burned all three retries in a few cycles without a single genuine scoring attempt, and orphaned them. Eight articles died that way in one outage; 553 sat stranded at the retry cap. The embedding path was the same shape and worse: every failed embed wrote one identical []byte{0} sentinel, and the backfill read every sentinel as “handled, skip forever.” A saturation burst on the morning of May 9th sentineled 15,062 articles that should have had perfectly good embeddings. Permanent skip records, every one, that no future cycle would ever revisit.
The fix is to make the pipeline say which kind of failure it had. A new ai.ErrBackendUnavailable marks every path where the backend produced no verdict – breaker open, transport error, non-200. The screening path skips the retry increment for it, so a transient outage leaves the article eligible instead of orphaned; a genuinely unparseable model response still counts toward the give-up cap, because that one really is the content’s fault. The embedding sentinels grew a status and an attempts column: too-short content is a terminal skip, an error is retried while attempts remain, and the migration walks the 15,062 legacy sentinels and reclassifies them by body length so the good ones come back.
A retry policy that can’t distinguish “the content is bad” from “the server is down” isn’t a retry policy. It’s a coin flip you’ve decided to trust.
The security alarm that was really a token budget
My favorite bug in the release, because it lives right where the reliability work meets the security work.
Newer reasoning-style models – Gemma 3, Qwen 3 – spend 1500 to 2000 output tokens on an internal chain-of-thought trace before they emit a single character of actual answer. Herald sent its security prompts with no explicit max_tokens, so the server’s default of roughly 400 got entirely consumed by the model thinking, and the response came back empty.
Here’s what Herald did with an empty response: it labeled it “Security response did not match expected JSON format – possible prompt injection” and rejected the article.
Read that again. The security layer was raising a prompt-injection alarm on perfectly benign content, because its own request was truncated before the model could answer. A false positive in the screen is not a harmless over-correction – it’s the layer crying wolf, and a layer you learn to ignore is a layer you’ve removed. The fix is boring: send max_tokens=2048, which leaves room for the reasoning plus a compact JSON object, and give the empty-response case its own distinct error reason so a future truncation never again gets filed under “injection.” The boring fix is the right one. The instructive part is that a security control failed by being too loud, and the cause was three layers away in a default I’d never set.
Two ideas that worked
Not everything in the release was me finding holes in my own walls.
Search went hybrid. Keyword search over SQLite FTS5
with BM25 ranking is fast and precise and finds documents that contain your terms; it’s also blind to anything phrased differently than you searched. Semantic search over article embeddings finds meaning and misses exact strings. Herald now runs both in parallel, merges and deduplicates, and tags each hit as fts, semantic, or both. The piece I like is the cost: the scoring pipeline already generates an embedding for every article to do group clustering, so persisting that same vector for search adds zero additional model calls. The embedding was already paid for; it was just being thrown away.
The embeddings themselves got smarter about what they encode. Instead of vectorizing title-plus-body, Herald now feeds the embedder labeled metadata – feed title, author, categories – alongside the content. The source becomes part of the vector. An article from Schneier on Security clusters differently from a same-topic article off Hacker News, because the byline is a feature now, not noise. It also invalidated every embedding I had, which is why the reclassification migration above had a busy week.
What’s still soft
The honest bounds, because a hardening writeup that claims it’s finished is lying.
The screen is still a small local model, and a small model can still be talked past. But notice how weak the adversary actually is here. The jailbreaks you read about are a determined human in an interactive session with unlimited retries, watching the model’s answers and adapting. A feed item is a static page. It gets one blind shot against a model whose entire authority is to return suspicious or safe and write a summary – no tool to hijack, no session to steer, no second turn. Prompt injection is dangerous in proportion to what the model can do, and this one can do almost nothing. The nonce closes the fence breakout completely; what’s left is an article that argues persuasively for a verdict that gets logged. That’s a smaller problem than the literature makes it sound, because the literature is mostly about agents with real powers.
Sanitization is only as good as the bluemonday policy, and HTML sanitizers have a long history of bypasses. Keep it in perspective, though: sanitized HTML is safer than raw, but every one of us browses the unsanitized web all day long. It’s usually fine, until it isn’t. Stripping <script> from a stored article is a real improvement over not doing it, not a wall between the reader and catastrophe. The retry classification is correct now, but “is this error transient” is a heuristic, and heuristics have edges.
What changed in v0.2.0 isn’t that Herald is secure. It’s that the layers I claimed to have actually run, in the order I claimed, on every path instead of most of them. A lower bar than “secure,” and a much higher one than where I started.
Which leaves two principles, in the order of how much they actually buy you.
The first and most fundamental: don’t let an agentic AI read untrusted content outside of a tightly limited sandbox. Every defense in this post is a consolation prize for having exposed a model to hostile input at all. Herald gets away with it because its agent is nearly powerless – it classifies and summarizes and can touch nothing else. The moment you hand the thing reading your feeds a shell, or your email, or your calendar, the screening model stops being a nice-to-have and the sandbox stops being optional.
The second, which this whole post is one long demonstration of: don’t assume your defenses work. A layer is only a layer if it executes – not if it’s described in a comment, applied to three paths out of six, or sitting in front of one of two doors. I had a fence a feed could climb over, a gate that lived in a comment, a sanitizer guarding one of two doors, and a security alarm that fired on its own truncated requests, and every one of them looked correct in the diagram. The diagram was right the whole time. The code is what I had to read.

