Essay

Two retrieval systems write this site

Jun 2026

The most recent content commit to this site’s repo before this post was a machine’s: a daily digest for June 10th, one markdown file with eleven items in its frontmatter and a 581-second MP3 beside it under public/media/digests/. A headless Claude session queried a Postgres database of scored feed items over MCP, wrote a newsletter and a podcast transcript, rendered the audio through OpenAI’s TTS endpoint, and committed the result locally. My contribution was reading the diff and running git push.

That split is the design of the whole site. The resources here (the curated paper libraries and thematic explorers, the daily and weekly digests with their podcast audio) are produced upstream by two systems: a literature engine over the full NASA ADS corpus that feeds the Library, and a feed-triage pipeline running on a free-tier container that feeds the Digest. Both are exposed to agents as MCP servers. The site itself is a static Astro build with no database and no server-side code: the place where their output lands after a human has looked at it.

Everything on the site publishes as a diff

Everything here is markdown and JSON checked into git. Astro content collections with Zod schemas define what a digest issue, an essay, a talk, or a paper library is allowed to look like; astro build turns the lot into static HTML; there is nothing to operate. The Library’s interactive explorers, the knowledge graph, the digest archive with its audio players: all of it is computed at build time from JSON files sitting in src/data/.

The simplicity is deliberate. When agents write content, the publication layer needs to double as a review gate, and a static site in git gets that for free, because every generated newsletter, every paper synthesis, every transcript arrives as a diff that can be read, amended, or thrown away before anything goes public. The pipelines commit; only a human pushes. There is no CMS to secure and no admin panel to forget about, which, as the next two sections show, is not a hypothetical class of problem.

The Library sits on a 32-million-paper engine

The Library section publishes curated NASA ADS paper libraries and thematic explorers: reading paths through a topic, with a per-paper synthesis. The machinery behind them is SciX, a retrieval platform I run on a single workstation (an RTX 5090 box with 1.9 TB of NVMe): the full ADS corpus, 32.4 million papers spanning 1800 to 2026, in one PostgreSQL 16 instance with pgvector, alongside 299.3 million citation edges and full-text bodies for 14.9 million of the papers. Dense embeddings (INDUS, 768 dimensions, stored as fp16 halfvec) cover the entire corpus. Retrieval is hybrid: a lexical lane over titles and abstracts, a BM25 lane over bodies, and a dense HNSW lane, fused with reciprocal rank fusion at k=60, with per-lane timings on every result. The project has paid its tuition in operational lessons along the way: a Postgres restart once truncated an UNLOGGED embeddings table and took 32 million vectors with it, and that bulk-write optimization is now banned in migration comments.
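For readers who have not met reciprocal rank fusion, here is a minimal sketch of the fusion step at k=60. It is an illustration of the general technique, not SciX's code: the lane names, the document ids, and the alphabetical tie-break are all assumptions made for the example.

```typescript
// A lane is one retriever's ranked list of document ids, best first.
// Lane names and ids below are hypothetical.
type Lane = { name: string; ranked: string[] };

// Reciprocal rank fusion: every lane adds 1 / (k + rank) to each document it
// returned, with 1-based ranks. Documents missing from a lane add nothing.
function rrfFuse(lanes: Lane[], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const lane of lanes) {
    lane.ranked.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    // Highest fused score first; ties broken by id so the order is stable
    // (the tie-break is an assumption of this sketch).
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

const fused = rrfFuse([
  { name: "lexical", ranked: ["A", "B", "C"] },
  { name: "bm25_body", ranked: ["B", "D", "A"] },
  { name: "dense", ranked: ["B", "A", "E"] },
]);
// B appears near the top of all three lanes, so it ranks first in the fusion.
```

Only ranks enter the formula, so the lanes never need their raw scores put on a common scale, which is most of the appeal when one lane is BM25 and another is cosine distance.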

Agents reach it through an MCP server with 13 tools, and that number is itself a decision: an earlier surface had 28, and the consolidation came out of a premortem finding that agent tool-selection accuracy degrades past about 15 tools. The evals have been bracing in other ways. Cross-encoder reranking, the step every retrieval tutorial says to add, measured worse on the gold set (−0.045 to −0.056 nDCG@10) and ships disabled; the hybrid fusion itself scores marginal-to-negative against dense-only on the current ground truth, which is citation-based and team-authored and therefore suspect, so a judge-validation program exists to replace it before any of those negative results get treated as final.
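To put a delta like −0.045 nDCG@10 in context, the metric can be sketched in a few lines. This is a generic version with linear gain. Whether the SciX evals use linear or exponential gain is not stated here, so treat that choice, along with the names and the toy judgments, as assumptions.

```typescript
// DCG@k with linear gain: sum of rel_i / log2(i + 1) over the top k results,
// where i is the 1-based rank. The array index is 0-based, hence index + 2.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, index) => sum + rel / Math.log2(index + 2), 0);
}

// nDCG@k divides by the DCG of the ideal ordering of the judged documents,
// so a perfect ranking scores 1. Unjudged documents count as relevance 0.
function ndcgAtK(ranked: string[], judgments: Map<string, number>, k = 10): number {
  const gains = ranked.map((id) => judgments.get(id) ?? 0);
  const ideal = Array.from(judgments.values()).sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(gains, k) / idealDcg;
}

// Toy gold judgments for one query (hypothetical ids and grades).
const gold = new Map<string, number>([["a", 3], ["b", 2], ["c", 1]]);
const perfect = ndcgAtK(["a", "b", "c"], gold);  // ideal order
const reversed = ndcgAtK(["c", "b", "a"], gold); // same documents, worst order
```

A reranker is judged by averaging this number over every query in the gold set with reranking on and off. The deltas quoted above are differences between those two averages, which is why they are only as trustworthy as the judgments in the Map.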

What the website takes from all this is mostly the synthesis layer. Each explorer paper gets a four-part write-up (plain-language abstract, motivation, methodology, and results) generated from the paper’s actual full text, fetched section by section through the MCP read_paper tool, never from the abstract alone. Of the 318 explorer papers, 305 have one; the remainder are vendor blog posts and standards documents with no full text to ground on, and those keep shorter notes rather than getting a synthesis invented from thin air. The standing rule that a field stays empty when the text doesn’t support it has mattered more than any prompt detail.

The Digest sits on a pipeline that fits in 512 MB

The Digest section, daily and weekly issues with newsletters and podcast audio, comes out of a different system: a content-intelligence pipeline deployed on Render’s free tier, where ingestion, scoring, embeddings, and agents all have to fit a 512 MB container with the Node heap capped at 460 MB.

Inoreader feeds and NASA ADS flow in on an hourly cron, resumable via continuation tokens and tracked against a daily API budget (the sync uses roughly 24–48 of its 1,000 allowed calls per day). Each item is normalized, categorized into one of about nine categories, and scored with a hybrid formula: an LLM relevance-and-usefulness judgment blended with BM25 over domain vocabulary, multiplied by watchlist and topic boosts. Everything is batched to respect the container (eight full-text fetches per run, 100 embeddings, 40 LLM scores per category), so full text sometimes arrives a day late, an accepted cost. The design bias throughout is graceful degradation: Anthropic falls back to OpenAI, one web-search provider falls back to another, missing embeddings fall back to pseudo-vectors. The system’s most instructive failure came from exactly that bias. Every fallback was individually defensible and collectively silent, so a quality regression had no alarm attached anywhere. A hardening pass this month put warnings on the quiet paths, and the same audit found expensive admin endpoints sitting on the public route list, which is the kind of thing you find when you start treating a personal tool as production.
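The difference between a silent fallback and a loud one is small in code. The sketch below shows one possible shape, not the pipeline's actual implementation: withLoudFallback, the Outcome fields, and the stubbed providers are hypothetical names, and the stubs stand in for real API clients.

```typescript
type Provider<T> = { name: string; run: () => Promise<T> };

// The outcome records who served the request and whether that was a fallback,
// so degradation travels with the result instead of vanishing.
type Outcome<T> = { value: T; servedBy: string; degraded: boolean; failures: string[] };

async function withLoudFallback<T>(
  providers: Provider<T>[],
  warn: (message: string) => void = console.warn,
): Promise<Outcome<T>> {
  const failures: string[] = [];
  for (let i = 0; i < providers.length; i++) {
    const provider = providers[i];
    try {
      const value = await provider.run();
      const degraded = i > 0; // anything but the first provider is a fallback
      if (degraded) {
        warn(`degraded: served by ${provider.name} after ${failures.join("; ")}`);
      }
      return { value, servedBy: provider.name, degraded, failures };
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      failures.push(`${provider.name}: ${reason}`);
    }
  }
  throw new Error(`all providers failed: ${failures.join("; ")}`);
}

// Stub providers: the first always fails, the second succeeds.
const warnings: string[] = [];
const demo = withLoudFallback<string>(
  [
    { name: "anthropic", run: async () => { throw new Error("rate limited"); } },
    { name: "openai", run: async () => "scored" },
  ],
  (message) => warnings.push(message),
);
```

The caller still gets a usable value, which preserves the graceful degradation, but the fallback now leaves a warning and a degraded flag behind it.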

The pipeline is exposed as an MCP server too, with search, semantic search, and aggregation over the scored items, and that surface is what the publishing step consumes.
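The score attached to each of those items, described above as an LLM judgment blended with BM25 and multiplied by boosts, could look something like the following. Only the overall shape comes from the description. Every constant, the saturation-style BM25 normalization, and the field names are assumptions for illustration, not the pipeline's real values.

```typescript
type ScoreInputs = {
  llmJudgment: number;  // LLM relevance-and-usefulness judgment, assumed in [0, 1]
  bm25: number;         // raw BM25 score against the domain vocabulary, >= 0
  onWatchlist: boolean; // item matches a watchlist entry
  topicMatches: number; // number of boosted topics the item matches
};

// Illustrative constants only; the real weights are not given in this post.
const LLM_WEIGHT = 0.7;
const BM25_SATURATION = 10;
const WATCHLIST_BOOST = 1.5;
const TOPIC_BOOST_STEP = 0.1;

function hybridScore(x: ScoreInputs): number {
  // Squash unbounded BM25 into [0, 1) so it can be blended with the LLM judgment.
  const bm25Norm = x.bm25 / (x.bm25 + BM25_SATURATION);
  const blended = LLM_WEIGHT * x.llmJudgment + (1 - LLM_WEIGHT) * bm25Norm;
  // Boosts are multiplicative, so they amplify a good score but cannot rescue a zero.
  const watchlist = x.onWatchlist ? WATCHLIST_BOOST : 1;
  const topic = 1 + TOPIC_BOOST_STEP * x.topicMatches;
  return blended * watchlist * topic;
}

const plain = hybridScore({ llmJudgment: 0.8, bm25: 10, onWatchlist: false, topicMatches: 0 });
const boosted = hybridScore({ llmJudgment: 0.8, bm25: 10, onWatchlist: true, topicMatches: 2 });
```

With these made-up constants the plain item scores 0.71 and the boosted one about 1.28, so a watchlist hit can outrank a higher-judged item that matches nothing.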

Publishing is a headless agent run, and git is the gate

The seam between those systems and this site is scripts/digest/run.sh, which invokes claude -p, a non-interactive Claude Code session, with a prompt template and the digest MCP tools on the allowlist. The prompt does the editorial work: search the last day or week of scored items, pick what clears the bar, write a newsletter and a podcast transcript grounded in the items’ full text. The same template embeds the site’s writing-voice rules with a mandatory revision pass against a catalog of AI-prose tells, which is also the process this post went through; whether it worked is the reader’s call.

From there it is plumbing. tts-render.mjs chunks the transcript through OpenAI TTS and concatenates the audio, publish-digest.mjs validates a spec JSON and writes the content-collection entry plus the MP3, and the run commits locally without pushing. A curated variant exists for hand-picked issues: items tagged in the digest UI go out through a send-to-website button that writes a handoff JSON and spawns the same pipeline constrained to exactly those items. That button-to-subprocess path got its own security pass, since “web route spawns git push” is a sentence that should make anyone nervous: every field that reaches the subprocess is now schema-validated, path containment uses path.relative instead of a prefix check that would accept sibling directories, and the route is blocked outright in production, where the website repo does not exist anyway.
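The path-containment fix is worth seeing side by side with the bug it replaces. The sketch below uses Node's path module. The function names and directory paths are hypothetical, the example assumes POSIX-style paths, and it treats only paths strictly below the root as contained.

```typescript
import * as path from "node:path";

// The buggy version: a string prefix check. "/srv/site-evil" starts with
// "/srv/site", so a sibling directory passes.
function isInsidePrefix(root: string, candidate: string): boolean {
  return path.resolve(candidate).startsWith(path.resolve(root));
}

// The path.relative version: the candidate is contained only if the relative
// path from root neither climbs out ("..") nor is absolute (which happens on
// Windows when the two paths sit on different drives).
function isInside(root: string, candidate: string): boolean {
  const rel = path.relative(path.resolve(root), path.resolve(candidate));
  const climbsOut = rel === ".." || rel.startsWith(".." + path.sep);
  return rel !== "" && !climbsOut && !path.isAbsolute(rel);
}

// Hypothetical paths for illustration.
const siteRoot = "/srv/site";
const legit = "/srv/site/src/content/digests/issue.md";
const sibling = "/srv/site-evil/issue.md";
const traversal = "/srv/site/../etc/passwd";

const results = {
  prefixAcceptsSibling: isInsidePrefix(siteRoot, sibling), // the bug
  relativeRejectsSibling: !isInside(siteRoot, sibling),
  relativeAcceptsLegit: isInside(siteRoot, legit),
  relativeRejectsTraversal: !isInside(siteRoot, traversal),
};
```

Testing for ".." followed by a separator, rather than for any leading "..", keeps a legitimate file whose name merely begins with two dots from being rejected.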

One retrieval shape recurs at three scales

Step back and the stack is the same pattern at three scales. SciX fuses lexical and dense lanes with RRF over 32 million papers and worries about six-hour index rebuilds and lock chains. The digest blends BM25 with LLM scores over tens of thousands of feed items and worries about heap. And the site’s own build step, src/lib/knowledge/build.ts, runs semantic, lexical, and graph lanes, fused with RRF, over a few hundred nodes of projects, talks, and papers to compute the related-content links, at a scale where the embeddings live in a JSON file and the whole computation happens inside astro build. Hybrid retrieval plus rank fusion turns out to be the thing I reach for at every scale; what changes is purely operational, whether an index rebuild takes six hours or six milliseconds.

The open problems are mostly about evaluation catching up with throughput. The SciX gold set needs replacing with pooled, human-anchored judgments before its negative results can be trusted; the digest has no output-quality evals at all yet, so a bad model day is currently indistinguishable from a good one; and degraded runs need to become queryable in the outputs rather than visible only in logs. The site can now publish faster than I can verify, which is an odd place for a personal website to be, and closing that gap is the next piece of work.
