Working knowledge

Open threads

Each thread is a question I'm actively working on, with the writing, projects, talks, and papers that orbit it gathered in one place; the question is the subject, not me. Relationships between those pieces are computed three ways: by meaning (embeddings), by words (lexical matching), and by citation. These are the same retrieval lenses I build at SciX and Sourcegraph.

exploring

Can scientific literature be made genuinely navigable — not just searchable?

A ranked list answers one question: which documents match these keywords. It says nothing about how a field is shaped, which results build on which, where two communities are quietly disagreeing, or which corner nobody has looked at yet. The work at SciX has been about putting that structure on top of the ADS corpus, using embeddings for meaning, citation graphs for how the ideas connect, and controlled vocabularies to keep the grounding honest. What sits below is the current state of that, and the papers I read to push on it.

retrieval · embeddings · knowledge-graphs · scientific-search

5 artifacts · 5 papers · 5-step reading path

Follow the thread →

exploring

What makes multi-agent systems reliable enough to change real production code?

Most agent demos work once. Production is the opposite problem: the same task run a thousand times, with no one watching the single run that quietly corrupts a repository. The work at Sourcegraph lives underneath that, in orchestration, verification, and blast-radius control rather than in cleverer prompts. Below is what I'm building toward that reliability, alongside the empirical work on where agents actually break.
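The gap between "works once" and "works a thousand times" can be made concrete with a little arithmetic. The sketch below assumes runs are independent and uses a hypothetical per-run failure rate; neither assumption comes from measured data, and real agent failures are often correlated.

```python
def prob_at_least_one_failure(per_run_failure_rate: float, runs: int) -> float:
    """Chance that at least one of `runs` independent runs fails.

    Assumes independence between runs, which is a simplification.
    """
    if not 0.0 <= per_run_failure_rate <= 1.0:
        raise ValueError("per_run_failure_rate must be in [0, 1]")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    # P(no failures) = (1 - q)^n, so P(at least one) = 1 - (1 - q)^n.
    return 1.0 - (1.0 - per_run_failure_rate) ** runs


# Hypothetical numbers: an agent that fails once per thousand runs,
# executed a thousand times.
risk = prob_at_least_one_failure(0.001, 1000)
print(f"{risk:.3f}")
```

Under these assumptions, an agent that looks 99.9% reliable on a single run still has roughly a 63% chance of at least one bad run across a thousand, which is why per-run success alone says little about production safety.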

agents · code-intelligence · evaluation

8 artifacts · 5 papers · 5-step reading path

Follow the thread →

open

How should agents remember across long horizons?

An agent with no memory re-derives the world every turn; an agent that remembers the wrong things compounds its own mistakes instead. Most of the open work sits between those two failures: what to store, when to consolidate, and what to let the system forget. I mapped that literature into an explorer of 108 papers across nine themes and a five-part podcast, so the ideas stay learnable rather than only citable. The papers below are two recent threads I'm still pulling on.

agent-memory · agents · retrieval

5 artifacts · 2 papers · 2-step reading path

Follow the thread →

exploring

How do we evaluate coding agents honestly, at scale?

A benchmark that looks impressive and measures nothing is worse than no benchmark, because now the number carries authority it never earned. Most coding-agent evaluations still test toy tasks, report one pass rate, and get cited as if they settled the question. The benchmarks I build target large, real software changes, and the writing next to them is mostly a long argument with the popular ones. Below is that work and the reading that keeps me honest about what good is supposed to mean.
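One way to see why a single pass rate carries less authority than it appears to is to attach an uncertainty interval to it. The sketch below uses the standard Wilson score interval; the function name and the task counts are hypothetical, and it is not a description of how any particular benchmark reports results.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passes <= total:
        raise ValueError("passes must be between 0 and total")
    p = passes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# The same 70% pass rate on two hypothetical benchmark sizes.
small_low, small_high = wilson_interval(7, 10)
large_low, large_high = wilson_interval(700, 1000)
print(f"10 tasks:   {small_low:.2f} to {small_high:.2f}")
print(f"1000 tasks: {large_low:.2f} to {large_high:.2f}")
```

Both runs report "70%", but the ten-task interval spans nearly half the scale while the thousand-task interval is only a few points wide. The interval covers sampling noise only; it says nothing about whether the tasks measure the right thing.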

evaluation · agents · code-intelligence

7 artifacts · 5 papers · 4-step reading path

Follow the thread →

See the links three ways

Pick anything — a paper, an essay, a project — and watch how "related" changes depending on whether you ask by meaning, by words, or by citations.
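A minimal sketch of how the three lenses can disagree about the same pair of items. Everything here is an illustrative assumption rather than the pipeline behind this page: the two-dimensional "embeddings" are hand-written toy vectors, the lexical score is plain Jaccard overlap on whitespace tokens, and the citation score is Jaccard overlap of reference sets (bibliographic coupling).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Size of the intersection over size of the union of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_three_ways(x, y):
    """Score one pair of items by meaning, by words, and by citations."""
    return {
        "meaning": cosine(x["embedding"], y["embedding"]),
        "words": jaccard(x["text"].lower().split(), y["text"].lower().split()),
        "citations": jaccard(x["cites"], y["cites"]),
    }


# Hypothetical items: same topic, no shared vocabulary, one shared reference.
essay = {"embedding": [1.0, 0.0], "text": "agents need memory", "cites": {"p1", "p2"}}
paper = {"embedding": [1.0, 0.0], "text": "long horizon recall", "cites": {"p2", "p3"}}

scores = related_three_ways(essay, paper)
for lens, score in scores.items():
    print(f"{lens}: {score:.2f}")
```

In this toy pair the meaning lens calls the items identical, the lexical lens sees nothing in common, and the citation lens lands in between, which is the kind of disagreement the three views are meant to surface.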
