AI agents
Systems that plan, call tools, and act over multiple steps to accomplish a goal. The throughline of my current work at Sourcegraph and across SciX.
Projects
- Agent Diagnostics — A behavioral taxonomy, annotation framework, and shareable dataset backend for analyzing why coding agents succeed or fail on benchmark tasks.
- Agent Tidal Wave — A booth game for the AI World's Fair. Guess how much code AI agents are writing on GitHub, and a wave of agent-written code crashes in.
- Cross-Repo Invariant Verifier — A background agent that checks organization-wide code invariants across every repository indexed by Sourcegraph, triggered by PR events and a weekly cron.
- Code Intelligence Digest — Aggregates feeds and presents curated weekly and monthly digests of code intelligence, tools, and AI agents using hybrid LLM + BM25 + recency scoring.
- CodeProbe — Benchmarks AI coding agents against your own codebase by mining evaluation tasks from its git history, so the suite can't be contaminated by training data.
- CodeScaleBench — A benchmark suite for evaluating how AI coding agents use external context-retrieval tools on realistic developer tasks in large, enterprise-scale codebases.
- EnterpriseBench — A benchmark for evaluating how well coding agents understand and navigate code across large, distributed enterprise codebases.
- Gas City Dashboard — A dashboard for Gas City multi-agent orchestrations.
- Gas City Packs — Reusable packs for Gas City. The PR-pipeline and Slack packs are mine.
- Coding Agent Workflows — Coding standards, agent roles, skills, and multi-step workflows that read the same whether you drive Claude Code, Codex, Amp, or anything that reads an AGENTS.md.
- Gas City — An orchestration-builder SDK for multi-agent coding workflows. I'm a maintainer.
- Livedocs — Keep docs in sync with code. Livedocs extracts structural claims from source into per-repo SQLite databases that AI agents query over MCP, avoiding expensive grep-and-read cycles.
- mem — Build and benchmark agentic memory using a multi-agent orchestrator's own work traces as the evaluation corpus, where every unit of work has a verifiable outcome.
- mcp-ax — An agentic-experience evaluation framework for MCP tools, measuring how usable they actually are for agents.
- SciX Agent — An agentic research assistant over the NASA SciX / ADS corpus, bridging AI agents with scholarly search infrastructure.
- ToM-SWE — A theory-of-mind agent for Claude Code that learns your coding preferences, interaction style, and project patterns across sessions.
Related: Retrieval, Agent memory, Evaluation & benchmarks, Code intelligence
Agent memory
How agents store, retrieve, and forget context across turns and sessions. Memory architectures, retrieval over history, and design tradeoffs.
Projects
- Literature Explorers — Curated, navigable surveys of recent research, organized into thematic maps rather than linear reading lists. Built on SciX MCP and code-intel sources.
- mem — Build and benchmark agentic memory using a multi-agent orchestrator's own work traces as the evaluation corpus, where every unit of work has a verifiable outcome.
- ToM-SWE — A theory-of-mind agent for Claude Code that learns your coding preferences, interaction style, and project patterns across sessions.
Related: AI agents, Retrieval
Code intelligence
Understanding codebases at scale: search, navigation, and agents that reason over source. The domain of my work at Sourcegraph.
Projects
- Cross-Repo Invariant Verifier — A background agent that checks organization-wide code invariants across every repository indexed by Sourcegraph, triggered by PR events and a weekly cron.
- Code Intelligence Digest — Aggregates feeds and presents curated weekly and monthly digests of code intelligence, tools, and AI agents using hybrid LLM + BM25 + recency scoring.
- CodeProbe — Benchmarks AI coding agents against your own codebase by mining evaluation tasks from its git history, so the suite can't be contaminated by training data.
- CodeScaleBench — A benchmark suite for evaluating how AI coding agents use external context-retrieval tools on realistic developer tasks in large, enterprise-scale codebases.
- EnterpriseBench — A benchmark for evaluating how well coding agents understand and navigate code across large, distributed enterprise codebases.
- Gas City Dashboard — A dashboard for Gas City multi-agent orchestrations.
- Gas City Packs — Reusable packs for Gas City. The PR-pipeline and Slack packs are mine.
- Coding Agent Workflows — Coding standards, agent roles, skills, and multi-step workflows that read the same whether you drive Claude Code, Codex, Amp, or anything that reads an AGENTS.md.
- Gas City — An orchestration-builder SDK for multi-agent coding workflows. I'm a maintainer.
- Livedocs — Keep docs in sync with code. Livedocs extracts structural claims from source into per-repo SQLite databases that AI agents query over MCP, avoiding expensive grep-and-read cycles.
- Migration Evals — A tiered-oracle funnel for evaluating automated code migrations end to end (for example, Java 8 to 17 or Python 2 to 3), with a pluggable ecosystem.
Related: AI agents, Retrieval, Evaluation & benchmarks
Evaluation & benchmarks
Measuring whether AI systems work. Benchmarks, evals, and honest comparison of search engines and agents.
Projects
- Agent Diagnostics — A behavioral taxonomy, annotation framework, and shareable dataset backend for analyzing why coding agents succeed or fail on benchmark tasks.
- CodeProbe — Benchmarks AI coding agents against your own codebase by mining evaluation tasks from its git history, so the suite can't be contaminated by training data.
- CodeScaleBench — A benchmark suite for evaluating how AI coding agents use external context-retrieval tools on realistic developer tasks in large, enterprise-scale codebases.
- EnterpriseBench — A benchmark for evaluating how well coding agents understand and navigate code across large, distributed enterprise codebases.
- GEO — Generative Engine Optimization — Measuring how LLM-powered tools discover, recommend, and describe products. GEO is the AI equivalent of SEO.
- mem — Build and benchmark agentic memory using a multi-agent orchestrator's own work traces as the evaluation corpus, where every unit of work has a verifiable outcome.
- mcp-ax — An agentic-experience evaluation framework for MCP tools, measuring how usable they actually are for agents.
- Migration Evals — A tiered-oracle funnel for evaluating automated code migrations end to end (for example, Java 8 to 17 or Python 2 to 3), with a pluggable ecosystem.
Related: AI agents, Retrieval, Code intelligence
Knowledge graphs
Entity extraction, linking, and graph-structured representations of knowledge. The method behind this site's own projects explorer.
Projects
- Literature Explorers — Curated, navigable surveys of recent research, organized into thematic maps rather than linear reading lists. Built on SciX MCP and code-intel sources.
Related: Retrieval, Scientific search
Scientific search
Discovery over the scholarly literature. NASA ADS / SciX, citation graphs, and bringing agentic and semantic methods to research workflows.
Projects
- Literature Explorers — Curated, navigable surveys of recent research, organized into thematic maps rather than linear reading lists. Built on SciX MCP and code-intel sources.
- NLS Fine-tune (SciX) — Fine-tuning infrastructure for converting natural language into ADS / SciX scientific-literature search queries.
- SciX Agent — An agentic research assistant over the NASA SciX / ADS corpus, bridging AI agents with scholarly search infrastructure.
Related: Knowledge graphs, Retrieval
Retrieval
Finding the right information at the right time. Embeddings, ranking, hybrid search, and retrieval-augmented systems over scientific and code corpora.
Projects
- Code Intelligence Digest — Aggregates feeds and presents curated weekly and monthly digests of code intelligence, tools, and AI agents using hybrid LLM + BM25 + recency scoring.
- CodeScaleBench — A benchmark suite for evaluating how AI coding agents use external context-retrieval tools on realistic developer tasks in large, enterprise-scale codebases.
- Literature Explorers — Curated, navigable surveys of recent research, organized into thematic maps rather than linear reading lists. Built on SciX MCP and code-intel sources.
- GEO — Generative Engine Optimization — Measuring how LLM-powered tools discover, recommend, and describe products. GEO is the AI equivalent of SEO.
- Livedocs — Keep docs in sync with code. Livedocs extracts structural claims from source into per-repo SQLite databases that AI agents query over MCP, avoiding expensive grep-and-read cycles.
- NLS Fine-tune (SciX) — Fine-tuning infrastructure for converting natural language into ADS / SciX scientific-literature search queries.
- SciX Agent — An agentic research assistant over the NASA SciX / ADS corpus, bridging AI agents with scholarly search infrastructure.
Related: AI agents, Agent memory, Code intelligence, Evaluation & benchmarks, Knowledge graphs, Scientific search