Projects explorer

A map of the work

Projects, the topics they touch, and the outputs they produce, drawn as a graph. Knowledge graphs are how I think about information, so here is mine. Click a node for detail, filter by type, or read the structured list below.

AI agents

Systems that plan, call tools, and act over multiple steps to accomplish a goal. The throughline of my current work at Sourcegraph and across SciX.

Projects

Agent Diagnostics — A behavioral taxonomy, annotation framework, and shareable dataset backend for analyzing why coding agents succeed or fail on benchmark tasks.
Agent Oriented Architecture Toolkit — Measures whether a repository actually works for AI coding agents by running an agent against tasks mined from its own git history and scoring what it did, instead of checking for the presence of files like AGENTS.md.
Agent Tidal Wave — A booth game for the AI World's Fair. Guess how much code AI agents are writing on GitHub, and a wave of agent-written code crashes in.
Cross-Repo Invariant Verifier — A background agent that checks organization-wide code invariants across every repository indexed by Sourcegraph, triggered by PR events and a weekly cron.
Code Intelligence Digest — Aggregates feeds and presents curated weekly and monthly digests of code intelligence, tools, and AI agents using hybrid LLM + BM25 + recency scoring.
CodeProbe — Benchmarks AI coding agents against your own codebase by mining evaluation tasks from its git history, so the suite can't be contaminated by training data.
CodeScaleBench — A benchmark suite for evaluating how AI coding agents use external context-retrieval tools on realistic developer tasks in large, enterprise-scale codebases.
EnterpriseBench — A benchmark for evaluating how well coding agents understand and navigate code across large, distributed enterprise codebases.
Coding Agent Workflows — Coding standards, agent roles, skills, and multi-step workflows that read the same whether you drive Claude Code, Codex, Amp, or anything that reads an AGENTS.md.
Gas City Dashboard — A dashboard for Gas City multi-agent orchestrations.
Gas City Packs — Reusable packs for Gas City. The PR-pipeline and Slack packs are mine.
Gas City — An orchestration-builder SDK for multi-agent coding workflows. I'm a maintainer.
Livedocs — Keeps docs in sync with code. Livedocs extracts structural claims from source into per-repo SQLite databases that AI agents query over MCP, avoiding expensive grep-and-read cycles.
mem — Build and benchmark agentic memory using a multi-agent orchestrator's own work traces as the evaluation corpus, where every unit of work carries a real lifecycle outcome and a full trace.
mcp-ax — An agentic-experience evaluation framework for MCP tools, measuring how usable they actually are for agents.
Personal Website — sjarmak.ai: an Astro static site whose content collections form a typed knowledge graph. This very project entry is one node in it.
SciX Agent — An agentic research assistant over the NASA SciX / ADS corpus, bridging AI agents with scholarly search infrastructure.
Sourcegraph GTM Assistant — A stateless MCP server on Cloud Run that gives any authenticated Sourcegraph employee, through claude.ai, one tool surface over curated per-account research (GCS corpus) and live internal data (Salesforce, Looker, PostHog, HubSpot via cost-safeguarded databot). It spans account discovery, intelligence, lead scoring, and voice-checked outreach drafting.
ToM-SWE — A theory-of-mind agent for Claude Code that learns your coding preferences, interaction style, and project patterns across sessions.

Outputs

Building a Software Factory (webinar)

Code intelligence

Understanding codebases at scale: search, navigation, and agents that reason over source. The domain of my work at Sourcegraph.

Projects

Agent Oriented Architecture Toolkit — Measures whether a repository actually works for AI coding agents by running an agent against tasks mined from its own git history and scoring what it did, instead of checking for the presence of files like AGENTS.md.
Cross-Repo Invariant Verifier — A background agent that checks organization-wide code invariants across every repository indexed by Sourcegraph, triggered by PR events and a weekly cron.
Code Intelligence Digest — Aggregates feeds and presents curated weekly and monthly digests of code intelligence, tools, and AI agents using hybrid LLM + BM25 + recency scoring.
CodeProbe — Benchmarks AI coding agents against your own codebase by mining evaluation tasks from its git history, so the suite can't be contaminated by training data.
CodeScaleBench — A benchmark suite for evaluating how AI coding agents use external context-retrieval tools on realistic developer tasks in large, enterprise-scale codebases.
EnterpriseBench — A benchmark for evaluating how well coding agents understand and navigate code across large, distributed enterprise codebases.
Coding Agent Workflows — Coding standards, agent roles, skills, and multi-step workflows that read the same whether you drive Claude Code, Codex, Amp, or anything that reads an AGENTS.md.
Gas City Dashboard — A dashboard for Gas City multi-agent orchestrations.
Gas City Packs — Reusable packs for Gas City. The PR-pipeline and Slack packs are mine.
Gas City — An orchestration-builder SDK for multi-agent coding workflows. I'm a maintainer.
Livedocs — Keeps docs in sync with code. Livedocs extracts structural claims from source into per-repo SQLite databases that AI agents query over MCP, avoiding expensive grep-and-read cycles.
Migration Evals — A tiered-oracle funnel for evaluating automated code migrations end to end (for example, Java 8 to 17 or Python 2 to 3), with a pluggable ecosystem.

Outputs

Building a Software Factory (webinar)

Evaluation & benchmarks

Measuring whether AI systems work. Evals and benchmarks of agents and agentic retrieval.

Projects

Agent Diagnostics — A behavioral taxonomy, annotation framework, and shareable dataset backend for analyzing why coding agents succeed or fail on benchmark tasks.
Agent Oriented Architecture Toolkit — Measures whether a repository actually works for AI coding agents by running an agent against tasks mined from its own git history and scoring what it did, instead of checking for the presence of files like AGENTS.md.
CodeProbe — Benchmarks AI coding agents against your own codebase by mining evaluation tasks from its git history, so the suite can't be contaminated by training data.
CodeScaleBench — A benchmark suite for evaluating how AI coding agents use external context-retrieval tools on realistic developer tasks in large, enterprise-scale codebases.
EnterpriseBench — A benchmark for evaluating how well coding agents understand and navigate code across large, distributed enterprise codebases.
GEO: Generative Engine Optimization — Measuring how LLM-powered tools discover, recommend, and describe products. GEO is the AI equivalent of SEO.
mem — Build and benchmark agentic memory using a multi-agent orchestrator's own work traces as the evaluation corpus, where every unit of work carries a real lifecycle outcome and a full trace.
mcp-ax — An agentic-experience evaluation framework for MCP tools, measuring how usable they actually are for agents.
Migration Evals — A tiered-oracle funnel for evaluating automated code migrations end to end (for example, Java 8 to 17 or Python 2 to 3), with a pluggable ecosystem.

Retrieval

Finding the right information at the right time. Embeddings, ranking, hybrid search, and retrieval-augmented systems over scientific and code corpora.

Projects

Code Intelligence Digest — Aggregates feeds and presents curated weekly and monthly digests of code intelligence, tools, and AI agents using hybrid LLM + BM25 + recency scoring.
CodeScaleBench — A benchmark suite for evaluating how AI coding agents use external context-retrieval tools on realistic developer tasks in large, enterprise-scale codebases.
GEO: Generative Engine Optimization — Measuring how LLM-powered tools discover, recommend, and describe products. GEO is the AI equivalent of SEO.
Livedocs — Keeps docs in sync with code. Livedocs extracts structural claims from source into per-repo SQLite databases that AI agents query over MCP, avoiding expensive grep-and-read cycles.
NLS Fine-tune (SciX) — Fine-tuning infrastructure for converting natural language into ADS / SciX scientific-literature search queries.
Personal Website — sjarmak.ai: an Astro static site whose content collections form a typed knowledge graph. This very project entry is one node in it.
SciX Agent — An agentic research assistant over the NASA SciX / ADS corpus, bridging AI agents with scholarly search infrastructure.