Project
Agent Diagnostics
A behavioral taxonomy, annotation framework, and shareable dataset backend for analyzing why coding agents succeed or fail on benchmark tasks.
Python, Evaluation, Agents
Pass/fail scores hide reward hacking, flawed tests, and lucky patches. Agent Diagnostics extracts structured signals from agent trajectories and classifies failure modes using a taxonomy of 40 categories across 11 dimensions. The results go into a queryable dataset covering about 12k trials, 4 models, and 61 benchmarks, so you can see what actually happened in each run rather than only whether it passed.
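To make "queryable" concrete, here is a minimal sketch of the kind of question such a dataset could answer: which labels show up on trials that nominally passed. The record layout, the field names, the dimension and category names, and the sample rows are all invented for illustration; they are assumptions, not the project's actual schema or data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One hypothetical label attached to a single agent trial."""
    trial_id: str
    model: str
    benchmark: str
    passed: bool      # the raw pass/fail score
    dimension: str    # taxonomy dimension (name invented for this sketch)
    category: str     # taxonomy category (name invented for this sketch)


def category_counts(annotations, *, passed):
    """Count categories among trials with the given pass/fail outcome."""
    return Counter(a.category for a in annotations if a.passed == passed)


# Invented sample rows, not real data from the dataset.
SAMPLE = [
    Annotation("t1", "model-a", "bench-x", True, "integrity", "reward_hacking"),
    Annotation("t2", "model-a", "bench-x", True, "outcome", "clean_solution"),
    Annotation("t3", "model-b", "bench-y", False, "planning", "wrong_file_edited"),
    Annotation("t4", "model-b", "bench-x", True, "integrity", "reward_hacking"),
]

if __name__ == "__main__":
    # Passing trials that a pass/fail score alone would count as successes.
    print(category_counts(SAMPLE, passed=True))
```

Filtering on `passed=True` is the point of the sketch: a label such as the hypothetical `reward_hacking` only becomes visible once the annotation is stored next to the score instead of being replaced by it.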