Project
EnterpriseBench
A benchmark for evaluating how well coding agents understand and navigate code across large, distributed enterprise codebases.
Python, Evaluation, Agents
Existing benchmarks test agents in isolated, single-repo settings. EnterpriseBench tests what enterprise developers actually do: tracing dependencies across repositories, investigating incidents that span services, and producing diverse artifacts. It comprises 112 tasks across 10 task types and 7 enterprise workflow suites, all defined over real OSS codebases, and more than half of the tasks span multiple repos.