Project
CodeScaleBench
A benchmark suite for evaluating how AI coding agents use external context-retrieval tools on realistic developer tasks in large, enterprise-scale codebases.
C++, Evaluation, Retrieval
275 tasks across 20 suites covering 9 developer work types, with versioned suites, dual-verifier support, auditable result snapshots (per-task traces, scores, timing, cost), and indexes by complexity, language, repo size, and multi-repo scope.