How do we evaluate coding agents honestly, at scale?

A benchmark that looks impressive and measures nothing is worse than no benchmark, because now the number carries authority it never earned. Most coding-agent evaluations still test toy tasks, report one pass rate, and get cited as if they settled the question. The benchmarks I build target large, real software changes, and the writing next to them is mostly a long argument with the popular ones. Below is that work and the reading that keeps me honest about what good is supposed to mean.

evaluation · agents · code-intelligence

The work

CodeScaleBench · Project
EnterpriseBench · Project
Migration Evals · Project
Agent Diagnostics · Project
Rethinking coding agent benchmarks · Writing
I couldn't find a good enough benchmark for large-scale software development, so I built one · Writing
Why your coding agent keeps failing in ways you can't predict · Writing

Reading path

A generated path through 4 papers, assembled using my SciX literature tools (semantic embeddings, citation graphs, and reading-order signals).

A Survey on Large Language Models for Code Generation ↗

2024 · 155 cites

The lay of the land: how code-generation models are surveyed and measured.

Calibration and Correctness of Language Models for Code ↗

2024 · 28 cites

Calibration: whether the models know when they are right, an honesty signal most benchmarks ignore.

LLM-Assisted Code Cleaning For Training Accurate Code Generators ↗

2023 · 18 cites

What goes into training shapes what the benchmark sees: data quality for code generators.

Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey ↗

2024 · 7 cites

Moving past static pass@k toward outcome-based RL signals.
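
For anyone who has not met pass@k before, here is a minimal sketch of the estimator most code-generation papers use: draw n samples per task, count the c that pass the tests, and estimate the chance that at least one of k randomly chosen samples passes. The function name `pass_at_k` and the sample counts are my own illustrative assumptions, not output from any benchmark listed on this page.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-task counts: (samples drawn, samples that passed).
results = [(10, 0), (10, 3), (10, 10)]

# Benchmark-level score: the mean of the per-task estimates.
mean_pass_at_1 = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
```

Averaging like this collapses every task into one number, so it cannot say which tasks failed or why. That is a large part of why a single pass rate is a weak summary of an agent.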

Open literature

A Survey on Large Language Models for Code Generation ↗ 2024 · 155 cites
Calibration and Correctness of Language Models for Code ↗ 2024 · 28 cites
LLM-Assisted Code Cleaning For Training Accurate Code Generators ↗ 2023 · 18 cites
Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey ↗ 2024 · 7 cites
Large Language Model-Based Agents for Software Engineering: A Survey ↗ 2024 · 40 cites

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? ↗ Miserendino, Samuel et al. · 2025 · 69 cites
SWE-Bench+: Enhanced Coding Benchmark for LLMs ↗ Aleithan, Reem et al. · 2024 · 65 cites
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL ↗ Li, Weizhen et al. · 2025 · 55 cites
Long Code Arena: a Set of Benchmarks for Long-Context Code Models ↗ Bogomolov, Egor et al. · 2024 · 55 cites
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks ↗ Cao, Yixin et al. · 2025 · 37 cites
Deep Research: A Systematic Survey ↗ Shi, Zhengliang et al. · 2025 · 33 cites

Pulled from my curated ADS library.

AlphaEvolve: A coding agent for scientific and algorithmic discovery ↗ Novikov, Alexander et al. · 2025 · 465 cites
Memory in the Age of AI Agents ↗ Hu, Yuyang et al. · 2025 · 136 cites
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL ↗ Gao, Jiaxuan et al. · 2025 · 95 cites
Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs ↗ Yang, Dayu et al. · 2025 · 33 cites
LocAgent: Graph-Guided LLM Agents for Code Localization ↗ Chen, Zhaoling et al. · 2025 · 31 cites
Issue Localization via LLM-Driven Iterative Code Graph Searching ↗ Jiang, Zhonghao et al. · 2025 · 21 cites

Pulled from my curated ADS library.
