Open thread
exploring
How do we evaluate coding agents honestly, at scale?
A benchmark that looks impressive and measures nothing is worse than no benchmark, because now the number carries authority it never earned. Most coding-agent evaluations still test toy tasks, report one pass rate, and get cited as if they settled the question. The benchmarks I build target large, real software changes, and the writing next to them is mostly a long argument with the popular ones. Below is that work and the reading that keeps me honest about what good is supposed to mean.
The work
- CodeScaleBench (Project)
- EnterpriseBench (Project)
- Migration Evals (Project)
- Agent Diagnostics (Project)
- Rethinking coding agent benchmarks (Writing)
- I couldn't find a good enough benchmark for large-scale software development, so I built one (Writing)
- Why your coding agent keeps failing in ways you can't predict (Writing)
Reading path
A generated path through four papers, assembled using her SciX literature tools (semantic embeddings, citation graphs, and reading-order signals).
- A Survey on Large Language Models for Code Generation ↗
  The lay of the land: how code-generation models are surveyed and measured.
- Calibration and Correctness of Language Models for Code ↗
  Calibration: whether the models know when they are right, an honesty signal most benchmarks ignore.
- LLM-Assisted Code Cleaning For Training Accurate Code Generators ↗
  What goes into training shapes what the benchmark sees: data quality for code generators.
- Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey ↗
  Moving past static pass@k toward outcome-based RL signals.
Open literature
- A Survey on Large Language Models for Code Generation ↗
- Calibration and Correctness of Language Models for Code ↗
- LLM-Assisted Code Cleaning For Training Accurate Code Generators ↗
- Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey ↗
- Large Language Model-Based Agents for Software Engineering: A Survey ↗
From my ADS library
- SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? ↗
- SWE-Bench+: Enhanced Coding Benchmark for LLMs ↗
- Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL ↗
- Long Code Arena: a Set of Benchmarks for Long-Context Code Models ↗
- Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks ↗
- Deep Research: A Systematic Survey ↗
Pulled from her curated ADS library.
From my ADS library
- AlphaEvolve: A coding agent for scientific and algorithmic discovery ↗
- Memory in the Age of AI Agents ↗
- Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL ↗
- Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs ↗
- LocAgent: Graph-Guided LLM Agents for Code Localization ↗
- Issue Localization via LLM-Driven Iterative Code Graph Searching ↗
Pulled from her curated ADS library.