Open thread

exploring

How do we evaluate coding agents honestly, at scale?

A benchmark that looks impressive and measures nothing is worse than no benchmark, because now the number carries authority it never earned. Most coding-agent evaluations still test toy tasks, report one pass rate, and get cited as if they settled the question. The benchmarks I build target large, real software changes, and the writing next to them is mostly a long argument with the popular ones. Below is that work and the reading that keeps me honest about what "good" is supposed to mean.

evaluation · agents · code-intelligence

The work

Reading path

A generated path through 4 papers, assembled using my SciX literature tools (semantic embeddings, citation graphs, and reading-order signals).

  1. A Survey on Large Language Models for Code Generation

    2024 · 155 cites

    The lay of the land: how code-generation models are surveyed and measured.

  2. Calibration and Correctness of Language Models for Code

    2024 · 28 cites

    Calibration: whether the models know when they are right, an honesty signal most benchmarks ignore.

  3. LLM-Assisted Code Cleaning For Training Accurate Code Generators

    2023 · 18 cites

    What goes into training shapes what the benchmark sees: data quality for code generators.

  4. Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey

    2024 · 7 cites

    Moving past static pass@k toward outcome-based RL signals.
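
For reference, the static pass@k that the last entry moves away from is usually computed with the unbiased estimator popularized by the HumanEval paper (Chen et al., 2021): draw n samples per task, count the c that pass, and estimate the chance that at least one of k draws passes. The sketch below assumes that definition; the function names and the (n, c) input shape are illustrative, not taken from any paper on this list.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the task
    c: samples that passed the tests
    k: sampling budget being scored
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 whenever
    # every size-k subset must contain at least one passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over tasks; `results` holds one (n, c) pair per task (hypothetical shape)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Averaging over tasks collapses everything into the single number the opening paragraph warns about, so in practice it is worth reporting the per-task values alongside the mean.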

Open literature

From my ADS library

Benchmarks · 62 papers

Pulled from my curated ADS library.

From my ADS library

Code Generation & Retrieval · 41 papers

Pulled from my curated ADS library.
