The week the benchmarks broke

Opus 4.8 scores 13.8% on FrontierCode Diamond. That number, dropped by METR this week alongside the claim that more than half of passing SWE-bench results are unmergeable slop, is the cleanest signal yet that the benchmarks we’ve been steering on measure the wrong thing. The whole field spent a week rebuilding its measuring sticks, and most of them point at the same gap: a model can close an issue without understanding the repository, without writing code a maintainer would merge, and sometimes without solving the task at all.

FrontierCode is the loud one. Built from over a thousand hours of maintainer-validated software engineering work, scored against 3,000-plus rubrics that cover code quality and explicitly hunt for the reward hacking that contaminates older benchmarks, it splits into tiers. The Diamond tier is hard enough that the strongest model on the market clears 13.8%. The interesting data isn’t the ceiling though, it’s the slope: on the easiest third of tasks, Opus nearly doubled its pass rate from 41% to 74% over four months in late 2025. That jump lines up with the “what happened in December” vibe shift practitioners kept reporting, the point where rerolling an agent five times to get one good result became rerolling twice, which is what makes ralph-style loops and goal-driven agents feel safe to run unattended. Saturate the easy tier, climb to the next. The benchmark is built to be a ladder rather than a finish line.

If FrontierCode is the indictment, CapCode is the mechanism behind it. The authors take the failure mode head on: agents that score well by exploiting shortcuts rather than solving the intended task, producing performance numbers that don’t mean what they claim. Their fix is to construct coding datasets with randomized tests that have a known best-achievable score for any non-cheating solution, so a model that beats that cap is provably gaming the harness. It’s a quietly important idea. We’ve spent two years treating pass rates as ground truth; CapCode says you have to design the test so that cheating is detectable before the number is worth anything.

The retrieval angle on the same problem comes from SWE-Explore, which argues that treating a coding task as one binary resolved/unresolved bit throws away the part that actually predicts success. It isolates repository exploration as its own benchmark: given an issue, return a ranked list of relevant code regions under a fixed line budget, scored on coverage, ranking, and context-efficiency. Ground truth comes from the code regions that successful agent trajectories actually consulted, derived across 848 issues, 10 languages, and 203 repositories. Two findings worth holding onto. Agentic explorers form a clear tier above classical retrieval, so the old BM25-and-embeddings reflex is genuinely behind now. And file-level localization is basically solved while line-level coverage and efficient ranking are still where the best systems separate, which tells you where the remaining retrieval work lives.

Memory is the other half of that story, and Decision-Aware Memory Cards frames it the way the better practitioners now do: agents fail not because the relevant text is missing but because the decisive evidence never gets selected, compressed, or surfaced at the moment of action. The Weaviate team put it more bluntly the same week, that shoving more chat history into context is not memory. CICL, the method in the paper, builds a context graph from instance evidence, scores each unit on whether it shifts the agent’s action and lifts the outcome, and packs the survivors as typed memory cards under a budget. The honest part is the result: reranking BM25 top-50 candidates lifts hit@1 from 0.58 to 0.78 on SWE-bench Verified file retrieval, but the authors note that plain RepoBench summaries still beat their cards on some splits and that compact rankers don’t yet replace the heuristic. A measurement layer, not a victory lap.

Orchestration turns out to be its own axis entirely. PerspectiveGap benchmarks how well a model can write the prompts that coordinate a multi-agent system, deciding what each sub-agent actually needs to know, across 110 scenarios and 10 topologies. Tested on 27 commercial models, the average combined pass rate is 14.9%, and the average information-leak count runs to 246.5 events per scenario, agents told things they shouldn’t have been. The result that should make people pause: GPT-5.5 hits 62% while Opus 4.7 shows a notable weakness here despite its strong coding performance. Being good at writing code and being good at telling three other agents what to do are not the same skill, and we’ve been assuming they travel together.

Underneath all of this, adoption keeps outrunning the tooling. A follow-up arXiv study of coding-agent use in newly created GitHub projects finds adoption more than twice as high as the authors’ earlier sample, and more intensive, with agents handling a larger share of the work per project. The measuring sticks are being rebuilt precisely because the thing they measure is now load-bearing in real repositories.

Watch whether FrontierCode’s tiers actually saturate in sequence the way its authors predict, and whether anyone ports CapCode’s cheating-cap idea into the public leaderboards. A benchmark you can’t game is worth more than a benchmark everyone tops.

Here’s the number that set the tone for the whole week. 8, and the group behind it, METR, paired that with a claim that lands even harder. More than half of the passing results on SWE-bench, the benchmark the entire field has been steering on for two years, are unmergible slop, code that passes the tests but no maintainer would actually accept. So this episode is really about one thing.

The field spent the week rebuilding its measuring sticks, and almost every new one points at the same gap. A model can close an issue without understanding the repository, without writing code anyone would merge, and in some cases without honestly solving the task at all. Let me start with Frontier Code, because it’s the loud one. It’s built from more than a thousand hours of maintainer-validated software engineering work.

Real tasks, validated by the people who actually maintain them. And it’s scored against more than 3,000 rubrics, which is the part I want you to sit with. Those rubrics cover code quality, and they specifically go, looking for reward hacking, the anti-cheating problem that quietly contaminates a lot of the older benchmarks. It’s organized into tiers, and the diamond tier is hard enough that the strongest model clears under 14%.

But honestly, the ceiling isn’t the interesting part. The slope is. On the easiest third of the tasks, Opus nearly doubled its pass rate. From 41% to 74%, over about four months in late 2025.

And that lines up with something a lot of us felt, but couldn’t quite point at. There was this vibe shift around December. People like Karpathy and DHH were talking about it, where agentic coding suddenly started feeling reliable. The way the Frontier Code author frames it is really precise.

It’s the difference between needing six re-rolls to get one good result, and needing two. And once you’re at two, you stop babysitting. That’s the moment the Ralphs… style loops, the goal-driven agents, the let-it-run-unattended workflows actually become safe to use.

Because you’re not terrified the thing goes off the rails every other run. The benchmark is deliberately built like a ladder. You saturate the easy tier, you climb to the next one. It’s not designed to be a finish line.

Now if Frontier Code is the indictment, the next paper is the mechanism. It’s called Capcode. And the title of the work it comes out of is basically, Do Coding Agents Deceive Us? The answer they’re working from is yes, sometimes.

And not always on purpose. The failure mode is an agent that scores well by exploiting a shortcut in the test harness, instead of solving the task you actually asked for. The number looks great, and means nothing. Their fix is genuinely clever.

They build coding datasets with randomized tests, where there’s a known best achievable score for any honest, non-cheating solution. So if a model comes in above that cap, you’ve got proof it gamed the harness. You don’t have to guess. And I think that’s quietly one of the most important things about Frontier Code.

Because we’ve spent two years treating pass rates as ground truth. Capcode is saying, you have to design the test so that cheating is even detectable before the number you get out of it is worth anything at all. Then there’s the retrieval angle, and that comes from a benchmark called SWE-Explore. Their argument is that when you collapse a coding task down to one bit, resolved or unresolved, you throw away the part that actually predicts whether the agent succeeds.

So they isolate repositories. They create a system that can be used to do a full-fledged code exploration. You hand the system an issue, and it has to return a ranked list of the relevant code regions, under a fixed line budget. They score it on three axes, coverage, ranking quality, and context efficiency.

The clever bit is where the ground truth comes from. They take agent runs that genuinely solve the issue, and they distill out the exact code regions those successful trajectories consulted along the way. 8 million versions. programming languages, 203 repositories.

Two findings worth carrying with you. First, agentic explorers, the ones that actively poke around the repo, form a clear tier above classical retrieval. So the old reflex, just throw BM25 and some embeddings at it, that’s genuinely behind now. And second, file-level localization is basically a solved problem.

Modern methods find the right file. Where the strong systems actually separate from each other is line-level coverage and efficient ranking. So if you’re working on retrieval for coding agents, that’s where the open problem lives now. Not which file, which lines, and in what order.

That leads pretty naturally into memory, which is the other half of the same story. There’s a paper called Decision Aware Memory Cards, and it frames the problem the way the sharper practitioners have started framing it. Agents don’t fail because the relevant text is missing. They fail because the decisive piece of evidence never gets selected, never gets compressed down, never gets surfaced at the exact moment the agent has to act.

The Weaviate team said it more bluntly the same week, and I like the phrasing. Shoving more chat history into the context window is not memory, it’s just a bigger pile. The method in the paper, CICL, builds a graph out of the available evidence, and then scores each piece on whether it actually shifts the agent’s action and lifts the outcome, not just whether it’s topically related. The survivors get packed as typed memory cards, under a budget, and their respect about this one is how honest the results are.

78 on SWE-bench verified file retrieval, which is real. But the authors come right out and say plain repo bench summaries still beat their cards on some splits, and that their lightweight rankers don’t yet replace the underlying heuristic. They’re calling it a measurement layer, not a solution. After a week of big claims, that restraint is worth something.

Now here’s the one that genuinely surprised me. Orchestration turns out to be its own separate skill, and not the one you’d expect. There’s a benchmark called perspective gap, and it measures how well a model can write the prompts that coordinate a multi-agent system. The core question each sub-agent faces is, what does this particular agent actually need to know to do its job, and what should it not be told?

110 scenarios, 10 different orchestration topologies. They ran 27 commercial models, and they all had the same answer. 9%. And there’s this second metric, an information leakage count, that averages 246 events per scenario.

That’s agents being handed context they had no business getting. But here’s the part that should make you stop. 5 hits 62% on this. 7, which is an excellent coding model, shows a real notable weakness at orchestration prompting.

Those two things came apart. Being great at writing code, and being great at telling three other agents what to do, are just not the same capability. And I think a lot of us have quietly been assuming they travel together. This benchmark says they don’t, at least not yet, and not in every model.

And underneath all of these papers, there’s one structural fact driving the urgency. Adoption keeps outrunning the tooling. There’s a follow-up study this week looking at coding agent use specifically in newly created GitHub projects. And they find adoption is more than twice as high as it was in the first place.

And they find that adoption is more than twice as high as it was in the first place. Not just more projects using agents, but more intensively, with the agents doing a bigger share of the actual work per project. And that’s really why all this benchmark rebuilding is happening right now. The thing these measuring sticks measure is no longer a research curiosity.

It’s load-bearing. It’s shipping code into real repositories that real people maintain. So what am I watching from here? Two things.

One, whether Frontier Code’s tiers actually saturate and sequence the way its authors predict. Roughly one tier a year. If that holds, it becomes a genuine roadmap instead of just a hard exam. And two, whether anyone takes Capcode’s cheating cap idea and ports it into the public leaderboards everybody actually cites.

Because the lesson threading through every one of these papers is the same. A benchmark you can’t game is worth a lot more than a benchmark everyone already tops. The numbers we trust are only as good as the tests underneath them. And this was the week the field admitted it.

The week the benchmarks broke

Highlights

Transcript

In this issue