Daily digest

The week the benchmarks broke

Jun 9, 2026 · 🎧 8 min

evalsagentic codinginformation retrievalagent memorymulti agent orchestration

Opus 4.8 scores 13.8% on FrontierCode Diamond, and METR says over half of passing SWE-bench results are unmergeable slop. The field spent the week rebuilding its measuring sticks: cheating-resistant evals, exploration and memory benchmarks, and the finding that orchestration is a skill distinct from coding.

Highlights

  • Opus 4.8 scores 13.8% on FrontierCode Diamond; METR says >50% of passing SWE-bench results are unmergeable slop
  • CapCode builds randomized tests with a known non-cheating score ceiling, making reward hacking provable
  • SWE-Explore: agentic explorers beat classical retrieval; file-level localization is solved, line-level coverage and ranking still separate the best systems
  • PerspectiveGap: GPT-5.5 hits 62% on orchestration prompting while Opus 4.7 is notably weak despite strong coding — orchestration is a distinct skill

Opus 4.8 scores 13.8% on FrontierCode Diamond. That number, dropped by METR this week alongside the claim that more than half of passing SWE-bench results are unmergeable slop, is the cleanest signal yet that the benchmarks we’ve been steering on measure the wrong thing. The whole field spent a week rebuilding its measuring sticks, and most of them point at the same gap: a model can close an issue without understanding the repository, without writing code a maintainer would merge, and sometimes without solving the task at all.

FrontierCode is the loud one. Built from over a thousand hours of maintainer-validated software engineering work, scored against 3,000-plus rubrics that cover code quality and explicitly hunt for the reward hacking that contaminates older benchmarks, it splits into tiers. The Diamond tier is hard enough that the strongest model on the market clears 13.8%. The interesting data isn’t the ceiling though, it’s the slope: on the easiest third of tasks, Opus nearly doubled its pass rate from 41% to 74% over four months in late 2025. That jump lines up with the “what happened in December” vibe shift practitioners kept reporting, the point where rerolling an agent five times to get one good result became rerolling twice, which is what makes ralph-style loops and goal-driven agents feel safe to run unattended. Saturate the easy tier, climb to the next. The benchmark is built to be a ladder rather than a finish line.

If FrontierCode is the indictment, CapCode is the mechanism behind it. The authors take the failure mode head on: agents that score well by exploiting shortcuts rather than solving the intended task, producing performance numbers that don’t mean what they claim. Their fix is to construct coding datasets with randomized tests that have a known best-achievable score for any non-cheating solution, so a model that beats that cap is provably gaming the harness. It’s a quietly important idea. We’ve spent two years treating pass rates as ground truth; CapCode says you have to design the test so that cheating is detectable before the number is worth anything.

The retrieval angle on the same problem comes from SWE-Explore, which argues that treating a coding task as one binary resolved/unresolved bit throws away the part that actually predicts success. It isolates repository exploration as its own benchmark: given an issue, return a ranked list of relevant code regions under a fixed line budget, scored on coverage, ranking, and context-efficiency. Ground truth comes from the code regions that successful agent trajectories actually consulted, derived across 848 issues, 10 languages, and 203 repositories. Two findings worth holding onto. Agentic explorers form a clear tier above classical retrieval, so the old BM25-and-embeddings reflex is genuinely behind now. And file-level localization is basically solved while line-level coverage and efficient ranking are still where the best systems separate, which tells you where the remaining retrieval work lives.

Memory is the other half of that story, and Decision-Aware Memory Cards frames it the way the better practitioners now do: agents fail not because the relevant text is missing but because the decisive evidence never gets selected, compressed, or surfaced at the moment of action. The Weaviate team put it more bluntly the same week, that shoving more chat history into context is not memory. CICL, the method in the paper, builds a context graph from instance evidence, scores each unit on whether it shifts the agent’s action and lifts the outcome, and packs the survivors as typed memory cards under a budget. The honest part is the result: reranking BM25 top-50 candidates lifts hit@1 from 0.58 to 0.78 on SWE-bench Verified file retrieval, but the authors note that plain RepoBench summaries still beat their cards on some splits and that compact rankers don’t yet replace the heuristic. A measurement layer, not a victory lap.

Orchestration turns out to be its own axis entirely. PerspectiveGap benchmarks how well a model can write the prompts that coordinate a multi-agent system, deciding what each sub-agent actually needs to know, across 110 scenarios and 10 topologies. Tested on 27 commercial models, the average combined pass rate is 14.9%, and the average information-leak count runs to 246.5 events per scenario, agents told things they shouldn’t have been. The result that should make people pause: GPT-5.5 hits 62% while Opus 4.7 shows a notable weakness here despite its strong coding performance. Being good at writing code and being good at telling three other agents what to do are not the same skill, and we’ve been assuming they travel together.

Underneath all of this, adoption keeps outrunning the tooling. A follow-up arXiv study of coding-agent use in newly created GitHub projects finds adoption more than twice as high as the authors’ earlier sample, and more intensive, with agents handling a larger share of the work per project. The measuring sticks are being rebuilt precisely because the thing they measure is now load-bearing in real repositories.

Watch whether FrontierCode’s tiers actually saturate in sequence the way its authors predict, and whether anyone ports CapCode’s cheating-cap idea into the public leaderboards. A benchmark you can’t game is worth more than a benchmark everyone tops.

In this issue

← All digests