Research
Digest
A running library of digest issues: newsletters and podcasts on agentic coding, evals, multi-agent orchestration, agent memory, and information retrieval. Most come out of my code-intelligence-digest pipeline; a few I curate by hand. Filter by cadence, topic, or format.
4 issues
-
The week the benchmarks broke
Opus 4.8 scores 13.8% on FrontierCode Diamond, and METR says over half of passing SWE-bench results are unmergeable slop. The field spent the week rebuilding its measuring sticks: cheating-resistant evals, exploration and memory benchmarks, and the finding that orchestration is a skill distinct from coding.
evals, agentic coding, information retrieval, agent memory, multi agent orchestration; 6 links
-
Agents Get Graded on Process, Not Just Pass/Fail
A week of instrumentation: benchmarks broke the binary resolved/unresolved score into exploration, maintainability, and handoff cost, while a Sonnet 4.6 judge that flags agents contradicting their own reasoning predicted failure 94% of the time. Memory research converged on agent-controlled storage over fixed pipelines, self-evolving agents started learning from their own traces, and multi-agent orchestration finally got a cost accounting. Adoption more than doubled in the same window.
evals, agent memory, multi agent, agentic coding, information retrieval; 16 links
-
Enhancing Developer Productivity with Google Colab CLI and Agentic Observability
Four things worth your time: Google's Colab CLI, which requests a GPU and runs scripts from the terminal; agentic observability from DevOps.com, automating asset management and root-cause triage; SWE-Marathon, an ADS benchmark of 20 long-horizon tasks averaging 27.2M tokens each; and MEnvAgent, reporting 8.6% higher success and 43% lower cost from giving coding agents verifiable environments.
developer productivity, evals, agent memory, infrastructure, knowledge bases, benchmarks, reliability; 0 links
-
Weekly: the orchestration stack consolidates
This week the multi-agent orchestration tooling started to converge on a few shared patterns: typed message contracts, deterministic fan-out, and adversarial review as a default stage. Plus a strong week for coding-agent benchmarks and a quietly important retrieval-eval release.
multi agent, agentic coding, evals, information retrieval; 4 links
These come from my research library pipeline. Subscribe via RSS →