What makes multi-agent systems reliable enough to change real production code?

Most agent demos work once. Production is the opposite problem: the same task runs a thousand times, with no one watching the single run that quietly corrupts a repository. The work at Sourcegraph lives underneath that, in orchestration, verification, and blast-radius control rather than in cleverer prompts. Below is what I'm building toward that reliability, set alongside the empirical work on where agents actually break.

agents · code-intelligence · evaluation

The work

Gas City Project
Gas City Dashboard Project
Cross-Repo Invariant Verifier Project
Coding Agent Workflows Project
Why "Agent Advocate" exists Writing
Why your coding agent keeps failing in ways you can't predict Writing
I used two multi-agent pipelines for everything I built this week. Here's what happened. Writing
Building a Software Factory Talk

Reading path

A generated path through 5 papers, assembled using my SciX literature tools (semantic embeddings, citation graphs, and reading-order signals).

Agent Design Pattern Catalogue: Architectural Patterns for Foundation Model based Agents ↗

2024 · 9 cites

Vocabulary first: the architectural patterns agents are built from.
OpenHands: An Open Platform for AI Software Developers as Generalist Agents ↗

2024 · 92 cites

A concrete open platform for agents that act on real codebases.
Large Language Model-Based Agents for Software Engineering: A Survey ↗

2024 · 40 cites

The survey view across the software-engineering lifecycle.
Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study ↗

2026

Then the failure modes: an empirical dissection of where agentic frameworks break.
Engineering an LLM-Powered Multi-agent Framework for Autonomous CloudOps ↗

2025

And a production deployment, where reliability stops being academic.

Open literature

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning ↗ Guo, Daya et al. · 2025 · 10464 cites
Kimi K2: Open Agentic Intelligence ↗ Kimi Team et al. · 2025 · 911 cites
From System 1 to System 2: A Survey of Reasoning Large Language Models ↗ Li, Zhong-Zhi et al. · 2025 · 334 cites
Multi-Agent Collaboration Mechanisms: A Survey of LLMs ↗ Tran, Khanh-Tung et al. · 2025 · 332 cites
Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions ↗ Hou, Xinyi et al. · 2025 · 287 cites
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence ↗ Gao, Huan-ang et al. · 2025 · 158 cites

Pulled from my curated ADS library.

From my Enterprise Multi-Agent Reliability

50 papers · 8 themes

Open explorer ↗ On the site

Reliability & failure modes · Recovery & durable state · Observability & tracing · Evaluation & assurance · Cost, routing & scheduling · Topology & coordination · Security & governance · Human oversight & collaboration

Reliability & failure modes

Does fan-out actually help? Usually only where you have measured that it does. Know the failure taxonomy before you scale.
Recovery & durable state

Retries around side effects are transactions, not control flow. Durable state beats conversational handoff.
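
One way to read the retry point is as an idempotency record kept in durable storage. The sketch below is illustrative only: `run_once`, the `effects` table, and the choice of SQLite are assumptions for the example, not something taken from the projects listed above.

```python
import json
import sqlite3


def run_once(db: sqlite3.Connection, key: str, effect):
    """Run `effect` at most once per idempotency key (hypothetical helper).

    A retry with the same key replays the stored result instead of
    repeating the side effect.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS effects "
        "(key TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    row = db.execute(
        "SELECT result FROM effects WHERE key = ?", (key,)
    ).fetchone()
    if row is not None:
        # Already applied in an earlier attempt: return the recorded result.
        return json.loads(row[0])

    result = effect()
    # Assumption: `effect` returns JSON-serializable data.
    with db:  # commits on success, rolls back on exception
        db.execute(
            "INSERT INTO effects (key, result) VALUES (?, ?)",
            (key, json.dumps(result)),
        )
    return result
```

This still leaves a window: if the process dies after `effect()` returns but before the record commits, a retry runs the effect again. Closing that gap usually means passing the same key to the downstream system so that it can deduplicate too.
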
Observability & tracing

Transcripts are not observability. Capture a structured, replayable trace and correlate intent with action.
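
As a rough illustration of what a structured, replayable trace could look like, here is a minimal sketch. The `TraceEvent` fields, the JSON Lines encoding, and the `mismatches` check are all assumptions made for the example, not a description of an existing schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TraceEvent:
    """One step of an agent run (hypothetical schema)."""
    run_id: str
    step: int
    agent: str
    intent: str   # what the agent said it was about to do
    action: str   # the tool call it actually made
    args: dict
    outcome: str


def to_jsonl(events):
    """Serialize events as JSON Lines, one event per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


def from_jsonl(text):
    """Rebuild the events from JSON Lines so a run can be replayed."""
    return [TraceEvent(**json.loads(line)) for line in text.splitlines() if line]


def mismatches(events, allowed):
    """Return events whose action is not allowed for the stated intent."""
    return [e for e in events if e.action not in allowed.get(e.intent, set())]
```

With a record like this, "the agent said it would read a file and then pushed to main" becomes a query over the trace rather than something found by rereading a transcript.
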
Evaluation & assurance

Ground evaluation in execution and an honest baseline. Calibrate LLM judges. Treat eval as risk reduction, not proof.
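
One common way to calibrate an LLM judge is to compare its verdicts against human labels on the same items and correct for chance agreement. The sketch below uses Cohen's kappa for that. The function name and the label values are illustrative, and kappa is one possible calibration statistic, not necessarily the one used in the work above.

```python
from collections import Counter


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(judge)
    # Fraction of items where the judge and the human gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(
        judge_counts[label] * human_counts[label]
        for label in set(judge_counts) | set(human_counts)
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A judge that passes everything can still agree with humans on half the items and score zero here, which is the kind of gap raw agreement hides.
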
Cost, routing & scheduling

Cascades beat always-on fan-out. Record cost facts before optimizing routing; never treat unknown cost as zero.
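
A minimal sketch of a cascade that keeps unknown cost distinct from zero cost. The names (`Attempt`, `cascade`, `total_cost`) and the shape of a tier, a `(name, callable)` pair returning `(answer, cost_or_None)`, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    model: str
    answer: str
    cost_usd: Optional[float]  # None means "not reported", never 0.0


def cascade(task, tiers, accept):
    """Try tiers in order (cheapest first); stop at the first accepted answer."""
    attempts = []
    for name, call in tiers:
        answer, cost = call(task)
        attempts.append(Attempt(name, answer, cost))
        if accept(answer):
            break
    return attempts


def total_cost(attempts):
    """Return (sum of known costs, number of attempts with unknown cost)."""
    known = sum(a.cost_usd for a in attempts if a.cost_usd is not None)
    unknown = sum(1 for a in attempts if a.cost_usd is None)
    return known, unknown
```

Reporting the unknown count next to the known total keeps a missing cost fact visible, so a routing decision cannot silently treat an unmetered model as free.
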
Topology & coordination

Choose topology deliberately. Prefer dynamic task graphs; keep roles in config, not code.
Security & governance

Treat tool/RAG output as untrusted. Topology is an attack surface. Injection propagates across agents.
Human oversight & collaboration

Keep humans on the reasoning chain. Design escalation routes that consume decisions, not idle alerts.
