Open thread
Exploring: What makes multi-agent systems reliable enough to change real production code?
Most agent demos work once. Production is the opposite problem: the same task run a thousand times, with no one watching the single run that quietly corrupts a repository. The work at Sourcegraph lives underneath that, in orchestration, verification, and blast-radius control rather than in cleverer prompts. Below is what I'm building toward that reliability, alongside the empirical work on where agents actually break.
The work
- Gas City (Project)
- Gas City Dashboard (Project)
- Cross-Repo Invariant Verifier (Project)
- Coding Agent Workflows (Project)
- Why "Agent Advocate" exists (Writing)
- Why your coding agent keeps failing in ways you can't predict (Writing)
- I used two multi-agent pipelines for everything I built this week. Here's what happened. (Writing)
- Building a Software Factory (Talk)
Reading path
A generated path through 5 papers, assembled using my SciX literature tools (semantic embeddings, citation graphs, and reading-order signals).
- Agent Design Pattern Catalogue: Architectural Patterns for Foundation Model based Agents ↗
  Vocabulary first: the architectural patterns agents are built from.
- OpenHands: An Open Platform for AI Software Developers as Generalist Agents ↗
  A concrete open platform for agents that act on real codebases.
- Large Language Model-Based Agents for Software Engineering: A Survey ↗
  The survey view across the software-engineering lifecycle.
- Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study ↗
  Then the failure modes: an empirical dissection of where agentic frameworks break.
- Engineering an LLM-Powered Multi-agent Framework for Autonomous CloudOps ↗
  And a production deployment, where reliability stops being academic.
Open literature
- Agent Design Pattern Catalogue: Architectural Patterns for Foundation Model based Agents ↗
- OpenHands: An Open Platform for AI Software Developers as Generalist Agents ↗
- Large Language Model-Based Agents for Software Engineering: A Survey ↗
- Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study ↗
- Engineering an LLM-Powered Multi-agent Framework for Autonomous CloudOps ↗
From my ADS library
- DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning ↗
- Kimi K2: Open Agentic Intelligence ↗
- From System 1 to System 2: A Survey of Reasoning Large Language Models ↗
- Multi-Agent Collaboration Mechanisms: A Survey of LLMs ↗
- Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions ↗
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence ↗
Pulled from my curated ADS library.
From my Enterprise Multi-Agent Reliability
- Reliability & failure modes
  Does fan-out actually help? Mostly only when measured. Know the failure taxonomy before you scale.
- Recovery & durable state
  Retries around side effects are transactions, not control-flow. Durable state beats conversational handoff.
- Observability & tracing
  Transcripts are not observability. Capture a structured, replayable trace and correlate intent with action.
- Evaluation & assurance
  Ground evaluation in execution and an honest baseline. Calibrate LLM judges. Treat eval as risk reduction, not proof.
- Cost, routing & scheduling
  Cascades beat always-on fan-out. Record cost facts before optimizing routing; never treat unknown cost as zero.
- Topology & coordination
  Choose topology deliberately. Prefer dynamic task graphs; keep roles in config, not code.
- Security & governance
  Treat tool/RAG output as untrusted. Topology is an attack surface. Injection propagates across agents.
- Human oversight & collaboration
  Keep humans on the reasoning chain. Design escalation routes that consume decisions, not idle alerts.
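To make the "Recovery & durable state" point concrete, here is a minimal sketch of what treating a retried side effect as a transaction can look like. It is one possible reading, not a description of any of the projects above. The function `run_once`, the `effects` table, and the key format are all hypothetical names chosen for illustration. The sketch assumes a single writer and a SQLite file as the durable store.

```python
import sqlite3
from typing import Callable


def run_once(db: sqlite3.Connection, key: str, effect: Callable[[], str]) -> str:
    """Run `effect` at most once per idempotency key (hypothetical helper).

    The result is recorded in a durable table, so a retry replays the
    recorded result instead of repeating the side effect.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS effects (key TEXT PRIMARY KEY, result TEXT)"
    )
    row = db.execute(
        "SELECT result FROM effects WHERE key = ?", (key,)
    ).fetchone()
    if row is not None:
        # This key already ran: return what happened then, do nothing new.
        return row[0]

    result = effect()

    # The connection context manager commits on success and rolls back
    # if the insert raises.
    with db:
        db.execute(
            "INSERT INTO effects (key, result) VALUES (?, ?)", (key, result)
        )
    return result


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # a file path would make this durable
    pushes = []

    def push_branch() -> str:
        # Stand-in for a real side effect such as pushing a commit.
        pushes.append("pushed")
        return "commit-abc"

    print(run_once(conn, "task-1:push", push_branch))
    print(run_once(conn, "task-1:push", push_branch))  # retry, no second push
    print(len(pushes))
```

A known gap in this sketch: if the process dies after `effect()` returns but before the insert commits, the effect has happened and is not recorded. Closing that gap needs either an effect that is itself idempotent or a pending record written before the effect and reconciled on restart.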