Open thread

exploring

What makes multi-agent systems reliable enough to change real production code?

Most agent demos work once. Production is the opposite problem: the same task run a thousand times, with no one watching the single run that quietly corrupts a repository. The work at Sourcegraph lives underneath that, in orchestration, verification, and blast-radius control rather than in cleverer prompts. Below is what I'm building toward that reliability, set alongside the empirical work on where agents actually break.

agents · code-intelligence · evaluation

The work

Reading path

A generated path through 5 papers, assembled using my SciX literature tools (semantic embeddings, citation graphs, and reading-order signals).

  1. Agent Design Pattern Catalogue: Architectural Patterns for Foundation Model based Agents ↗

    2024 · 9 cites

    Vocabulary first: the architectural patterns agents are built from.

  2. OpenHands: An Open Platform for AI Software Developers as Generalist Agents ↗

    2024 · 92 cites

    A concrete open platform for agents that act on real codebases.

  3. Large Language Model-Based Agents for Software Engineering: A Survey ↗

    2024 · 40 cites

    The survey view across the software-engineering lifecycle.

  4. Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study ↗

    2026

    Then the failure modes: an empirical dissection of where agentic frameworks break.

  5. Engineering an LLM-Powered Multi-agent Framework for Autonomous CloudOps ↗

    2025

    And a production deployment, where reliability stops being academic.

Open literature

From my ADS library

Coding Agents 85 papers Browse all →

Pulled from my curated ADS library.

From my Enterprise Multi-Agent Reliability

50 papers · 8 themes
  • Reliability & failure modes

    Does fan-out actually help? Mostly only when measured. Know the failure taxonomy before you scale.

  • Recovery & durable state

    Retries around side effects are transactions, not control-flow. Durable state beats conversational handoff.

  • Observability & tracing

    Transcripts are not observability. Capture a structured, replayable trace and correlate intent with action.

  • Evaluation & assurance

    Ground evaluation in execution and an honest baseline. Calibrate LLM judges. Treat eval as risk reduction, not proof.

  • Cost, routing & scheduling

    Cascades beat always-on fan-out. Record cost facts before optimizing routing; never treat unknown cost as zero.

  • Topology & coordination

    Choose topology deliberately. Prefer dynamic task graphs; keep roles in config, not code.

  • Security & governance

    Treat tool/RAG output as untrusted. Topology is an attack surface. Injection propagates across agents.

  • Human oversight & collaboration

    Keep humans on the reasoning chain. Design escalation routes that consume decisions, not idle alerts.
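
Two of the themes above are concrete enough to sketch in code: "retries around side effects are transactions" and "never treat unknown cost as zero". The following is a minimal Python sketch under stated assumptions, not the implementation behind this work. `EffectLedger`, `run_once`, `total_cost`, and the key `task-1:open-pr` are all hypothetical names made up for illustration.

```python
import json
import os
import tempfile
from typing import Callable, Optional


class EffectLedger:
    """Durable record of completed side effects, keyed by an idempotency key.

    Hypothetical helper: a JSON file stands in for whatever durable store
    a real system would use.
    """

    def __init__(self, path: str) -> None:
        self.path = path
        self._done: dict[str, str] = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self._done = json.load(f)

    def run_once(self, key: str, effect: Callable[[], str]) -> str:
        """Run `effect` at most once per key; a retry returns the recorded result."""
        if key in self._done:
            return self._done[key]
        # Caveat: a crash between running the effect and recording it would
        # still repeat the effect, so this narrows the window without closing it.
        result = effect()
        self._done[key] = result
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written ledger behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self._done, f)
        os.replace(tmp, self.path)
        return result


def total_cost(costs: list[Optional[float]]) -> Optional[float]:
    """Sum per-step costs; if any step's cost is unknown (None), so is the total."""
    if any(c is None for c in costs):
        return None
    return sum(costs)


if __name__ == "__main__":
    ledger_path = os.path.join(tempfile.mkdtemp(), "ledger.json")
    ledger = EffectLedger(ledger_path)
    # The second call is a retry: the effect does not run again.
    print(ledger.run_once("task-1:open-pr", lambda: "pr-opened"))
    print(ledger.run_once("task-1:open-pr", lambda: "pr-opened-again"))
    print(total_cost([0.25, 0.5]))
    print(total_cost([0.25, None]))
```

The ledger lives on disk rather than in the conversation, so a restarted agent reloads it and sees the same completed effects, which is the sense in which durable state beats conversational handoff. Returning `None` for an unknown total forces the caller to handle missing cost data explicitly instead of silently reading it as free.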

Explore the links