The Knowledge Hub

Orchestration

What the 2025–2026 production data says about multi-agent coordination: which topologies survive, which fail, and what the numbers cost.

10 min read · Knowledge Hub module · by Kenny

Last reviewed June 2026

What this edition covers

  • Initial publish — June 2026 edition
  • MAST failure taxonomy (14 modes, 1,600+ traces) summarised with derived 79% figure correctly attributed
  • 2025 coordination-overhead and error-amplification data by topology documented
  • Anthropic multi-agent research system benchmarks included (90.2% gain, 15× token cost)
  • Framework landscape — LangGraph, CrewAI, OpenAI Agents SDK, Google ADK — compared on production readiness

Most multi-agent AI systems don’t fail because the underlying model isn’t capable enough. They fail because of how the agents are connected. That finding is the load-bearing result from the most rigorous public analysis of multi-agent failure modes to date — and it changes what questions a founder or operator should be asking.

This module covers what the 2025–2026 research and production data say about orchestration: how to think about topology choices, what the coordination overhead actually costs, and what running a multi-agent workflow over months requires that a single-session deployment does not.

Why most failures aren’t model failures

A research team at UC Berkeley and several collaborating institutions analysed 1,600+ annotated traces across seven popular multi-agent frameworks. They identified 14 distinct failure modes and grouped them into three categories: system design issues, inter-agent misalignment, and task verification failures.1

The distribution matters. System-design failures account for 42% of cases; inter-agent coordination failures account for 37%. Together, those two categories cover approximately 79% of failures in the dataset — derived from the MAST breakdown, not a separately stated figure. Task-verification failures — agents that declare a task complete when it is not — account for a further 23.5%; because a single trace can exhibit more than one failure mode, these categories overlap rather than partition cleanly.1

The most frequent individual failure modes are step repetition (15.7%), reasoning-action mismatch (13.2%), missing termination awareness (12.4%), and disobeying task specifications (11.8%). Step repetition and missing termination awareness together account for 28.1% of all failures, and both are design failures. An agent that lacks explicit “when to stop” criteria will keep going. An agent without duplicate-detection logic will repeat work. Neither of those is a model problem. Both are fixable at the system level.
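The two system-level fixes named above, an explicit stop rule and duplicate detection, can be sketched in a few lines. This is an illustration only, not code from the MAST study; `run_agent_loop`, `max_steps`, and the `("done", ...)` convention are hypothetical names chosen for the example.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action, argument), e.g. ("search", "pricing page")


def run_agent_loop(next_step: Callable[[List[Step]], Step], max_steps: int = 10) -> dict:
    """Run a step loop with explicit stop criteria and duplicate detection."""
    history: List[Step] = []
    seen = set()
    for _ in range(max_steps):
        step = next_step(history)
        if step[0] == "done":   # the agent signals completion explicitly
            return {"status": "done", "steps": history}
        if step in seen:        # identical step already executed: stop instead of redoing it
            return {"status": "stopped_on_repeat", "steps": history}
        seen.add(step)
        history.append(step)
    # Hard ceiling: the loop ends even if the agent never says "done".
    return {"status": "step_budget_exhausted", "steps": history}


def stuck_agent(history: List[Step]) -> Step:
    """Stand-in agent that would repeat the same search forever."""
    return ("search", "pricing page")


result = run_agent_loop(stuck_agent)
```

The stand-in agent is halted on its second identical step rather than running until the step budget is spent. In a real system the duplicate check would probably need to be fuzzier than exact tuple equality.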

The same research tested two interventions: workflow modifications (+9.4% success improvement) and enhanced verification (+15.6% success improvement). Both gains came from changes to the system around the model, not from a model upgrade, and the verification gate delivered the larger of the two. That is an empirical claim, grounded in the trace data.1
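One way to picture a verification gate is a set of named checks that an output must pass before a "complete" claim is accepted. The sketch below is an assumption about what such a gate could look like, not the intervention the study tested; the check names and the output fields (`summary`, `sources`) are hypothetical.

```python
from typing import Callable, Dict, List

Check = Callable[[dict], bool]


def verification_gate(output: dict, checks: Dict[str, Check]) -> List[str]:
    """Return the names of the checks that a claimed-complete output fails."""
    return [name for name, check in checks.items() if not check(output)]


# Hypothetical completion criteria for a research subtask.
checks: Dict[str, Check] = {
    "has_summary": lambda o: bool(o.get("summary", "").strip()),
    "cites_sources": lambda o: len(o.get("sources", [])) >= 1,
}

claimed_done = {"summary": "Competitor pricing rose 4%.", "sources": []}
failures = verification_gate(claimed_done, checks)
accepted = not failures  # the task only counts as complete when no check fails
```

Here the agent's claim is rejected because no sources were attached, and the failing check name tells the orchestrator what to send back for rework.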

What topology choice actually costs

A 2025 scaling study quantified coordination overhead and error amplification across topology types. The numbers are specific enough to be operationally useful.2

Coordination overhead, as a percentage increase over a single-agent baseline: independent parallel agents at 58%; centralised hub-and-spoke at 285%; decentralised peer topology at 263%; hybrid configurations at 515%. Error amplification relative to a single-agent baseline: centralised at 4.4×; decentralised at 7.8×; independent parallel agents at 17.2×.2

Read those figures together. Centralised hub-and-spoke has higher token overhead than decentralised (285% vs 263%), but substantially lower error amplification (4.4× vs 7.8×). Independent parallel agents have the lowest coordination overhead but the highest error amplification — because there is no correction mechanism. Errors in one agent’s output propagate unchecked.

Turn count scales as a power law: T = 2.72 × (n + 0.5)^1.724. At three agents, that means roughly 24 turns in a centralised topology versus about 5.5 for a single agent. Each additional agent adds disproportionately more coordination turns. The study found that reasoning capacity becomes prohibitively thin beyond three or four agents under fixed token budgets.2 The practical agent count for a single coordination layer is three to five, not ten or more.
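The power law above is easy to evaluate directly. The function name `expected_turns` is ours; the constants are the ones quoted from the scaling study.

```python
def expected_turns(n_agents: int) -> float:
    """Turn count from the quoted power law: T = 2.72 * (n + 0.5) ** 1.724."""
    return 2.72 * (n_agents + 0.5) ** 1.724


for n in (1, 3, 5):
    print(f"{n} agent(s): about {expected_turns(n):.1f} turns")

# Ratio of three-agent turns to single-agent turns.
ratio = expected_turns(3) / expected_turns(1)
```

Tripling the agent count from one to three more than quadruples the turn count, which is the "disproportionately more" effect in concrete terms.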

There is also a critical threshold: performance gains from additional agents disappear when single-agent baseline accuracy exceeds approximately 45%. Below that threshold, adding agents genuinely helps. Above it, coordination overhead begins consuming the gains. Before building a multi-agent system, measure what a single agent achieves on the target task. If single-agent accuracy is already adequate, a multi-agent architecture adds cost without proportionate quality improvement.2
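The measurement step can be as simple as grading a single agent on a sample of target tasks and comparing the result with the threshold. A minimal sketch, assuming pass/fail grading; the function names and the sample data are hypothetical, and the 0.45 default is the approximate figure quoted above.

```python
from typing import Sequence


def baseline_accuracy(graded: Sequence[bool]) -> float:
    """Share of sample tasks the single agent completed correctly."""
    if not graded:
        raise ValueError("need at least one graded task")
    return sum(graded) / len(graded)


def multi_agent_worth_trialling(graded: Sequence[bool], threshold: float = 0.45) -> bool:
    """True when the single-agent baseline sits below the threshold."""
    return baseline_accuracy(graded) < threshold


# Hypothetical grading of ten single-agent runs: three passes, seven failures.
graded_runs = [True, False, False, True, False, False, False, True, False, False]
decision = multi_agent_worth_trialling(graded_runs)
```

A sample of ten is far too small for a real decision; the point is only that the check costs little compared with building the multi-agent system first.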

The topology is not just a software architecture choice. It determines failure blast radius, observability, human intervention points, and operating cost — all at once.

The three topologies that survived production

Not all topology choices are equally viable for sustained operational systems. The production evidence as of 2025–2026 converges on three patterns that work, and one that consistently does not.3

Orchestrator-worker (hub-and-spoke). A single orchestrator agent receives the top-level task, decomposes it into subtasks, delegates to specialist workers, and aggregates results. Workers do not communicate directly with each other. All coordination flows through the orchestrator. This is the dominant production pattern. Anthropic’s multi-agent research system — the most extensively documented real-world implementation — uses exactly this model: a lead agent plans and spawns three to five specialist subagents in parallel, then synthesises their outputs in a separate pass. Performance gain over single-agent Opus 4 on internal evaluations: 90.2%. Token cost: approximately 15× a single-agent interaction.4 The tradeoff is explicit and quantified.

Key design decisions that make the pattern work in practice: detailed task mandates that prevent subagent scope overlap; embedded scaling rules in the orchestrator’s prompt (simple queries spawn one subagent, complex queries spawn more); clean context handoffs that pass a distilled summary rather than full conversation history when context limits approach. The Anthropic system’s synchronous execution — waiting for all subagents to complete before proceeding — is architecturally simpler but gates the system on the slowest subagent. That is a conscious design tradeoff, not an oversight.4
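The design decisions above (a scaling rule, non-overlapping mandates, and a synchronous barrier before synthesis) can be sketched with `asyncio`. This is not Anthropic's implementation. The complexity labels, the "moderate" tier, and `run_subagent` are assumptions made for the example, and the model call is replaced by a stub.

```python
import asyncio


def subagent_count(complexity: str) -> int:
    """Scaling rule: simple queries get one subagent, complex ones get more."""
    return {"simple": 1, "moderate": 3, "complex": 5}[complexity]


async def run_subagent(mandate: str) -> str:
    await asyncio.sleep(0)  # stand-in for a model call
    return f"findings for: {mandate}"


async def orchestrate(task: str, complexity: str) -> dict:
    n = subagent_count(complexity)
    # Non-overlapping mandates: each subagent owns one numbered slice of the task.
    mandates = [f"{task} (part {i + 1} of {n})" for i in range(n)]
    # Synchronous barrier: wait for every subagent before synthesising,
    # so the slowest subagent sets the pace.
    findings = await asyncio.gather(*(run_subagent(m) for m in mandates))
    return {"task": task, "findings": list(findings)}


report = asyncio.run(orchestrate("compare vendor pricing", "moderate"))
```

Workers never call each other here; everything flows back through `orchestrate`, which is the property that defines hub-and-spoke.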

Sequential workflow (assembly-line). Tasks pass through a fixed sequence of agents, each performing a specific transformation. No dynamic routing; topology is determined at design time. Simpler to test, cheaper to operate, and easier to debug. Best suited to workflows with a predictable shape: well-defined input and output types at each stage, a stable transformation sequence. In practice, most production pipelines add conditional branching at a few decision points rather than operating as a pure linear sequence — a research phase that can loop back if initial results are insufficient, for instance.
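A fixed sequence with one conditional loop-back, as in the research example above, might look like the following. The stage functions are stubs and every name is hypothetical; the bounded retry count is an assumption added so the loop cannot run forever.

```python
def research(query: str, attempt: int) -> list:
    # Stand-in: the first attempt finds too little, a retry finds more.
    return ["source A"] if attempt == 0 else ["source A", "source B", "source C"]


def draft(sources: list) -> str:
    return f"Draft based on {len(sources)} sources"


def review(text: str) -> str:
    return text + " (reviewed)"


def run_pipeline(query: str, min_sources: int = 2, max_research_attempts: int = 3) -> str:
    sources: list = []
    for attempt in range(max_research_attempts):
        sources = research(query, attempt)
        if len(sources) >= min_sources:  # the single decision point in the pipeline
            break
    # Stages after research always run in the same order.
    return review(draft(sources))


output = run_pipeline("agent orchestration costs")
```

Each stage has one input type and one output type, so each can be tested on its own.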

Bounded peer collaboration. A small number of specialist agents — typically two to four — collaborating on a shared problem with explicit scope boundaries. Unlike unconstrained swarms, agents operate within defined domains. The “bounded” constraint is the load-bearing design feature: it prevents the coordination explosion that occurs when agents can route work freely in all directions. Appropriate for collaborative analysis tasks where the orchestrator genuinely cannot know in advance which specialist should act next.

Unconstrained swarms do not work in production. Free-form swarms, where agents route work to each other based on their own inferences without a central orchestrator, appear consistently in failure taxonomies. Race conditions scale as N(N-1)/2 with N agents: three agents produce three potential concurrent interactions; ten agents produce 45. State desynchronisation is among the top six inter-agent failure modes in the MAST taxonomy. The theoretical appeal of emergent coordination has not translated to production reliability.1
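The N(N-1)/2 count is the number of distinct agent pairs, which the standard library computes directly. The function name is ours.

```python
from math import comb


def pairwise_interactions(n_agents: int) -> int:
    """Number of distinct agent pairs: N(N-1)/2."""
    return comb(n_agents, 2)


for n in (3, 5, 10):
    print(n, "agents:", pairwise_interactions(n), "potential pairwise interactions")
```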

The framework landscape in 2025–2026

Four frameworks have consolidated into the working short list. Understanding what each is built for matters more than identifying which one is “best.”

LangGraph — v1.0 in October 2025, v1.1 in December 2025 — is the highest-reliability framework for state-critical production pipelines. Its architecture: explicit state management via directed graphs, with typed schemas and reducer functions controlling how concurrent writes are merged. Checkpointing saves state at every node execution; no lost work on failure. For production, PostgreSQL-backed checkpointing is recommended — SQLite becomes a write bottleneck under concurrency. The tradeoff: steepest learning curve and most verbose setup code of the evaluated frameworks. Fully model-agnostic.5

CrewAI offers the fastest path to a working multi-agent system. Agents are defined by role, goal, and backstory; tasks are assigned to agents; a process type — sequential or hierarchical — governs execution. Minimal boilerplate. The hierarchical process instantiates a manager agent who delegates to crew members, which is a lightweight orchestrator-worker pattern inside the role-crew framing. Enterprise-grade observability and scheduling shipped in 2026. Best for teams who want to assemble functional workflows quickly without deep graph programming knowledge.5

OpenAI Agents SDK, released March 2025 as the production successor to the experimental Swarm framework, formalises two coordination modes: Agent.as_tool(), where the orchestrator calls a specialist for a bounded subtask and gets back a result; and explicit handoffs, where the active agent transfers control to a specialist who takes over entirely. Use agents-as-tools when one agent should own the final answer and outputs from multiple specialists need combining. Use handoffs when the specialist should respond directly and conversation routing is part of the workflow logic. Constraint: tightly coupled to OpenAI’s models; not model-agnostic.5
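The difference between the two modes is who owns the final answer. The sketch below deliberately does not use the Agents SDK. It is a framework-free illustration with hypothetical names (`Specialist`, `agents_as_tools`, `hand_off`) and stubbed responses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Specialist:
    name: str
    respond: Callable[[str], str]


def agents_as_tools(orchestrator: str, specialists: List[Specialist], request: str) -> str:
    """The orchestrator calls each specialist, then owns and combines the answer."""
    parts = [s.respond(request) for s in specialists]
    return f"{orchestrator}: " + " | ".join(parts)


def hand_off(specialist: Specialist, request: str) -> str:
    """Control transfers entirely: the specialist answers the user directly."""
    return f"{specialist.name}: {specialist.respond(request)}"


billing = Specialist("billing", lambda r: "refund is eligible")
shipping = Specialist("shipping", lambda r: "parcel was returned")

combined = agents_as_tools("triage", [billing, shipping], "refund for a returned order")
transferred = hand_off(billing, "refund for a returned order")
```

In the first call the reply is attributed to the triage agent; in the second the billing specialist has taken over the conversation.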

Google Agent Development Kit (ADK), released April 2025, is optimised for hierarchical routing and deep Google Cloud integration. Agents declare capabilities; LLMs route tasks based on agent descriptions. Supports sequential pipelines, parallel execution, loop-based structures, and dynamic routing. Available in Python, TypeScript, Go, and Java. Best for teams building on Google Cloud or requiring enterprise data connectors.6

There is also the direct-dispatch option: no framework, orchestration logic written directly against the model provider’s SDK. For systems whose full execution graph fits in fewer than 50 lines of control flow, direct-dispatch is simpler and cheaper to maintain than any framework. When the system needs checkpointing, dynamic routing, parallel execution management, or human-in-the-loop interrupts, a framework earns its overhead. RTSN’s own engine uses flat direct-dispatch (per ADR-001), managing state externally. That is a defensible architecture for a controlled system with explicit SOPs; it trades framework features for full control and no dependency on framework evolution.

Handoffs, shared state, and where things break

A handoff is the mechanism by which one agent transfers task ownership — and the associated context — to another agent. The handoff packet must contain everything the receiving agent needs to continue the task without re-deriving context already established. The Anthropic engineering documentation describes the practical implementation: when a subagent’s context limit approaches, it spawns a fresh subagent with clean context, passing a distilled summary rather than full conversation history. This prevents context window exhaustion from blocking long-running workflows.4
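A handoff packet of the kind described above can be modelled as a small record that is built only when the context budget is nearly used up. This is a sketch under stated assumptions: the field names, the 80% trigger, and the use of character counts as a stand-in for tokens are all ours, not taken from the Anthropic documentation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class HandoffPacket:
    task: str
    summary: str                      # distilled context, not the full history
    open_questions: List[str] = field(default_factory=list)


def maybe_handoff(
    task: str,
    history: List[str],
    summarise: Callable[[List[str]], str],
    context_limit_chars: int,
    threshold: float = 0.8,
) -> Optional[HandoffPacket]:
    """Build a packet for a fresh subagent once the context is nearly full."""
    used = sum(len(message) for message in history)
    if used < threshold * context_limit_chars:
        return None                   # plenty of room left: keep going
    return HandoffPacket(task=task, summary=summarise(history))


history = ["a" * 500, "b" * 400]
packet = maybe_handoff(
    "audit pricing", history, lambda h: f"{len(h)} messages condensed", context_limit_chars=1000
)
```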

State synchronisation failure falls under inter-agent misalignment, one of the three primary failure categories in the MAST taxonomy. The mechanism is straightforward: Agent B acts on information that Agent A has already updated. In sequential pipelines this is manageable; in parallel topologies it is a structural risk. Three approaches manage it in production: centralised typed state with reducer functions (LangGraph’s model); pure message-passing with no shared state (cleaner isolation, higher per-interaction latency); and a shared filesystem (the Anthropic model, which is simpler than typed state and adequate for sequential workflows but can produce write conflicts under high parallelism).1
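The reducer idea can be shown without any framework: each state key declares how a new write combines with the existing value, so two agents writing to the same key do not overwrite each other. This is a plain-Python sketch of the concept, not LangGraph's API; `REDUCERS` and `apply_update` are hypothetical names.

```python
import operator


def last_write_wins(old, new):
    return new


# Each key declares its merge rule up front.
REDUCERS = {
    "findings": operator.add,   # list concatenation: concurrent writes accumulate
    "status": last_write_wins,  # scalar: the newest value replaces the old one
}


def apply_update(state: dict, update: dict) -> dict:
    """Merge an agent's partial update into shared state using per-key reducers."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS[key]  # a KeyError on undeclared keys is deliberate
        merged[key] = reducer(state[key], value) if key in state else value
    return merged


state = {"findings": [], "status": "running"}
state = apply_update(state, {"findings": ["agent A: price list found"]})
state = apply_update(state, {"findings": ["agent B: changelog found"], "status": "review"})
```

Both agents' findings survive because the merge rule for that key is "append", not "replace".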

Trust boundaries matter in multi-agent systems in a way they do not in single-agent deployments. Subagents should not hold broader permissions than their defined subtask requires. An orchestrator should fail fast on subagent failure rather than cascade. Prompt injection via subagent output is a real attack surface: a malicious input to a subagent can propagate upward through handoffs if the orchestrator trusts subagent output without validation.
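The three rules above (least privilege, fail fast, validate before trusting) can be sketched as follows. All names, the tool allowlist, and the output schema are hypothetical. Shape validation of this kind narrows the attack surface but does not by itself detect injected instructions inside an otherwise well-formed summary.

```python
class SubagentFailure(Exception):
    """Raised so the orchestrator stops instead of passing bad output along."""


# Least privilege: each role lists only the tools its subtask needs.
ALLOWED_TOOLS = {"researcher": {"web_search"}, "writer": set()}


def authorise(agent: str, tool: str) -> None:
    if tool not in ALLOWED_TOOLS.get(agent, set()):
        raise PermissionError(f"{agent} may not call {tool}")


def validate_output(output: object, max_chars: int = 2000) -> dict:
    """Treat subagent output as untrusted data and check its shape before use."""
    if not isinstance(output, dict) or set(output) != {"summary", "sources"}:
        raise SubagentFailure("unexpected output shape")
    if not isinstance(output["summary"], str) or len(output["summary"]) > max_chars:
        raise SubagentFailure("summary missing or too long")
    sources = output["sources"]
    if not isinstance(sources, list) or not all(isinstance(s, str) for s in sources):
        raise SubagentFailure("sources must be a list of strings")
    return output


clean = validate_output({"summary": "Prices rose 4%.", "sources": ["vendor page"]})
```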

In 2025, three interoperability protocols emerged to address coordination across frameworks and vendors: Anthropic’s MCP (Model Context Protocol) for tool access via JSON-RPC; Google’s Agent-to-Agent (A2A) protocol for peer-to-peer task delegation using capability-based Agent Cards; IBM’s Agent Communication Protocol (ACP) for RESTful multipart messaging. A 2025 survey of four active protocols, these three among them, recommends a phased adoption path: MCP first for tool integration, then ACP or A2A for agent-to-agent coordination.7 For self-contained systems where all agents are under the same operator’s control, MCP for tool access is the relevant standard. A2A becomes relevant when coordinating with external agents built on different stacks.
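To show what "tool access via JSON-RPC" means on the wire, the sketch below builds a JSON-RPC 2.0 request in the shape MCP uses for invoking a tool (`tools/call` with a tool name and arguments). The tool name `search_docs` and its argument are hypothetical, and a real client would also handle initialisation and transport.

```python
import json

# JSON-RPC 2.0 envelope with an MCP tool invocation as the payload.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_docs",                     # hypothetical tool name
        "arguments": {"query": "checkpointing"},   # hypothetical arguments
    },
}

wire_message = json.dumps(request)
```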

What changes when the system runs over months

Research and framework documentation tend to present orchestration as a deployment decision. For AI-operated workflows that run alongside a team across months of engagement, it is also a maintenance decision. Several operational failure modes do not surface at launch.

Prompt drift. Model provider updates — which happen without breaking-version announcements — change agent behaviour without code changes. A supervisor agent whose routing prompt was calibrated against one model version will behave differently after a provider update. Production systems require regression testing at model-update points, not only at deployment.
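A regression check for routing can be a fixed set of inputs with expected destinations, rerun whenever the provider ships a model update. A minimal sketch with stubbed routers; the cases, route names, and router functions are hypothetical stand-ins for a real supervisor prompt and model call.

```python
from typing import Callable, List, Tuple

GOLDEN_CASES: List[Tuple[str, str]] = [
    ("refund request for order 1182", "billing"),
    ("parcel has not arrived", "shipping"),
    ("cancel my subscription", "billing"),
]


def routing_regressions(route: Callable[[str], str], cases: List[Tuple[str, str]]) -> list:
    """Return (input, expected, actual) for every case the router now gets wrong."""
    failures = []
    for text, expected in cases:
        actual = route(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


def router_before_update(text: str) -> str:
    return "shipping" if "parcel" in text else "billing"


def router_after_update(text: str) -> str:
    # Stand-in for drifted behaviour: cancellations now go somewhere else.
    if "cancel" in text:
        return "retention"
    return router_before_update(text)


drift = routing_regressions(router_after_update, GOLDEN_CASES)
```

The code did not change between the two routers' callers; only the behaviour behind `route` did, which is exactly what this check is meant to surface.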

State accumulation. Persistent state grows. An operational agent that accumulates checkpointed state across hundreds of workflow runs will eventually encounter performance degradation in state retrieval. Explicit state compaction — periodic summarisation of older state records — is the standard remedy.
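State compaction can be sketched as "keep the most recent records verbatim, replace everything older with one summary record". The record shape and the `summarise` callback are assumptions; in practice the summary would likely come from a model call.

```python
from typing import Callable, List


def compact(records: List[dict], keep_recent: int, summarise: Callable[[List[dict]], str]) -> List[dict]:
    """Replace all but the newest `keep_recent` records with a single summary record."""
    if len(records) <= keep_recent:
        return list(records)
    split = len(records) - keep_recent
    older, recent = records[:split], records[split:]
    summary = {"kind": "summary", "covers": len(older), "text": summarise(older)}
    return [summary] + recent


runs = [{"kind": "run", "id": i} for i in range(100)]
compacted = compact(runs, keep_recent=10, summarise=lambda rs: f"{len(rs)} earlier runs")
```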

Observability decay. Tracing and logging instrumentation degrades as systems evolve: log schemas change, new agents are added without observability wiring, trace IDs are inconsistently propagated. Production reliability data suggests that teams using comprehensive agent tracing resolve failures significantly faster than teams relying on log-based debugging alone — though the specific reduction figure from the Maxim AI analysis is operator-reported rather than peer-reviewed.8 Investing in observability infrastructure at the start, not as a retrofit, is the reliable operational pattern.
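Consistent trace ID propagation is mostly a matter of creating the ID once and passing it to everything downstream. The sketch below uses an in-memory list as the log sink; the event names and the `untraced` audit helper are hypothetical.

```python
import uuid
from typing import List


def log_event(log: List[dict], trace_id: str, agent: str, event: str) -> None:
    log.append({"trace_id": trace_id, "agent": agent, "event": event})


def run_workflow(log: List[dict]) -> str:
    trace_id = uuid.uuid4().hex  # created once, at the orchestrator
    log_event(log, trace_id, "orchestrator", "task_received")
    for worker in ("researcher", "writer"):
        # The same ID is handed to every worker, never regenerated.
        log_event(log, trace_id, worker, "subtask_started")
    return trace_id


def untraced(log: List[dict]) -> List[dict]:
    """Audit helper: events that were logged without a trace ID."""
    return [event for event in log if not event.get("trace_id")]


log: List[dict] = []
trace_id = run_workflow(log)
```

Running an audit like `untraced` in CI is one way to catch a newly added agent that was never wired into tracing.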

Human intervention points. The operational case for hub-and-spoke over swarm architectures is clearest here. A centralised orchestrator provides a predictable, consistent surface for human review. Pause an orchestrator before it delegates the next batch of subtasks; inspect its state; override its routing decision. In peer-to-peer topologies, the intervention point depends on which agent happens to be active at the moment of intervention. There is no stable state surface to examine.

For a methodology with explicit quality gates — the kind where a human needs to inspect state and approve or redirect at defined checkpoints — the orchestration topology must have reliable pause points. The hub-and-spoke pattern with checkpoint-based state persistence satisfies that requirement. Swarm topologies do not.
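A pause point of this kind can be sketched as a gate that snapshots orchestrator state, shows the snapshot to a reviewer, and dispatches either the planned batch or the reviewer's replacement. The state fields and the `review` callback are hypothetical, and the JSON string stands in for a durable checkpoint write.

```python
import json
from typing import Callable, List, Optional


def run_gate(state: dict, review: Callable[[dict], Optional[List[str]]]) -> dict:
    """Pause before delegation; the reviewer approves (None) or overrides the batch."""
    snapshot = json.dumps(state, sort_keys=True)  # stand-in for a checkpoint write
    override = review(json.loads(snapshot))       # reviewer sees a copy, not live state
    approved = state["pending"] if override is None else override
    return {**state, "pending": [], "dispatched": state["dispatched"] + approved}


state = {"pending": ["draft report", "email client"], "dispatched": ["collect data"]}

approved_state = run_gate(state, review=lambda snap: None)
redirected_state = run_gate(state, review=lambda snap: ["draft report"])
```

Because there is one orchestrator, there is one place to put this gate; a peer topology has no equivalent single location.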

Multiple independent data points converge on three to five as the practical agent count for a single coordination layer: Anthropic’s research system spawns three to five subagents per complex query; the 2025 scaling study puts the reasoning-capacity threshold at three to four under fixed token budgets; most production examples in framework documentation show three to five specialist agents per orchestrator. Larger agent counts are achievable via hierarchical topologies (multiple layers of orchestration, each layer small) rather than by expanding a single flat coordination layer.2,4

The Knowledge Hub updates monthly — each module re-checked, with what changed summarised at the top. Follow the modules you care about; skip the ones you don’t.
