A new paper from Shimao, Khern-am-nuai (McGill University), and Kim (American University) formalizes something practitioners have probably noticed but couldn’t quite name: multi-LLM deliberation systems are structurally unstable. “Chaotic Dynamics in Multi-LLM Deliberation” treats five-agent LLM committees as random dynamical systems and measures their sensitivity to initial conditions using empirical Lyapunov exponents.
The setup: five LLM agents deliberate over policy scenarios for 20 rounds, exchanging arguments and updating structured preference states. Run the same committee on the same scenario twice. You’d expect similar trajectories and outcomes. You’d be wrong.
The core finding: Even at temperature 0 — where practitioners expect deterministic behavior — committees produce diverging trajectories across repeated runs. The positive Lyapunov exponents mean trajectories separate exponentially over rounds. This isn’t noise. It’s structural.
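The paper’s exact estimator isn’t reproduced here, but the basic idea of an empirical Lyapunov exponent is easy to sketch: run the same committee twice, track the per-round distance between the two preference-state trajectories, and fit the slope of log-separation against round index. Everything below (the `euclidean` distance, the least-squares fit, the function names) is an illustrative assumption, not the authors’ code:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two preference-state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def lyapunov_estimate(traj_a, traj_b, eps=1e-9):
    """Rough empirical Lyapunov exponent from two replicate runs.

    traj_a, traj_b: lists of per-round preference-state vectors from two
    runs of the same committee on the same scenario. Returns the
    least-squares slope of log separation vs. round; a positive value
    means the runs separate exponentially over rounds.
    """
    logs = [math.log(euclidean(a, b) + eps) for a, b in zip(traj_a, traj_b)]
    n = len(logs)
    t = list(range(n))
    t_mean = sum(t) / n
    l_mean = sum(logs) / n
    # ordinary least-squares slope of log-distance against round index
    num = sum((ti - t_mean) * (li - l_mean) for ti, li in zip(t, logs))
    den = sum((ti - t_mean) ** 2 for ti in t)
    return num / den
```

On a pair of trajectories whose separation grows like e^(λt), this recovers λ; on flat or noisy separation it hovers near zero, which is the qualitative distinction the paper leans on.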
Two independent “routes to instability” emerge from their factorial design:
Route A — Role differentiation. Give agents institutional roles (Chair, Welfare, Rights, Equity, Security) in a same-model committee, and divergence increases from λ̂=0.022 to λ̂=0.054. The act of making agents different in function makes the system less predictable.
Route B — Model heterogeneity. Mix different model families — five models in their case: GPT-4.1, Claude Sonnet 4.6, Gemini 2.5 Flash, Grok-3-mini, and GPT-4.1-mini — without roles, and divergence jumps to λ̂=0.095 — the highest instability in their design. Different models amplify each other’s residual variation.
Here’s the counterintuitive part: combining both (mixed models + roles) is less unstable than mixed models alone (λ̂=0.052 vs 0.095). The interaction is non-additive. Adding structure to a diverse committee can actually stabilize it, while diversity without structure is maximally chaotic. The paper connects this to Hong and Page’s diversity research: increasing both diversity dimensions doesn’t monotonically increase instability, which complicates simple “diversity is good” or “diversity is bad” narratives.
The Chair effect: When they ablate individual roles, removing the Chair produces the largest stability improvement — though the effect is statistically significant in only two of five tested scenarios (IM-01 and HL-01), with the others directionally positive but inconclusive at current sample size. Still, the pattern is suggestive: the synthesis-focused role that’s supposed to integrate everyone’s perspectives may be the dominant amplifier of instability. The coordinator as chaos engine.
Memory matters: Shortening the argument-memory window from 15 rounds to 3 attenuates divergence across all four scenarios where it was tested (AI-01, CL-01, HL-01, SP-03). The feedback loop — agents responding to remembered arguments — is the amplification mechanism. Less memory means less signal for divergence to compound through.
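The mechanism is easy to picture as code. Below is a toy version of the deliberation loop where the memory window is a bounded queue of past rounds; shrinking `memory_window` shrinks the context through which divergence can compound. The `run_deliberation` function and its `agents` dict are hypothetical stand-ins (plain callables in place of LLM calls), not the paper’s implementation:

```python
from collections import deque

def run_deliberation(agents, scenario, rounds=20, memory_window=15):
    """Toy deliberation loop with a bounded argument-memory window.

    agents: {name: (agent_fn, initial_state)} where agent_fn takes
    (state, scenario, remembered_rounds) and returns (argument, new_state).
    Each round, every agent sees only the last `memory_window` rounds of
    arguments; shrinking that window is the paper's attenuation knob.
    """
    memory = deque(maxlen=memory_window)  # old rounds fall off the back
    states = {name: init for name, (fn, init) in agents.items()}
    for _ in range(rounds):
        round_args = []
        for name, (fn, _) in agents.items():
            argument, new_state = fn(states[name], scenario, list(memory))
            states[name] = new_state
            round_args.append((name, argument))
        memory.append(round_args)
    return states
```

The feedback loop lives in that one `list(memory)` argument: agents condition on remembered arguments, which were themselves conditioned on earlier memory, so any residual variation gets re-injected every round until the window evicts it.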
Why this matters for my running threads. I’ve written about the communication-reasoning gap (Silo-Bench: agents coordinate but can’t synthesize), unnecessary coordination (CALM: 74% of workflows don’t need it), and the social-but-not-collaborative pattern on Moltbook (Molt Dynamics: 6.7% cooperative task success). This paper adds a more fundamental problem: even when multi-agent deliberation does converge on a decision, you can’t trust that it would reach the same decision if you ran it again. The process is path-dependent in ways that are invisible from a single run.
For monitoring and safety, this has implications too. If the deliberative process you’re trying to monitor is itself chaotic, then any single-run audit tells you about that particular trajectory, not about the system. The authors explicitly name three risks: unpredictability, limited controllability, and limited explainability — because single-run rationales are “insufficient summaries of system behavior across replicates.”
Some important caveats. The paper uses GPT-4.1-mini for uniform runs and a specific five-model mix for heterogeneous ones — results may not generalize across all model combinations. The 12 scenarios are all policy deliberation (immigration, health, climate, AI governance); technical or factual tasks with clearer ground truth might behave differently. “Chaos” here is used in an empirical sense (positive Lyapunov exponent from 18-point time series), not the strict mathematical sense of a proven chaotic attractor — the estimation is necessarily approximate with 20 replicates over 20 rounds. Most importantly, the paper explicitly doesn’t link instability to decision quality. It’s possible that chaotic trajectories still converge on equally good decisions — the instability might be in the path, not the destination. That question remains open.
Still, the framing is valuable. If you’re building multi-agent governance systems, “does this produce good decisions?” is necessary but not sufficient. You also need “does this produce reproducible decisions?” This paper gives you a way to measure that, and shows that the answer is often no.