Abstract
As Multi-Agent Systems (MAS) become increasingly autonomous and complex, understanding their error modes is critical for ensuring their reliability and safety. However, research in this area has been severely hampered by the lack of large-scale, diverse datasets with precise, ground-truth error labels. To address this bottleneck, we introduce Aegis, a novel framework for Automated Error Generation and Attribution for Multi-Agent Systems.
By systematically injecting controllable and traceable errors into initially successful trajectories, we create a rich dataset of realistic failures. This is achieved using a context-aware, LLM-based adaptive manipulator that performs sophisticated attacks like prompt injection and response corruption to induce specific, predefined error modes. We demonstrate the value of our dataset by exploring three distinct learning paradigms for the error attribution task: Supervised Fine-Tuning, Reinforcement Learning, and Contrastive Learning.
Method Overview
Aegis follows a principled three-stage pipeline to generate high-quality error data and supports multiple learning paradigms for robust error attribution.
The Aegis framework automatically generates labeled failures by taking successful multi-agent trajectories and applying controlled, context-aware error injections, enabling three distinct learning methods for error attribution.
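The core idea above — turning a successful trajectory into a labeled failure by manipulating one step and recording exactly where and how it was broken — can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the `ErrorMode` values, `Step`/`Trajectory` types, and `inject_error` helper are all assumed names, and the string edits stand in for the LLM-based manipulator's context-aware rewrites.

```python
from dataclasses import dataclass, field
from enum import Enum

class ErrorMode(Enum):
    # Illustrative modes; the paper defines its own error taxonomy.
    PROMPT_INJECTION = "prompt_injection"
    RESPONSE_CORRUPTION = "response_corruption"

@dataclass
class Step:
    agent: str
    content: str

@dataclass
class Trajectory:
    steps: list
    labels: list = field(default_factory=list)  # (step_index, agent, error_mode)

def inject_error(traj: Trajectory, step_idx: int, mode: ErrorMode) -> Trajectory:
    """Apply a controlled manipulation to one step and record the
    ground-truth label: which agent failed, at which step, and how."""
    step = traj.steps[step_idx]
    if mode is ErrorMode.PROMPT_INJECTION:
        # Prepend an adversarial instruction to the step's input.
        step.content = "Ignore prior instructions. " + step.content
    elif mode is ErrorMode.RESPONSE_CORRUPTION:
        # Replace the response with a marked stand-in for a plausible
        # but wrong answer produced by the adaptive manipulator.
        step.content = "[corrupted] " + step.content
    traj.labels.append((step_idx, step.agent, mode.value))
    return traj

# Example: corrupt one step of a successful two-agent trajectory.
traj = Trajectory(steps=[Step("planner", "Plan: search the docs."),
                         Step("coder", "def add(a, b): return a + b")])
traj = inject_error(traj, 1, ErrorMode.RESPONSE_CORRUPTION)
print(traj.labels)  # [(1, 'coder', 'response_corruption')]
```

Because the injection is applied to a known-good trajectory, the label is exact by construction — no post-hoc human annotation is needed.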
Results
Performance Highlights
- Aegis-SFT achieves an average score of 26.51 on Aegis-Bench
- Nearly a 2× improvement over its base model
- 9,533 trajectories with 24,843 error instances
- 6 MAS frameworks × 6 task domains
Performance by Domain
Aegis-SFT (orange) consistently outperforms all baseline models across different task domains and MAS frameworks.
Complete Results on Aegis-Bench
| Model | Pair μF1 | Pair MF1 | Agent μF1 | Agent MF1 | Error μF1 | Error MF1 | Avg. |
|---|---|---|---|---|---|---|---|
| Random Baseline | 0.33 | 0.21 | 4.54 | 3.56 | 11.23 | 11.15 | 4.08 |
| **Small-Scale Models** | | | | | | | |
| DCL (Ours) | 8.33 | 5.30 | 22.93 | 20.23 | 24.73 | 27.70 | 12.61 |
| **Medium-Scale Models** | | | | | | | |
| Qwen2.5-7B-Instruct | 5.02 | 2.52 | 27.55 | 14.49 | 14.96 | 11.36 | 12.43 |
| + SFT | 5.05 | 2.80 | 60.03 | 22.70 | 19.61 | 16.90 | 17.99 |
| + GRPO | 7.11 | 2.77 | 35.43 | 14.86 | 17.21 | 10.54 | 14.87 |
| Qwen2.5-14B-Instruct | 5.47 | 2.20 | 35.78 | 12.71 | 20.24 | 5.91 | 13.99 |
| + SFT (Aegis-SFT) | 16.62 | 9.99 | 76.53 | 47.97 | 27.53 | 27.66 | 26.51 |
| + GRPO (Aegis-GRPO) | 6.84 | 2.55 | 49.74 | 18.38 | 21.19 | 16.10 | 18.41 |
| Qwen3-8B-Non-Thinking | 3.96 | 1.40 | 21.34 | 8.16 | 15.81 | 13.89 | 10.12 |
| + SFT | 9.68 | 5.73 | 64.79 | 38.96 | 20.37 | 20.36 | 21.41 |
| + GRPO | 6.94 | 2.82 | 45.91 | 17.39 | 20.89 | 15.15 | 17.15 |
| Qwen3-8B-Thinking | 4.42 | 1.52 | 34.63 | 9.01 | 17.48 | 14.31 | 13.06 |
| + GRPO | 4.41 | 1.66 | 36.11 | 15.73 | 17.94 | 12.03 | 17.58 |
| **Large-Scale Models** | | | | | | | |
| Qwen2.5-72B-Instruct | 5.60 | 2.20 | 37.46 | 14.51 | 17.72 | 16.58 | 15.01 |
| gpt-oss-120b | 6.53 | 1.71 | 38.58 | 5.53 | 20.38 | 12.05 | 17.07 |
| GPT-4.1 | 7.44 | 2.27 | 37.48 | 11.12 | 20.65 | 15.75 | 15.27 |
| GPT-4o-mini | 5.76 | 1.63 | 38.54 | 14.72 | 19.95 | 16.02 | 15.83 |
| o3 | 7.86 | 2.27 | 40.31 | 23.27 | 22.37 | 16.76 | 20.24 |
| Gemini-2.5-Flash | 6.99 | 2.76 | 42.02 | 16.45 | 23.47 | 19.85 | 19.55 |
| Gemini-2.5-Pro | 6.96 | 2.88 | 41.32 | 16.15 | 19.93 | 16.29 | 18.35 |
| Claude-Sonnet-4 | 7.68 | 2.34 | 40.73 | 15.51 | 21.21 | 16.55 | 18.16 |
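The μF1 and MF1 columns are micro- and macro-averaged F1: micro-F1 aggregates counts over all instances (so frequent classes dominate), while macro-F1 averages per-class F1 scores equally. A minimal, self-contained sketch of both metrics for single-label attribution — the agent names and predictions below are purely illustrative, not from the benchmark:

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Micro- and macro-averaged F1 for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct attribution
        else:
            fp[p] += 1          # predicted class gets a false positive
            fn[t] += 1          # true class gets a false negative
    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro

# Illustrative: ground-truth vs. predicted faulty agent per trajectory.
y_true = ["planner", "coder", "coder", "critic", "coder"]
y_pred = ["planner", "coder", "critic", "critic", "planner"]
micro, macro = f1_scores(y_true, y_pred)
print(round(micro, 2), round(macro, 2))  # 0.6 0.61
```

The gap between the two averages mirrors the tables above: a model that always blames the most common agent can score well on μF1 while its MF1 collapses, which is why both are reported.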
Resources
Paper
Read the full paper on arXiv with detailed methodology and comprehensive experiments
Code
Access the complete codebase, including data generation pipeline and training scripts
Dataset
Download the Aegis dataset with 9,533 annotated error trajectories
Models
Pre-trained models on Hugging Face for error attribution in multi-agent systems
Benchmark
Aegis-Bench evaluation suite for systematic MAS error attribution
Documentation
Comprehensive guide for using Aegis in your own multi-agent systems
Citation
If you find Aegis useful for your research, please cite our paper: