SearchSwarm
SearchSwarm trains a 30B-parameter agent to delegate complex research tasks to subagents, each of which gathers cited evidence in its own isolated context. The agent then synthesizes a final answer without holding the full research process in memory.
SearchSwarm is a research project about teaching AI language models to handle complex, long-running research tasks more effectively. The core idea is training a main agent to break a large question into smaller pieces and hand those pieces off to helper agents called subagents. Each subagent works in its own isolated context, gathers relevant evidence, and returns a short, citation-backed report. The main agent then combines all those reports into a final answer without needing to hold the entire research process in memory at once.
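The delegation loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation. `Report`, `run_subagent`, `orchestrate`, `toy_decompose`, and `toy_search` are all hypothetical names. Plain string joins stand in for the model calls that would do the real summarizing and synthesis. The point it shows is structural: each subagent builds its own context, only a compact cited report crosses back, and the main agent works from reports alone.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical tool signatures for this sketch.
SearchFn = Callable[[str], list[tuple[str, str]]]   # query -> [(url, snippet), ...]
DecomposeFn = Callable[[str], list[str]]            # question -> subtasks


@dataclass
class Report:
    """The short, citation-backed result a subagent hands back."""
    subtask: str
    summary: str
    citations: list[str] = field(default_factory=list)


def run_subagent(subtask: str, search: SearchFn, max_sources: int = 2) -> Report:
    # Fresh, isolated context: the subagent sees only its own subtask and its
    # own search results, never the main agent's history or other subagents.
    context: list[str] = [subtask]
    hits = search(subtask)[:max_sources]
    context.extend(snippet for _, snippet in hits)
    # Stand-in for a model summarizing the gathered evidence; only the compact
    # report is returned, and the working context is dropped with this frame.
    summary = " ".join(context[1:])
    return Report(subtask, summary, [url for url, _ in hits])


def orchestrate(question: str, decompose: DecomposeFn, search: SearchFn) -> tuple[str, list[str]]:
    # The main agent holds only the reports, not the subagents' full traces.
    reports = [run_subagent(sub, search) for sub in decompose(question)]
    # Stand-in for the main agent's synthesis step.
    answer = " | ".join(r.summary for r in reports)
    sources = [url for r in reports for url in r.citations]
    return answer, sources


# Toy tools so the sketch runs end to end.
def toy_decompose(question: str) -> list[str]:
    return [f"{question} (part A)", f"{question} (part B)"]


def toy_search(query: str) -> list[tuple[str, str]]:
    tag = "a" if query.endswith("(part A)") else "b"
    return [(f"https://example.org/{tag}", f"evidence for {tag}")]


answer, sources = orchestrate("Q", toy_decompose, toy_search)
print(answer)   # evidence for a | evidence for b
print(sources)  # ['https://example.org/a', 'https://example.org/b']
```

The memory saving comes from the return type. However long a subagent's search runs, the main agent only ever receives a `Report`.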
The project includes a training pipeline to teach the main agent when to delegate work, how to give subagents clear instructions, and how to verify what they return. High-quality training data was built from cleaned agent trajectories that show the delegation process step by step. The result is a 30-billion-parameter model called SearchSwarm-30B-A3B, which the authors report achieves strong results compared to other open-source research agents of similar size on benchmarks like BrowseComp, GAIA, and xbench-DeepSearch.
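The description does not spell out the exact cleaning criteria. The following is therefore only a hypothetical sketch of what "verifying what subagents return" and "cleaning trajectories" could look like. In this sketch, a report counts as grounded only if it cites at least one source and every citation points at a page the subagent actually retrieved. A trajectory is kept only if its final answer matches the reference, using case-insensitive exact match, and all of its delegated reports are grounded. `SubagentCall`, `Trajectory`, `report_is_grounded`, and `keep_trajectory` are illustrative names, not part of the repository.

```python
from dataclasses import dataclass


@dataclass
class SubagentCall:
    instruction: str          # what the main agent asked the subagent to do
    retrieved_urls: set[str]  # pages the subagent actually visited
    cited_urls: list[str]     # citations in the report it returned


@dataclass
class Trajectory:
    calls: list[SubagentCall]
    final_answer: str
    gold_answer: str


def report_is_grounded(call: SubagentCall) -> bool:
    # A report passes only if it cites something and every citation
    # refers to a page the subagent really retrieved.
    return bool(call.cited_urls) and all(u in call.retrieved_urls for u in call.cited_urls)


def keep_trajectory(t: Trajectory) -> bool:
    # Hypothetical cleaning rule: correct final answer (case-insensitive exact
    # match) and every delegated report grounded in retrieved evidence.
    correct = t.final_answer.strip().lower() == t.gold_answer.strip().lower()
    return correct and all(report_is_grounded(c) for c in t.calls)


good = Trajectory([SubagentCall("Find X", {"u1", "u2"}, ["u1"])], "Paris", "paris")
bad = Trajectory([SubagentCall("Find X", {"u1"}, ["u9"])], "Paris", "paris")
print([keep_trajectory(t) for t in (good, bad)])  # [True, False]
```

A real pipeline would likely use a more forgiving answer check than exact match. The grounding test, however, captures the idea of rejecting reports whose citations cannot be traced to retrieved evidence.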
The repository contains two main components: an evaluation framework and training scripts. The evaluation tool reads its settings from a configuration file and supports two inference modes: one in which the model is served by an external API-compatible endpoint, and one in which it runs locally across eight GPU servers. Benchmark datasets are not bundled with the code. Users need to obtain them from their official sources, convert them to the JSON format the tool expects, and then point the configuration at the converted files.
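The exact JSON schema the evaluation tool expects is defined by the repository and is not reproduced here. The sketch below only illustrates the general conversion step. The `id`, `question`, and `answer` keys are placeholder assumptions, as are the source column names. Before using this for real, check the repository for the actual field names and file layout.

```python
import csv
import io
import json


def convert_rows(rows, question_key, answer_key):
    """Map raw benchmark rows to records with placeholder keys (hypothetical schema)."""
    records = []
    for i, row in enumerate(rows):
        records.append({
            "id": str(i),                              # placeholder field
            "question": row[question_key].strip(),
            "answer": row[answer_key].strip(),
        })
    return records


# Toy CSV standing in for a downloaded benchmark file.
raw = io.StringIO("Question,Final answer\nWho? ,Alice\nWhere?,Paris \n")
records = convert_rows(csv.DictReader(raw), "Question", "Final answer")

# Serialize; in practice, write this to the path named in the config file.
payload = json.dumps(records, ensure_ascii=False, indent=2)
print(payload)
```

Keeping `ensure_ascii=False` preserves non-English questions, such as those in xbench-DeepSearch, as readable UTF-8 rather than escape sequences.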
Training uses ms-swift's Megatron backend. The repository offers three launch paths for multi-node setups: a Ray-based option for cloud clusters, an SSH/torchrun path for traditional clusters, and a shared-filesystem path for schedulers like Kubernetes batch jobs. A single-GPU smoke test is also provided to validate the environment before running at full scale.
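For the SSH/torchrun path, every node runs the same `torchrun` launcher with a different node rank, and all nodes agree on one rendezvous address. The sketch below builds those per-node commands using standard `torchrun` flags (`--nnodes`, `--nproc_per_node`, `--node_rank`, `--master_addr`, `--master_port`). The host names, port, and `train.py` entry point are hypothetical. The repository's own scripts may wrap ms-swift differently or set these values through environment variables, so treat this as an illustration of the launch pattern rather than the project's command line.

```python
import shlex


def torchrun_commands(hosts, gpus_per_node, entry, port=29500):
    """One `ssh <host> torchrun ...` argv per node; hosts[0] serves as the rendezvous master."""
    master = hosts[0]
    cmds = []
    for rank, host in enumerate(hosts):
        torchrun = [
            "torchrun",
            f"--nnodes={len(hosts)}",
            f"--nproc_per_node={gpus_per_node}",
            f"--node_rank={rank}",           # differs per node
            f"--master_addr={master}",       # identical on every node
            f"--master_port={port}",
            entry,                           # hypothetical training entry point
        ]
        # ssh receives the remote command as a single quoted string.
        cmds.append(["ssh", host, shlex.join(torchrun)])
    return cmds


for cmd in torchrun_commands(["node0", "node1"], 8, "train.py"):
    print(shlex.join(cmd))
```

Only `--node_rank` changes between nodes. A mismatch in any of the other values typically leaves the job hanging at rendezvous, so the single-GPU smoke test is a cheap way to rule out environment problems before debugging multi-node launch.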
This repository is primarily a research artifact for people who want to reproduce the paper's results or extend the approach. It assumes access to significant GPU resources and familiarity with large-model training infrastructure.
Where it fits
- Reproduce or extend the paper's multi-agent delegation results on research benchmarks (BrowseComp, GAIA, xbench-DeepSearch).
- Evaluate large open-source research agents served through an API-compatible endpoint or a local 8-GPU setup.
- Train custom delegation-aware agents using cleaned trajectory data and Megatron multi-node scripts.
- Explore subagent orchestration architectures for long-horizon question answering tasks.