Hermes Agent’s MoA presets outperform Claude Opus 4.8 and GPT-5.5 in new benchmarks


Open-source AI just sent a message to the big labs. Nous Research’s Hermes Agent, launched in February 2026, is claiming benchmark wins over two of the most prominent proprietary models in the space: Anthropic’s Claude Opus 4.8 and OpenAI’s GPT-5.5.

The vehicle for those wins is Mixture of Agents (MoA) presets, a framework feature that chains multiple language models together rather than relying on any single one. Instead of asking one very smart model a question, you ask several models, then have an aggregator layer synthesize the best answer from all of them. The result, according to Nous Research, outperforms either proprietary model working alone.

What Hermes Agent actually does differently

Hermes Agent’s Mixture of Agents approach uses an aggregator model that synthesizes responses from reference models, improving the handling of complex, multi-step tasks. Agentic workflows, the kind where an AI needs to plan, execute, course-correct, and repeat, tend to expose the weaknesses of standalone models quickly. Hermes Agent is built around exactly that use case.
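The reference-then-aggregate pattern can be illustrated with a minimal Python sketch. This is not Hermes Agent's code or API: the function names (mixture_of_agents, model_a, model_b, longest_candidate) are hypothetical, and the models are local stand-ins so the example runs without any network calls.

```python
from typing import Callable, List

# A "model" here is any function from prompt text to response text.
Model = Callable[[str], str]


def mixture_of_agents(prompt: str, references: List[Model], aggregator: Model) -> str:
    """Ask every reference model, then let the aggregator synthesize one answer."""
    drafts = [model(prompt) for model in references]
    numbered = "\n".join(f"[{i}] {draft}" for i, draft in enumerate(drafts, start=1))
    aggregator_prompt = (
        f"Question: {prompt}\n"
        f"Candidate answers:\n{numbered}\n"
        "Synthesize the best single answer."
    )
    return aggregator(aggregator_prompt)


# Hypothetical stand-ins for real reference models.
def model_a(prompt: str) -> str:
    return "Use a binary search."


def model_b(prompt: str) -> str:
    return "Sort first, then use a binary search."


def longest_candidate(aggregator_prompt: str) -> str:
    # Toy aggregator: returns the longest candidate instead of calling an LLM.
    candidates = [
        line.split("] ", 1)[1]
        for line in aggregator_prompt.splitlines()
        if line.startswith("[")
    ]
    return max(candidates, key=len)


answer = mixture_of_agents(
    "How do I find an item in a list?", [model_a, model_b], longest_candidate
)
print(answer)
```

In a real deployment, each stand-in would presumably be a call to a hosted model, and the aggregator would itself be a language model rather than a length heuristic.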

Beyond the MoA presets, Hermes Agent ships with persistent memory that maintains context across sessions, rather than starting fresh every time a conversation ends. It also includes a built-in learning loop, meaning the system can create new skills from experience over time. Tool integration rounds out the feature set, allowing the agent to connect to external systems and APIs.

Where the benchmark numbers actually stand

On SWE-bench Pro, a coding-focused evaluation that tests how well models handle real software engineering tasks, Opus 4.8 scored 69.2% against GPT-5.5’s 58.6%. Nous Research claims Hermes Agent’s MoA presets clear both of those bars. The full leaderboard, which will rank models across a broader set of performance metrics, has not yet been published.

Community benchmarking has already been active in the background. Hermes Agent has seen meaningful GitHub activity since launch, and integration with platforms like OpenRouter suggests real adoption beyond internal testing.

Mixture of Agents setups run multiple models simultaneously, so the compute cost per query is higher than for a single-model call. Nous Research’s framing nonetheless leans into cost efficiency for specialized tasks: purpose-built, task-specific deployments where you can optimize the model mix rather than paying for general-purpose capability you do not always need.
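The cost trade-off comes down to simple arithmetic, sketched below in Python. The function name moa_query_cost is hypothetical, and the per-call figures are made-up placeholders in arbitrary units, not real prices for any model.

```python
from typing import Sequence


def moa_query_cost(reference_costs: Sequence[float], aggregator_cost: float) -> float:
    """Total cost of one MoA query: every reference call plus the aggregator call."""
    return sum(reference_costs) + aggregator_cost


# Made-up per-call costs in arbitrary units (assumptions for illustration only).
single_model_call = 0.75
reference_calls = [0.25, 0.5]
aggregator_call = 0.25

moa_total = moa_query_cost(reference_calls, aggregator_call)
extra_over_single = moa_total - single_model_call
print(moa_total, extra_over_single)
```

Under these placeholder numbers the MoA query costs more than the single call, which is why the choice of model mix matters: swapping in cheaper reference models changes the sum directly.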

Disclosure: This article was edited by Editorial Team. For more information on how we create and review content, see our Editorial Policy.
