Sonnet 4.5 is now SOTA on GAIA
Sonnet 4.5 is now SOTA on GAIA with 74.6% overall accuracy using the HAL Generalist Agent scaffold.
GAIA is a benchmark that stresses general AI assistant skills like browsing, tool use, multi-step reasoning, and multimodality.
Sonnet 4.5 beats Opus 4.1 and GPT-5 Medium while being cheaper to run.
Even more impressive is how that performance is achieved. Most models tend to shine on level 1 tasks (bread-and-butter reasoning) and fade on the harder levels. In contrast, Sonnet 4.5 stays comparatively even from level 1 to level 3:
- Level 1: 81% (bread-and-butter reasoning)
- Level 2: 72% (integrating tools, moderately complex)
- Level 3: 69% (the really hard stuff: multi-hop reasoning + tool choreography)
This suggests Sonnet 4.5 is comparatively strong at tool-conditioned reasoning and long-horizon problem solving: its accuracy drops only 12 points between the easiest and the hardest level.
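As a quick sanity check, the per-level numbers can be rolled up into an overall score. This is a minimal sketch, not the leaderboard's own method: the question counts per level (53, 86, 26) are an assumption based on the commonly cited GAIA validation split and do not appear in this post, and the helper names are made up for illustration.

```python
# Per-level accuracies quoted in the post (percent).
level_accuracy = {1: 81.0, 2: 72.0, 3: 69.0}

# ASSUMPTION: questions per level, taken from the commonly cited GAIA
# validation split (165 questions total). Not stated in the post.
questions_per_level = {1: 53, 2: 86, 3: 26}


def weighted_overall(accuracy, counts):
    """Average the per-level accuracies, weighted by question count."""
    total = sum(counts.values())
    return sum(accuracy[level] * counts[level] for level in accuracy) / total


def spread(accuracy):
    """Gap in percentage points between the best and worst level."""
    return max(accuracy.values()) - min(accuracy.values())


overall = weighted_overall(level_accuracy, questions_per_level)
print(f"weighted overall: {overall:.1f}%")                # weighted overall: 74.4%
print(f"level gap: {spread(level_accuracy):.0f} points")  # level gap: 12 points
```

Under that assumed split, the weighted average lands close to the reported 74.6%. The small difference is what you would expect from working with rounded per-level figures.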
Check out the full leaderboard:
