Performance Data — Updated June 2026
Sakana Fugu Benchmarks: How It Stacks Up Against Frontier Models
Detailed Sakana Fugu benchmark results across engineering, coding, science, reasoning, and mathematics — compared head-to-head against GPT-5.5, Claude Opus 4.8, and Gemini 3.1 Pro.
Full Sakana Fugu Benchmark Table
Sakana Fugu and Sakana Fugu Ultra scores compared against the three leading frontier models. Higher scores indicate better performance on each benchmark.
| Benchmark | Category | Fugu | Fugu Ultra | GPT-5.5 | Claude Opus 4.8 | Gemini 3.1 Pro |
|---|---|---|---|---|---|---|
| SWE-Bench Pro | Engineering | 59.0 | 73.7 | 69.2 | 62.5 | 54.2 |
| LiveCodeBench | Coding | 92.9 | 93.2 | 88.5 | 85.3 | 86.1 |
| GPQA-D | Science | 92.0 | 95.5 | 92.0 | 94.3 | 92.8 |
| Humanity's Last Exam | Reasoning | 47.2 | 50.0 | 49.8 | 41.4 | 43.6 |
| MATH-500 | Mathematics | 98.6 | 99.0 | 97.8 | 96.4 | 97.2 |
| AIME 2025 | Mathematics | 86.7 | 90.0 | 86.7 | 83.3 | 85.0 |
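To make the comparison concrete, the short script below recomputes, for each row of the table above, how far Fugu Ultra sits above the best of the three single models. The scores are transcribed from the table; the names `SCORES` and `ultra_margins` are illustrative and not part of any Sakana tooling. On these figures the margin ranges from 0.2 points (Humanity's Last Exam) to 4.7 points (LiveCodeBench).

```python
# Scores transcribed from the benchmark table above.
# Column order: Fugu, Fugu Ultra, GPT-5.5, Claude Opus 4.8, Gemini 3.1 Pro
SCORES = {
    "SWE-Bench Pro": (59.0, 73.7, 69.2, 62.5, 54.2),
    "LiveCodeBench": (92.9, 93.2, 88.5, 85.3, 86.1),
    "GPQA-D": (92.0, 95.5, 92.0, 94.3, 92.8),
    "Humanity's Last Exam": (47.2, 50.0, 49.8, 41.4, 43.6),
    "MATH-500": (98.6, 99.0, 97.8, 96.4, 97.2),
    "AIME 2025": (86.7, 90.0, 86.7, 83.3, 85.0),
}


def ultra_margins(scores):
    """Return {benchmark: Fugu Ultra score minus the best single-model score}."""
    margins = {}
    for name, (_fugu, ultra, *singles) in scores.items():
        # Round to one decimal to match the table's precision and
        # suppress floating-point noise from the subtraction.
        margins[name] = round(ultra - max(singles), 1)
    return margins


if __name__ == "__main__":
    for name, margin in ultra_margins(SCORES).items():
        print(f"{name}: {margin:+.1f} points")
```

The same loop can be pointed at the base Fugu column (the first element of each tuple) to see where the smaller configuration leads and where it trails.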
Why Sakana Fugu Outperforms Single Models
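The key idea behind Sakana Fugu's benchmark results is specialization through coordination. No single frontier model excels at everything: GPT-5.5 is strong at code generation, Claude Opus excels at long-context reasoning, and Gemini leads on multimodal tasks. Sakana Fugu's conductor model has learned which model to activate for which sub-task and delegates accordingly. In the table above, Fugu Ultra posts the highest score on all six benchmarks. Base Fugu is more mixed: it leads the single models on LiveCodeBench and MATH-500 and ties GPT-5.5 for the top single-model score on AIME 2025, but trails at least one of them on SWE-Bench Pro, GPQA-D, and Humanity's Last Exam.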
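The delegation idea can be illustrated with a toy dispatcher. This is only a sketch under strong assumptions: the source says the conductor has learned its routing, whereas the example below substitutes a fixed lookup table, and the three specialist functions are hypothetical placeholders that return labelled strings instead of calling any real model API.

```python
from typing import Callable, Dict, Iterable, List, Tuple


# Hypothetical stand-ins for calls to underlying models.
def code_specialist(task: str) -> str:
    return f"[code-model] {task}"


def long_context_specialist(task: str) -> str:
    return f"[long-context-model] {task}"


def multimodal_specialist(task: str) -> str:
    return f"[multimodal-model] {task}"


# Assumption: a static table replaces the learned routing policy.
ROUTES: Dict[str, Callable[[str], str]] = {
    "code": code_specialist,
    "long_context": long_context_specialist,
    "multimodal": multimodal_specialist,
}


def conduct(subtasks: Iterable[Tuple[str, str]]) -> List[str]:
    """Dispatch each (kind, description) sub-task to its registered specialist."""
    results = []
    for kind, description in subtasks:
        handler = ROUTES.get(kind)
        if handler is None:
            raise ValueError(f"no specialist registered for {kind!r}")
        results.append(handler(description))
    return results
```

Calling `conduct([("code", "fix bug"), ("multimodal", "read chart")])` sends the first sub-task to the code specialist and the second to the multimodal one. A learned conductor would replace the `ROUTES` lookup with a model that picks the handler from the task itself.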
On SWE-Bench Pro, the gap is among the most striking: Sakana Fugu Ultra scores 73.7 versus 69.2 for GPT-5.5, the best single-model result. Only LiveCodeBench shows a wider absolute margin over the best single model (4.7 points). This 4.5-point improvement represents a generational leap, the kind of gain that normally requires training an entirely new, larger model. Sakana Fugu achieves this by having different agents handle code understanding, solution generation, and verification as separate coordinated steps.
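The three-step split described above (understanding, generation, verification) might be pictured as follows. This is a minimal sketch, not Sakana Fugu's implementation: every function is a hypothetical stub, the generator is hard-coded to fail once so the verifier has something to reject, and the retry loop is an assumption that the source does not describe.

```python
def understand(issue: str) -> dict:
    """Hypothetical 'code understanding' agent: reduce the issue to a target."""
    return {"issue": issue, "target": "add"}


def generate(context: dict, attempt: int) -> str:
    """Hypothetical 'solution generation' agent.

    The first attempt is deliberately wrong so the verification
    step below has a candidate to reject.
    """
    body = "a - b" if attempt == 0 else "a + b"
    return f"def {context['target']}(a, b):\n    return {body}\n"


def verify(context: dict, source: str) -> bool:
    """Hypothetical 'verification' agent: execute the candidate and check it."""
    namespace = {}
    exec(source, namespace)  # acceptable for a toy example; never exec untrusted code
    return namespace[context["target"]](2, 3) == 5


def solve(issue: str, max_attempts: int = 3):
    """Run the three stages in order, retrying generation until verification passes."""
    context = understand(issue)
    for attempt in range(max_attempts):
        candidate = generate(context, attempt)
        if verify(context, candidate):
            return candidate, attempt + 1
    raise RuntimeError("no candidate passed verification")
```

With these stubs, `solve("add two numbers")` rejects the first candidate and accepts the second. The point of the separation is that the verifier can fail a candidate without the generator needing to judge its own output.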
The Sakana Fugu advantage is even more pronounced on agentic, multi-step tasks. In AutoResearch benchmarks, Sakana Fugu Ultra ran 123 experiments over 14 hours autonomously. In trading benchmarks, it achieved a mean portfolio return of +19.43% versus less than 15% for any single model. These results suggest that Sakana Fugu's orchestration scales with task complexity: the harder the problem, the more the multi-agent approach pays off.
Sakana Fugu Real-World Performance
Beyond standard benchmarks, Sakana Fugu has demonstrated exceptional performance on practical, multi-step tasks that reflect real-world AI usage:
Code Review Depth
20+ issues found
vs 3 issues by competing models
Sakana Fugu Ultra catches bugs across multiple categories simultaneously by assigning different review agents to different concern areas.
Research Automation
123 experiments in 14 hours
Fully autonomous
Sakana Fugu Ultra optimized model training recipes by running and evaluating experiments without human intervention.
Rubik's Cube Solver
300/300 cubes solved
vs crashes from competitors
Sakana Fugu generated a functional solver that handled all test cases, while competing single models produced crashing code.
Trading Strategy
+19.43% mean return
vs <15% for single models
Sakana Fugu Ultra's multi-agent coordination produced portfolio optimization strategies with a higher mean return than any single model achieved.