Leaderboards
Models are ranked per benchmark and by the composite modhub Index, a plain average across every benchmark we track. Estimates are marked with ~.
modhub Index — overall standings
- 1. OpenAI o4-mini (OpenAI · 4 benchmarks): 80.9
- 2. Qwen3-Max (Alibaba (Qwen) · 4 benchmarks): 80.5
- 3. GPT-5.5 (OpenAI · 4 benchmarks): 77.2
- 4. Gemini 3 Pro (Google · 6 benchmarks): 73.6
- 5. Gemini 2.5 Pro (Google · 6 benchmarks): 70.9
- 6. GPT-5 (OpenAI · 7 benchmarks): 70.7
- 7. Qwen3-235B-A22B (Alibaba (Qwen) · 4 benchmarks): 69.0
- 8. OpenAI o3 (OpenAI · 5 benchmarks): 68.9
- 9. MiniMax M2 (MiniMax · 4 benchmarks): 68.9
- 10. DeepSeek V3.2 (DeepSeek · 5 benchmarks): 68.4
- 11. DeepSeek R1 (0528) (DeepSeek · 4 benchmarks): 67.8
- 12. Kimi K2 Thinking (Moonshot AI · 6 benchmarks): 67.6
- 13. Claude Sonnet 4.5 (Anthropic · 6 benchmarks): 65.5
- 14. GLM-4.6 (Z.ai (Zhipu) · 6 benchmarks): 63.9
- 15. GPT-5.1 (OpenAI · 4 benchmarks): 59.6
Models need results on at least 4 tracked benchmarks to qualify.
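As a rough illustration of how the Index and the qualification rule could fit together, here is a minimal Python sketch. It assumes the average is taken over the benchmarks a model actually has results for and that ~ estimates count at face value; the function name `modhub_index` and the constant `MIN_BENCHMARKS` are hypothetical, not part of any published modhub code. The input uses the four GPT-5.5 results listed on this page.

```python
from statistics import mean

MIN_BENCHMARKS = 4  # qualification threshold stated on this page


def modhub_index(scores):
    """Plain average of a model's benchmark scores.

    Returns None when the model has fewer than MIN_BENCHMARKS results.
    Assumption: only benchmarks with a result enter the average.
    """
    if len(scores) < MIN_BENCHMARKS:
        return None
    return mean(scores.values())


# GPT-5.5 results as listed on this page (~ estimates taken at face value).
gpt_5_5 = {
    "SWE-bench Verified": 82.6,
    "GPQA Diamond": 92.0,
    "AIME 2025": 99.0,
    "HLE": 35.0,
}

index = modhub_index(gpt_5_5)
print(f"GPT-5.5 index: {index:.2f}")

# A model with a single result does not qualify.
print(modhub_index({"HLE": 37.5}))
```

Under these assumptions the GPT-5.5 average comes out at roughly 77.15, which is consistent with the 77.2 shown in the overall standings.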
By benchmark
Coding
SWE-bench Verified
- 1. Claude Fable 5: ~95%
- 2. Claude Opus 4.8: 88.6%
- 3. GPT-5.5: 82.6%
Reasoning
GPQA Diamond
- 1. GPT-5.5: ~92%
- 2. Gemini 3 Pro: 91.9%
- 3. Claude Opus 4.8: ~91%
Math
AIME 2025
- 1. GPT-5.5: ~99%
- 2. Gemini 3 Pro: 95%
- 3. GPT-5: 94.6%
Reasoning
HLE
- 1. Gemini 3 Pro: 37.5%
- 2. GPT-5.5: ~35%
- 3. GPT-5.1: ~26.5%
Agentic
Terminal-Bench
- 1. Claude Opus 4.8: ~63%
- 2. Claude Opus 4.5: 59.3%
- 3. Gemini 3 Pro: 54.2%
Multimodal
MMMU
- 1. Gemini 3 Pro: ~87%
- 2. GPT-5: 84.2%
- 3. OpenAI o3: 82.9%
Knowledge
MMLU-Pro
- 1. GPT-5: ~87%
- 2. Gemini 2.5 Pro: 86.2%
- 3. Qwen3-Max: ~85.2%