HLE
Reasoning
Humanity's Last Exam (no tools)
Around 2,500 extremely hard, expert-written questions across more than a hundred subjects, designed as the final closed-ended academic benchmark. Scores are reported without tool use. Even frontier models score far below expert level (the top entry here reaches 37.5%), leaving real headroom.
Rank  Model               Developer        Score
1     Gemini 3 Pro        Google           37.5%
2     GPT-5.5             OpenAI           ~35%
3     GPT-5.1             OpenAI           ~26.5%
4     Grok 4              xAI              25.4%
5     GPT-5               OpenAI           24.8%
6     Kimi K2 Thinking    Moonshot AI      23.9%
7     Gemini 2.5 Pro      Google           21.6%
8     OpenAI o3           OpenAI           20.3%
9     DeepSeek V3.2       DeepSeek         ~19.8%
10    gpt-oss-120b        OpenAI           19%
11    Qwen3-235B-A22B     Alibaba (Qwen)   ~18.2%
12    DeepSeek R1 (0528)  DeepSeek         17.7%
13    Claude Sonnet 4.5   Anthropic        ~17.3%
14    GLM-4.6             Z.ai (Zhipu)     ~17.2%
~ marks community-reported or version-normalized figures; all others come from official model cards. Prices shown as input/output per 1M tokens. Updated 2026-06-10.