HLE

Reasoning

Humanity's Last Exam (no tools)

Around 2,500 extremely hard, expert-written questions across more than a hundred subjects, designed as the final closed-ended academic benchmark. Scores are reported without tool use. Even frontier models score far below expert level, leaving real headroom.

Rank  Model               Provider        Price (in/out)  Score
1     Gemini 3 Pro        Google          $2/$12          37.5%
2     GPT-5.5             OpenAI          $5/$30          ~35%
3     GPT-5.1             OpenAI          $1.25/$10       ~26.5%
4     Grok 4              xAI             $3/$15          25.4%
5     GPT-5               OpenAI          $1.25/$10       24.8%
6     Kimi K2 Thinking    Moonshot AI     $0.6/$2.5       23.9%
7     Gemini 2.5 Pro      Google          $1.25/$10       21.6%
8     OpenAI o3           OpenAI          $2/$8           20.3%
9     DeepSeek V3.2       DeepSeek        $0.28/$0.42     ~19.8%
10    gpt-oss-120b        OpenAI          $0.1/$0.5       19%
11    Qwen3-235B-A22B     Alibaba (Qwen)  $0.22/$0.88     ~18.2%
12    DeepSeek R1 (0528)  DeepSeek        $0.55/$2.19     17.7%
13    Claude Sonnet 4.5   Anthropic       $3/$15          ~17.3%
14    GLM-4.6             Z.ai (Zhipu)    $0.6/$2.2       ~17.2%

~ marks community-reported or version-normalized figures; all others come from official model cards. Prices shown as input/output per 1M tokens. Updated 2026-06-10.
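
Because prices are quoted as input/output dollars per 1M tokens, the cost of a single request can be estimated from its token counts. The sketch below is illustrative only: the function name `request_cost` and the token counts in the usage lines are hypothetical, and it assumes the price string has exactly the "$input/$output" shape used in the table.

```python
def request_cost(price: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request.

    `price` is a string such as "$2/$12": input and output price
    per 1M tokens, in that order (an assumption based on the note above).
    """
    # Split "$2/$12" into its two parts and drop the dollar signs.
    input_price, output_price = (
        float(part.strip().lstrip("$")) for part in price.split("/")
    )
    # Prices are per 1M tokens, so scale the token-weighted sum down.
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical request sizes, chosen only for illustration.
gemini_cost = request_cost("$2/$12", input_tokens=2_000, output_tokens=500)
gpt5_cost = request_cost("$1.25/$10", input_tokens=2_000, output_tokens=500)
print(f"{gemini_cost:.4f} {gpt5_cost:.4f}")
```

Reasoning-heavy models often emit many output tokens per question, so the output price tends to dominate this estimate on a benchmark like HLE.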