GPQA Diamond

Reasoning

Graduate-Level Google-Proof Q&A (Diamond subset)

198 PhD-level multiple-choice questions in biology, physics and chemistry, written so that even skilled non-experts with full internet access score poorly. Measures deep scientific reasoning rather than retrieval. Expert humans score around 70%.
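With only 198 questions, each one is worth about half a percentage point, so small gaps between scores can come down to a single question. The sketch below only illustrates that arithmetic; the function names are hypothetical, and published scores are often averaged over several runs, so they need not map to a whole number of questions.

```python
# Illustrative arithmetic only; helper names are hypothetical.
TOTAL_QUESTIONS = 198  # size of the Diamond subset, as stated above


def points_per_question(total: int = TOTAL_QUESTIONS) -> float:
    """Percentage points contributed by one question."""
    return 100.0 / total


def approx_correct(score_percent: float, total: int = TOTAL_QUESTIONS) -> int:
    """Nearest whole number of correct answers implied by a percentage score."""
    return round(score_percent / 100.0 * total)


if __name__ == "__main__":
    print(f"One question is worth about {points_per_question():.2f} points")
    print(f"A 70% score is roughly {approx_correct(70.0)} of {TOTAL_QUESTIONS} questions")
```

Read this way, two scores that differ by less than about 0.5 points differ by less than one question on a single pass through the set.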

Rank  Model               Provider        Price        Score
1     GPT-5.5             OpenAI          $5/$30       ~92%
2     Gemini 3 Pro        Google          $2/$12       91.9%
3     Claude Opus 4.8     Anthropic       $5/$25       ~91%
4     GPT-5.1             OpenAI          $1.25/$10    ~88.1%
5     Grok 4              xAI             $3/$15       87.5%
6     Claude Opus 4.5     Anthropic       $5/$25       ~87%
7     Gemini 2.5 Pro      Google          $1.25/$10    86.4%
8     GPT-5               OpenAI          $1.25/$10    85.7%
9     Qwen3-Max           Alibaba (Qwen)  $1.2/$6      ~85.4%
10    Kimi K2 Thinking    Moonshot AI     $0.6/$2.5    84.5%
11    Claude Sonnet 4.5   Anthropic       $3/$15       83.4%
12    OpenAI o3           OpenAI          $2/$8        83.3%
13    GPT-5 mini          OpenAI          $0.25/$2     ~82.3%
14    OpenAI o4-mini      OpenAI          $1.1/$4.4    81.4%
15    Qwen3-235B-A22B     Alibaba (Qwen)  $0.22/$0.88  81.1%
16    DeepSeek R1 (0528)  DeepSeek        $0.55/$2.19  81%
17    GLM-4.6             Z.ai (Zhipu)    $0.6/$2.2    ~81%
18    gpt-oss-120b        OpenAI          $0.1/$0.5    80.1%
19    DeepSeek V3.2       DeepSeek        $0.28/$0.42  79.9%
20    Gemini 2.5 Flash    Google          $0.3/$2.5    ~78.3%
21    MiniMax M2          MiniMax         $0.3/$1.2    ~78%
22    Claude Haiku 4.5    Anthropic       $1/$5        ~73%
23    Llama 4 Maverick    Meta            $0.27/$0.85  69.8%
24    GPT-4.1             OpenAI          $2/$8        66.3%
25    Llama 4 Scout       Meta            $0.18/$0.59  57.2%
26    GPT-4o              OpenAI          $2.5/$10     53.6%

~ marks community-reported or version-normalized figures; all others come from official model cards. Prices shown as input/output per 1M tokens. Updated 2026-06-10.
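To make the input/output price notation concrete, here is a minimal sketch that turns a price string such as "$5/$30" into an estimated cost for one request. The function names and the token counts are hypothetical, chosen only for illustration; the sketch assumes both figures are quoted per 1M tokens, as the note above states.

```python
# Illustrative sketch; function names and token counts are assumptions.
TOKENS_PER_UNIT = 1_000_000  # prices are quoted per 1M tokens


def parse_price(price: str) -> tuple[float, float]:
    """Split an 'input/output' price string like '$5/$30' into two floats."""
    input_part, output_part = price.split("/")
    return float(input_part.lstrip("$")), float(output_part.lstrip("$"))


def request_cost(price: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request, in the same currency as the price string."""
    input_price, output_price = parse_price(price)
    total = input_tokens * input_price + output_tokens * output_price
    return total / TOKENS_PER_UNIT


if __name__ == "__main__":
    # Hypothetical request: 2,000 input tokens and 8,000 output tokens.
    print(request_cost("$5/$30", 2_000, 8_000))
```

Because output tokens are usually priced higher than input tokens, long reasoning traces can dominate the cost of a request even when the prompt is short.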