GPQA Diamond
Reasoning
Graduate-Level Google-Proof Q&A (Diamond subset)
198 PhD-level multiple-choice questions in biology, physics and chemistry, written so that even skilled non-experts with full internet access score poorly. Measures deep scientific reasoning rather than retrieval. Expert humans score around 70%.
Rank  Model                 Provider          Score
1     GPT-5.5               OpenAI            ~92%
2     Gemini 3 Pro          Google            91.9%
3     Claude Opus 4.8       Anthropic         ~91%
4     GPT-5.1               OpenAI            ~88.1%
5     Grok 4                xAI               87.5%
6     Claude Opus 4.5       Anthropic         ~87%
7     Gemini 2.5 Pro        Google            86.4%
8     GPT-5                 OpenAI            85.7%
9     Qwen3-Max             Alibaba (Qwen)    ~85.4%
10    Kimi K2 Thinking      Moonshot AI       84.5%
11    Claude Sonnet 4.5     Anthropic         83.4%
12    OpenAI o3             OpenAI            83.3%
13    GPT-5 mini            OpenAI            ~82.3%
14    OpenAI o4-mini        OpenAI            81.4%
15    Qwen3-235B-A22B       Alibaba (Qwen)    81.1%
16    DeepSeek R1 (0528)    DeepSeek          81%
17    GLM-4.6               Z.ai (Zhipu)      ~81%
18    gpt-oss-120b          OpenAI            80.1%
19    DeepSeek V3.2         DeepSeek          79.9%
20    Gemini 2.5 Flash      Google            ~78.3%
21    MiniMax M2            MiniMax           ~78%
22    Claude Haiku 4.5      Anthropic         ~73%
23    Llama 4 Maverick      Meta              69.8%
24    GPT-4.1               OpenAI            66.3%
25    Llama 4 Scout         Meta              57.2%
26    GPT-4o                OpenAI            53.6%
~ marks community-reported or version-normalized figures; all others come from official model cards. Prices shown as input/output per 1M tokens. Updated 2026-06-10.
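
With only 198 questions, one question is worth about 0.5 percentage points, so small gaps between adjacent ranks can amount to a single answer. The Python sketch below shows one way to read an entry, keeping the "~" marker as a flag and converting the percentage into an approximate count of correct answers. The names `Score`, `parse_score`, and `approx_correct` are hypothetical helpers, not part of any official tooling, and the count is only a rough guide because published scores may be averaged over several runs.

```python
from typing import NamedTuple

TOTAL_QUESTIONS = 198  # size of the Diamond subset, as stated in the description


class Score(NamedTuple):
    percent: float
    approximate: bool  # True when the entry carries the "~" marker


def parse_score(text: str) -> Score:
    """Parse a leaderboard entry such as '~92%' or '91.9%' (hypothetical helper)."""
    cleaned = text.strip()
    approximate = cleaned.startswith("~")
    percent = float(cleaned.lstrip("~").rstrip("%"))
    return Score(percent, approximate)


def approx_correct(score: Score) -> int:
    """Nearest whole number of questions implied by the percentage.

    Assumes a single pass over all 198 questions, which may not match
    how a given score was actually measured.
    """
    return round(score.percent / 100 * TOTAL_QUESTIONS)


if __name__ == "__main__":
    for entry in ("~92%", "91.9%", "53.6%"):
        score = parse_score(entry)
        print(entry, score.approximate, approx_correct(score))
```

Under these assumptions, the top two entries (~92% and 91.9%) round to the same number of correct answers, which is a reminder to read neighbouring ranks as near-ties.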