GPQA Diamond

Reasoning

Graduate-Level Google-Proof Q&A (Diamond subset)

198 PhD-level, four-option multiple-choice questions in biology, physics and chemistry, written so that even skilled non-experts with full internet access score poorly. Measures deep scientific reasoning rather than retrieval. Human domain experts score around 70%; random guessing scores 25%.

~ marks community-reported or version-normalized figures; all others come from official model cards. Prices shown as input/output per 1M tokens. Updated 2026-06-10.