SWE-bench Verified
Coding: SWE-bench Verified (resolved rate)
A set of 500 human-validated, real GitHub issues from popular Python repositories. For each issue, the model must produce a patch that makes the repository's tests for that issue pass; the resolved rate is the share of the 500 issues fixed this way. It is the most-watched benchmark for agentic coding ability and arguably the best public proxy for how useful a model is inside coding agents.
Rank  Model              Provider        Resolved rate
1     Claude Fable 5     Anthropic       ~95%
2     Claude Opus 4.8    Anthropic       88.6%
3     GPT-5.5            OpenAI          82.6%
4     Kimi K2.6          Moonshot AI     ~82.5%
5     Claude Opus 4.7    Anthropic       82%
6     Claude Opus 4.6    Anthropic       ~81.4%
7     DeepSeek V4        DeepSeek        ~81%
8     Claude Opus 4.5    Anthropic       80.9%
9     Claude Sonnet 4.6  Anthropic       ~79.6%
10    Gemini 3.5 Flash   Google          78.8%
11    Gemini 3.1 Pro     Google          ~78%
12    GPT-5.2            OpenAI          ~77.9%
13    GLM-5              Z.ai (Zhipu)    ~77.8%
14    Claude Sonnet 4.5  Anthropic       77.2%
15    GPT-5.1            OpenAI          76.3%
16    Gemini 3 Pro       Google          76.2%
17    GPT-5              OpenAI          74.9%
18    Claude Opus 4.1    Anthropic       74.5%
19    Claude Haiku 4.5   Anthropic       73.3%
20    Kimi K2 Thinking   Moonshot AI     71.3%
21    GPT-5 mini         OpenAI          ~71%
22    Grok Code Fast 1   xAI             70.8%
23    Qwen3-Coder-Next   Alibaba (Qwen)  70.6%
24    Qwen3-Max          Alibaba (Qwen)  69.6%
25    MiniMax M2         MiniMax         69.4%
26    OpenAI o3          OpenAI          69.1%
27    OpenAI o4-mini     OpenAI          68.1%
28    GLM-4.6            Z.ai (Zhipu)    68%
29    DeepSeek V3.2      DeepSeek        67.8%
30    Gemini 2.5 Pro     Google          63.8%
31    GPT-4.1            OpenAI          54.6%
32    Gemini 2.5 Flash   Google          ~48.9%
33    GPT-4o             OpenAI          33.2%
~ marks community-reported or version-normalized figures; all others come from official model cards. Prices shown as input/output per 1M tokens. Updated 2026-06-10.
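
As a rough illustration of what the resolved-rate figures measure, the Python sketch below computes a percentage from per-task outcomes. It is not the official evaluation harness: the `TaskResult` record, its field names, and the helper functions are hypothetical and exist only for this example. It assumes a task counts as resolved when the tests tied to the issue pass after the patch and the tests that passed before still pass, and that any task without a result counts as unresolved.

```python
from dataclasses import dataclass

# Number of tasks in SWE-bench Verified, per the description above.
TOTAL_TASKS = 500


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical record, not an official harness type)."""
    instance_id: str
    issue_tests_pass: bool      # tests tied to the issue pass after the patch
    existing_tests_pass: bool   # tests that passed before the patch still pass


def is_resolved(result: TaskResult) -> bool:
    """Assumption: a task is resolved only if both groups of tests pass."""
    return result.issue_tests_pass and result.existing_tests_pass


def resolved_rate(results: list[TaskResult], total: int = TOTAL_TASKS) -> float:
    """Return the resolved rate as a percentage of `total` tasks.

    Tasks missing from `results` are treated as unresolved, because the
    denominator is the full benchmark size rather than len(results).
    """
    resolved = sum(1 for r in results if is_resolved(r))
    return 100.0 * resolved / total


if __name__ == "__main__":
    # Made-up outcomes for illustration: 443 resolved tasks, 57 unresolved.
    demo = [TaskResult(f"task-{i}", True, True) for i in range(443)]
    demo += [TaskResult(f"task-{i}", False, True) for i in range(443, 500)]
    print(f"{resolved_rate(demo):.1f}%")
```

With 500 tasks, each task is worth 0.2 percentage points, so a gap of 0.1 points between two entries in the table is smaller than a single task.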