SWE-bench Verified

Coding

SWE-bench Verified (resolved rate)

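A set of 500 human-validated, real GitHub issues drawn from popular Python repositories. For each issue, the model must produce a patch that makes the tests tied to that issue pass without breaking the tests that already passed; the resolved rate is the share of the 500 issues fixed this way. It is widely treated as the most-watched benchmark for agentic coding ability and as one of the best public proxies for how useful a model is inside coding agents.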
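As a rough illustration of how a resolved rate can be computed, the sketch below assumes each issue yields two boolean outcomes: whether the tests tied to the issue now pass, and whether previously passing tests still pass. The `InstanceResult` structure, its field names, and the instance IDs are hypothetical and are not taken from any official harness.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of applying one model patch to one issue (hypothetical structure)."""
    instance_id: str
    fail_to_pass_ok: bool  # tests tied to the issue pass after the patch
    pass_to_pass_ok: bool  # tests that passed before the patch still pass


def is_resolved(result: InstanceResult) -> bool:
    # Assumption: an issue counts as resolved only if both conditions hold.
    return result.fail_to_pass_ok and result.pass_to_pass_ok


def resolved_rate(results: list[InstanceResult]) -> float:
    """Return the percentage of issues resolved."""
    if not results:
        raise ValueError("resolved_rate needs at least one result")
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


# Made-up outcomes for three hypothetical issues.
demo = [
    InstanceResult("example-1", True, True),   # resolved
    InstanceResult("example-2", True, False),  # fixed the issue but broke other tests
    InstanceResult("example-3", True, True),   # resolved
]
print(f"{resolved_rate(demo):.1f}%")  # prints 66.7%
```

Under this definition, and purely as arithmetic, 443 resolved issues out of 500 would correspond to a resolved rate of 88.6%.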

Rank  Model               Provider        Price (input/output)  Resolved rate
1     Claude Fable 5      Anthropic       $10/$50               ~95%
2     Claude Opus 4.8     Anthropic       $5/$25                88.6%
3     GPT-5.5             OpenAI          $5/$30                82.6%
4     Kimi K2.6           Moonshot AI     $0.95/$4              ~82.5%
5     Claude Opus 4.7     Anthropic       $5/$25                82%
6     Claude Opus 4.6     Anthropic       $5/$25                ~81.4%
7     DeepSeek V4         DeepSeek        $0.3/$0.5             ~81%
8     Claude Opus 4.5     Anthropic       $5/$25                80.9%
9     Claude Sonnet 4.6   Anthropic       $3/$15                ~79.6%
10    Gemini 3.5 Flash    Google          $1.5/$9               78.8%
11    Gemini 3.1 Pro      Google          $2/$12                ~78%
12    GPT-5.2             OpenAI          $1.75/$14             ~77.9%
13    GLM-5               Z.ai (Zhipu)    $1/$3.2               ~77.8%
14    Claude Sonnet 4.5   Anthropic       $3/$15                77.2%
15    GPT-5.1             OpenAI          $1.25/$10             76.3%
16    Gemini 3 Pro        Google          $2/$12                76.2%
17    GPT-5               OpenAI          $1.25/$10             74.9%
18    Claude Opus 4.1     Anthropic       $15/$75               74.5%
19    Claude Haiku 4.5    Anthropic       $1/$5                 73.3%
20    Kimi K2 Thinking    Moonshot AI     $0.6/$2.5             71.3%
21    GPT-5 mini          OpenAI          $0.25/$2              ~71%
22    Grok Code Fast 1    xAI             $0.2/$1.5             70.8%
23    Qwen3-Coder-Next    Alibaba (Qwen)  $0.11/$0.8            70.6%
24    Qwen3-Max           Alibaba (Qwen)  $1.2/$6               69.6%
25    MiniMax M2          MiniMax         $0.3/$1.2             69.4%
26    OpenAI o3           OpenAI          $2/$8                 69.1%
27    OpenAI o4-mini      OpenAI          $1.1/$4.4             68.1%
28    GLM-4.6             Z.ai (Zhipu)    $0.6/$2.2             68%
29    DeepSeek V3.2       DeepSeek        $0.28/$0.42           67.8%
30    Gemini 2.5 Pro      Google          $1.25/$10             63.8%
31    GPT-4.1             OpenAI          $2/$8                 54.6%
32    Gemini 2.5 Flash    Google          $0.3/$2.5             ~48.9%
33    GPT-4o              OpenAI          $2.5/$10              33.2%

~ marks community-reported or version-normalized figures; all others come from official model cards. Prices are shown as input/output per 1M tokens. Updated 2026-06-10.
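To make the price notation concrete, the sketch below turns an input/output price string into an estimated cost for a given token usage. The helper names and the token counts are hypothetical; real agent runs vary widely in how many tokens they consume, and the sketch ignores any caching or batch discounts a provider may offer.

```python
def parse_price(price: str) -> tuple[float, float]:
    """Split a string like '$5/$25' into (input, output) dollars per 1M tokens."""
    input_part, output_part = price.split("/")
    return float(input_part.lstrip("$")), float(output_part.lstrip("$"))


def run_cost(price: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a run from its token counts (hypothetical helper)."""
    input_rate, output_rate = parse_price(price)
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical usage: 2M input tokens and 100k output tokens at $5/$25.
print(run_cost("$5/$25", 2_000_000, 100_000))  # prints 12.5
```

Because agentic coding runs tend to read far more tokens than they write, the input price often dominates the total even when the output price is several times higher.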