SWE-bench Verified

Coding

SWE-bench Verified (resolved rate)

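A set of 500 human-validated tasks built from real GitHub issues in popular Python repositories. For each task, the model must produce a patch that makes the tests tied to the issue pass without breaking tests that already passed. The resolved rate is the percentage of the 500 tasks solved this way. It is one of the most-watched benchmarks for agentic coding ability and a widely used public proxy for how useful a model is inside coding agents.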
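As a rough illustration of how a resolved rate is derived, the sketch below counts a task as resolved only when the issue's target tests pass and the previously passing tests still pass. The `InstanceResult` record, its field names, and the sample data are hypothetical and chosen for illustration; this is not the official evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of evaluating one model-generated patch (hypothetical record)."""
    instance_id: str
    fail_to_pass_ok: bool  # tests tied to the issue now pass
    pass_to_pass_ok: bool  # tests that passed before the patch still pass


def is_resolved(result: InstanceResult) -> bool:
    # A task counts only if the fix works and nothing else regresses.
    return result.fail_to_pass_ok and result.pass_to_pass_ok


def resolved_rate(results: list[InstanceResult], total: int = 500) -> float:
    """Return the percentage of tasks resolved.

    Tasks with no result (for example, no patch produced) are assumed
    to count as unresolved, so the denominator is the full task count.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / total


# Made-up sample: 4 tasks in total, 3 evaluated, 2 of them resolved.
sample = [
    InstanceResult("task-a", True, True),
    InstanceResult("task-b", True, False),   # fixed the issue, broke other tests
    InstanceResult("task-c", True, True),
]
print(resolved_rate(sample, total=4))  # 50.0
```

Dividing by the full task count rather than the number of evaluated patches keeps scores comparable across models that fail to produce a patch for some tasks.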

~ marks community-reported or version-normalized figures; all other figures come from official model cards. Prices are shown as input/output per 1M tokens. Updated 2026-06-10.