Terminal-Bench
Agentic: Terminal-Bench (agentic terminal tasks)
End-to-end tasks an engineer would do in a real terminal: building code, wrangling servers, debugging environments. The model operates a shell autonomously until the task is done. Scores are a strong predictor of how a model performs inside CLI coding agents.
Rank  Model              Developer     Score
1     Claude Opus 4.8    Anthropic     ~63%
2     Claude Opus 4.5    Anthropic     59.3%
3     Gemini 3 Pro       Google        54.2%
4     Claude Sonnet 4.5  Anthropic     50%
5     GPT-5.1            OpenAI        ~47.6%
6     Kimi K2 Thinking   Moonshot AI   ~47.1%
7     MiniMax M2         MiniMax       46.3%
8     GPT-5              OpenAI        ~43.8%
9     Claude Haiku 4.5   Anthropic     ~41%
10    GLM-4.6            Z.ai (Zhipu)  40.5%
~ marks community-reported or version-normalized figures; all others come from official model cards. Prices shown as input/output per 1M tokens. Updated 2026-06-10.
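
The loop described in the blurb, where a model operates a shell on its own until the task is done, can be pictured with a minimal sketch. This is not the benchmark's actual harness: `run_agent`, the `Policy` interface, and `scripted_policy` are hypothetical names invented for illustration, and the scripted policy stands in for a real model. A real harness would also run commands inside an isolated sandbox rather than on the host.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# One (command, output) pair per step the agent has taken so far.
Transcript = List[Tuple[str, str]]

# Hypothetical policy interface: given the transcript so far, return the next
# shell command, or None to signal that the task is finished.
Policy = Callable[[Transcript], Optional[str]]


def run_agent(policy: Policy, max_steps: int = 20, timeout: float = 10.0) -> Transcript:
    """Run commands chosen by `policy` until it stops or the step budget runs out."""
    transcript: Transcript = []
    for _ in range(max_steps):
        command = policy(transcript)
        if command is None:  # the policy considers the task done
            break
        try:
            result = subprocess.run(
                command, shell=True, capture_output=True, text=True, timeout=timeout
            )
            output = result.stdout + result.stderr
        except subprocess.TimeoutExpired:
            output = f"[timed out after {timeout}s]\n"
        # The output is fed back to the policy on the next iteration.
        transcript.append((command, output))
    return transcript


def scripted_policy(transcript: Transcript) -> Optional[str]:
    """Stand-in for a model: replays a fixed list of commands, then stops."""
    script = ["echo hello", "echo done"]
    return script[len(transcript)] if len(transcript) < len(script) else None


if __name__ == "__main__":
    for command, output in run_agent(scripted_policy):
        print(f"$ {command}\n{output}", end="")
```

The step budget and per-command timeout are assumptions added so the loop always terminates. A benchmark of this kind would then check the final state of the environment, not the transcript, to decide whether the task was solved.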