
SWE-bench Bash Only Leaderboard

SWE-bench Bash Only runs the SWE-bench Verified issues through a minimal, single-tool bash agent with no specialised scaffolding, so the score reflects the model's own agentic coding ability rather than the harness around it. The result is a cleaner, harder read on raw software-engineering skill.
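To make "single-tool bash agent" concrete, here is a minimal sketch of such a loop. It is not the benchmark's actual harness: the `Policy` type, `run_bash_agent`, and `scripted_policy` are hypothetical names, and a scripted stand-in replaces the model so the example runs on its own. It assumes `bash` is available on the PATH.

```python
import subprocess
from typing import Callable, List, Optional

# Hypothetical policy type: takes the transcript so far and returns the next
# bash command, or None when it considers the task finished.
Policy = Callable[[List[str]], Optional[str]]


def run_bash_agent(task: str, policy: Policy, max_steps: int = 10) -> List[str]:
    """Ask the policy for a command, run it in bash, record the output, repeat."""
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        command = policy(transcript)
        if command is None:  # the policy signals it is done
            break
        result = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,
            text=True,
            timeout=60,
        )
        # The only feedback the policy gets is the command and its output.
        transcript.append(f"$ {command}")
        transcript.append(result.stdout + result.stderr)
    return transcript


def scripted_policy(transcript: List[str]) -> Optional[str]:
    """Stand-in for a model (assumption): replays a fixed list of commands."""
    commands = ["echo hello", "echo done"]
    step = (len(transcript) - 1) // 2  # each step adds two transcript entries
    return commands[step] if step < len(commands) else None


log = run_bash_agent("print two lines", scripted_policy)
print("\n".join(log))
```

In a real run the policy would be a call to the model under test, and the task would be a repository issue rather than a toy instruction. The point of the sketch is only that bash is the sole tool the agent has.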

Source: swebench · 9 open models ranked + 39 proprietary · Data through Feb 2026

All models ranked on SWE-bench Bash Only

Proprietary / closed models are shown dimmed — you can't run them locally, but they show where the open field stands.

# | Model | Score
1 | Claude 4.5 Opus (high reasoning) · proprietary | 76.8%
2 | Gemini 3 Flash (high reasoning) · proprietary | 75.8%
3 | MiniMax M2.5 (high reasoning) · proprietary | 75.8%
4 | Claude Opus 4.6 · proprietary | 75.6%
5 | Claude 4.5 Opus medium (20251101) · proprietary | 74.4%
6 | Gemini 3 Pro Preview (2025-11-18) · proprietary | 74.2%
7 | GLM-5 (high reasoning) · proprietary | 72.8%
8 | GPT 5.2 Codex · proprietary | 72.8%
9 | GPT-5-2 (high reasoning) · proprietary | 72.8%
10 | GPT-5-2 Codex · proprietary | 72.8%
11 | GPT-5.2 (2025-12-11) (high reasoning) · proprietary | 71.8%
12 | Claude 4.5 Sonnet (high reasoning) · proprietary | 71.4%
13 | Kimi K2.5 (high reasoning) · proprietary | 70.8%
14 | Claude 4.5 Sonnet (20250929) · proprietary | 70.6%
15 | DeepSeek V3.2 (high reasoning) · proprietary | 70.0%
16 | Gemini 3 Pro · proprietary | 69.6%
17 | GPT-5.2 (2025-12-11) · proprietary | 69.0%
18 | Claude 4 Opus (20250514) · proprietary | 67.6%
19 | Claude 4.5 Haiku (high reasoning) · proprietary | 66.6%
20 | GPT-5.1 (2025-11-13) (medium reasoning) · proprietary | 66.0%
21 | GPT-5.1-codex (medium reasoning) · proprietary | 66.0%
22 | GPT-5 (2025-08-07) (medium reasoning) · proprietary | 65.0%
23 | Claude 4 Sonnet (20250514) · proprietary | 64.9%
24 | Kimi K2 Thinking · 1058.1B | 63.4%
25 | MiniMax M2 · 228.7B | 61.0%
26 | DeepSeek V3.2 · 685.4B | 60.0%
27 | GPT-5 mini (2025-08-07) (medium reasoning) · proprietary | 59.8%
28 | o3 (2025-04-16) · proprietary | 58.4%
29 | devstral-small-2512 · proprietary | 56.4%
30 | GPT-5 Mini · proprietary | 56.2%
31 | GLM 4.6 · 356.8B | 55.4%
32 | Qwen3 Coder 480B A35B Instruct · 480.2B | 55.4%
33 | GLM 4.5 · 358.3B | 54.2%
34 | devstral-2512 · proprietary | 53.8%
35 | Gemini 2.5 Pro (2025-05-06) · proprietary | 53.6%
36 | Claude 3.7 Sonnet (20250219) · proprietary | 52.8%
37 | o4-mini (2025-04-16) · proprietary | 45.0%
38 | Kimi K2 Instruct · 1026.5B | 43.8%
39 | GPT-4.1 (2025-04-14) · proprietary | 39.6%
40 | GPT-5 nano (2025-08-07) (medium reasoning) · proprietary | 34.8%
41 | Gemini 2.5 Flash (2025-04-17) · proprietary | 28.7%
42 | GPT OSS 120B · 120.4B | 26.0%
43 | GPT-4.1-mini (2025-04-14) · proprietary | 23.9%
44 | GPT-4o (2024-11-20) · proprietary | 21.6%
45 | Llama 4 Maverick Instruct · proprietary | 21.0%
46 | Gemini 2.0 flash · proprietary | 13.5%
47 | Llama 4 Scout Instruct · proprietary | 9.1%
48 | Qwen2.5 Coder 32B Instruct · 32.8B | 9.0%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Figure: scatter plot of SWE-bench Bash Only score (9.0% to 63.4%) against model size (log scale, 100B to 1T) for the 9 open models.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Qwen2.5 Coder 32B Instruct, 33B, score 9.0% — on the efficiency frontier (best score at its size or smaller).
  • GPT OSS 120B, 120B, score 26.0% — on the efficiency frontier (best score at its size or smaller).
  • MiniMax M2, 229B, score 61.0% — on the efficiency frontier (best score at its size or smaller).
  • Kimi K2 Thinking, 1.1T, score 63.4% — on the efficiency frontier (best score at its size or smaller).
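
The frontier rule described above (the best score at each size or smaller) can be reproduced with a single pass over the open models sorted by size. This is a minimal sketch, not the site's own code: the function name `efficiency_frontier` is hypothetical, the sizes and scores are copied from the leaderboard on this page, and treating a tie as "not on the frontier" is an assumption.

```python
from typing import List, Tuple

# (name, size in billions of parameters, score in %) as listed on this page.
OPEN_MODELS: List[Tuple[str, float, float]] = [
    ("Kimi K2 Thinking", 1058.1, 63.4),
    ("MiniMax M2", 228.7, 61.0),
    ("DeepSeek V3.2", 685.4, 60.0),
    ("GLM 4.6", 356.8, 55.4),
    ("Qwen3 Coder 480B A35B Instruct", 480.2, 55.4),
    ("GLM 4.5", 358.3, 54.2),
    ("Kimi K2 Instruct", 1026.5, 43.8),
    ("GPT OSS 120B", 120.4, 26.0),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 9.0),
]


def efficiency_frontier(models: List[Tuple[str, float, float]]) -> List[str]:
    """Return models that beat every model of equal or smaller size."""
    frontier = []
    best_score = float("-inf")
    # Sort by size ascending; at equal size, look at the higher score first.
    for name, _size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:  # assumption: a tie does not join the frontier
            frontier.append(name)
            best_score = score
    return frontier


frontier = efficiency_frontier(OPEN_MODELS)
print(frontier)
# ['Qwen2.5 Coder 32B Instruct', 'GPT OSS 120B', 'MiniMax M2', 'Kimi K2 Thinking']
```

DeepSeek V3.2 (60.0%) misses the frontier because MiniMax M2 scores 61.0% at roughly a third of the size.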

SWE-bench Bash Only: frequently asked questions

What is the best open LLM on SWE-bench Bash Only?
Kimi K2 Thinking is the top open model on SWE-bench Bash Only, scoring 63.4%. Among all models tested, including proprietary ones, it ranks #24. The top model overall is Anthropic's Claude 4.5 Opus (high reasoning) at 76.8%.
What's the best SWE-bench Bash Only model you can run on a 24 GB GPU?
Qwen2.5 Coder 32B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 9.0% on SWE-bench Bash Only.
Can open models match proprietary models on SWE-bench Bash Only?
Not yet. The strongest proprietary model, Claude 4.5 Opus (high reasoning), scores 76.8%, which is 13.4 points ahead of the best open model, Kimi K2 Thinking, at 63.4%. The open model, however, is one you can run yourself.
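
The "about 18 GB" figure in the 24 GB answer above can be sanity-checked with back-of-the-envelope arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below is an estimate only. The function names are hypothetical, and the 10% overhead factor is an assumption chosen to land near the page's figure, not the site's documented formula. Real usage also depends on context length and the quantization format.

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Weight memory in GB (10^9 bytes) at the given bit width."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(
    params_billion: float,
    vram_gb: float,
    bits_per_weight: float = 4.0,
    overhead: float = 1.1,  # assumption: ~10% extra for runtime buffers
) -> bool:
    """Rough check: do the quantized weights plus overhead fit in VRAM?"""
    return quantized_weight_gb(params_billion, bits_per_weight) * overhead <= vram_gb


# Sizes taken from the leaderboard on this page.
qwen_weights = quantized_weight_gb(32.8)   # weights alone, about 16.4 GB
qwen_fits = fits_in_vram(32.8, 24)         # Qwen2.5 Coder 32B Instruct
gpt_oss_fits = fits_in_vram(120.4, 24)     # GPT OSS 120B

print(round(qwen_weights, 1), qwen_fits, gpt_oss_fits)
```

Under these assumptions the 32.8B model needs about 16.4 GB for weights and roughly 18 GB with overhead, while the next-smallest open model on the list, GPT OSS 120B, needs around 60 GB for weights alone.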

Scores are aggregated from swebench. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.
