SWE-bench Bash Only Leaderboard
SWE-bench Bash Only runs the SWE-bench Verified issues through a minimal, single-tool bash agent with no specialised scaffolding, so the score reflects the model's own agentic coding ability rather than the harness around it. The result is a cleaner, harder read on raw software-engineering skill.
Source: swebench · 9 open models ranked + 39 proprietary · Data through Feb 2026
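To make "single-tool bash agent" concrete, here is a minimal sketch of such a loop. It is not the benchmark's actual harness: `run_bash_agent`, `propose_command`, and `scripted_model` are hypothetical names used only for illustration, and a real run would call a language model and execute inside an isolated sandbox.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# Each transcript entry is (command, combined stdout and stderr).
Transcript = List[Tuple[str, str]]


def run_bash_agent(
    propose_command: Callable[[str, Transcript], Optional[str]],
    task: str,
    max_steps: int = 10,
    timeout_s: float = 30.0,
) -> Transcript:
    """Loop: ask the model for one shell command, run it, feed the output back.

    `propose_command` stands in for the model (an assumption of this sketch).
    It returns the next command, or None when it considers the task done.
    """
    transcript: Transcript = []
    for _ in range(max_steps):
        command = propose_command(task, transcript)
        if command is None:
            break
        try:
            result = subprocess.run(
                ["bash", "-c", command],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            output = result.stdout + result.stderr
        except subprocess.TimeoutExpired:
            output = f"[timed out after {timeout_s}s]"
        transcript.append((command, output))
    return transcript


def scripted_model(task: str, transcript: Transcript) -> Optional[str]:
    """Stand-in for a model: replays a fixed list of commands, then stops."""
    script = ["echo hello"]
    if len(transcript) < len(script):
        return script[len(transcript)]
    return None


if __name__ == "__main__":
    for command, output in run_bash_agent(scripted_model, "demo task"):
        print(f"$ {command}\n{output}", end="")
```

The only tool the model gets is the shell, so everything it does (reading files, editing code, running tests) has to be expressed as commands. That is why the score depends mostly on the model and little on the surrounding harness.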
All models ranked on SWE-bench Bash Only
Proprietary / closed models are shown dimmed — you can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | Claude 4.5 Opus (high reasoning) · proprietary | 76.8% |
| 2 | Gemini 3 Flash (high reasoning) · proprietary | 75.8% |
| 3 | MiniMax M2.5 (high reasoning) · proprietary | 75.8% |
| 4 | Claude Opus 4.6 · proprietary | 75.6% |
| 5 | Claude 4.5 Opus medium (20251101) · proprietary | 74.4% |
| 6 | Gemini 3 Pro Preview (2025-11-18) · proprietary | 74.2% |
| 7 | GLM-5 (high reasoning) · proprietary | 72.8% |
| 8 | GPT 5.2 Codex · proprietary | 72.8% |
| 9 | GPT-5-2 (high reasoning) · proprietary | 72.8% |
| 10 | GPT-5-2 Codex · proprietary | 72.8% |
| 11 | GPT-5.2 (2025-12-11) (high reasoning) · proprietary | 71.8% |
| 12 | Claude 4.5 Sonnet (high reasoning) · proprietary | 71.4% |
| 13 | Kimi K2.5 (high reasoning) · proprietary | 70.8% |
| 14 | Claude 4.5 Sonnet (20250929) · proprietary | 70.6% |
| 15 | DeepSeek V3.2 (high reasoning) · proprietary | 70.0% |
| 16 | Gemini 3 Pro · proprietary | 69.6% |
| 17 | GPT-5.2 (2025-12-11) · proprietary | 69.0% |
| 18 | Claude 4 Opus (20250514) · proprietary | 67.6% |
| 19 | Claude 4.5 Haiku (high reasoning) · proprietary | 66.6% |
| 20 | GPT-5.1 (2025-11-13) (medium reasoning) · proprietary | 66.0% |
| 21 | GPT-5.1-codex (medium reasoning) · proprietary | 66.0% |
| 22 | GPT-5 (2025-08-07) (medium reasoning) · proprietary | 65.0% |
| 23 | Claude 4 Sonnet (20250514) · proprietary | 64.9% |
| 24 | Kimi K2 Thinking · 1058.1B | 63.4% |
| 25 | MiniMax M2 · 228.7B | 61.0% |
| 26 | DeepSeek V3.2 · 685.4B | 60.0% |
| 27 | GPT-5 mini (2025-08-07) (medium reasoning) · proprietary | 59.8% |
| 28 | o3 (2025-04-16) · proprietary | 58.4% |
| 29 | devstral-small-2512 · proprietary | 56.4% |
| 30 | GPT-5 Mini · proprietary | 56.2% |
| 31 | GLM 4.6 · 356.8B | 55.4% |
| 32 | Qwen3 Coder 480B A35B Instruct · 480.2B | 55.4% |
| 33 | GLM 4.5 · 358.3B | 54.2% |
| 34 | devstral-2512 · proprietary | 53.8% |
| 35 | Gemini 2.5 Pro (2025-05-06) · proprietary | 53.6% |
| 36 | Claude 3.7 Sonnet (20250219) · proprietary | 52.8% |
| 37 | o4-mini (2025-04-16) · proprietary | 45.0% |
| 38 | Kimi K2 Instruct · 1026.5B | 43.8% |
| 39 | GPT-4.1 (2025-04-14) · proprietary | 39.6% |
| 40 | GPT-5 nano (2025-08-07) (medium reasoning) · proprietary | 34.8% |
| 41 | Gemini 2.5 Flash (2025-04-17) · proprietary | 28.7% |
| 42 | GPT OSS 120B · 120.4B | 26.0% |
| 43 | GPT-4.1-mini (2025-04-14) · proprietary | 23.9% |
| 44 | GPT-4o (2024-11-20) · proprietary | 21.6% |
| 45 | Llama 4 Maverick Instruct · proprietary | 21.0% |
| 46 | Gemini 2.0 flash · proprietary | 13.5% |
| 47 | Llama 4 Scout Instruct · proprietary | 9.1% |
| 48 | Qwen2.5 Coder 32B Instruct · 32.8B | 9.0% |
Score vs model size
Which models give the most quality for their size — the ones worth running locally.
- Qwen2.5 Coder 32B Instruct, 33B, score 9.0% — on the efficiency frontier (best score at its size or smaller).
- GPT OSS 120B, 120B, score 26.0% — on the efficiency frontier (best score at its size or smaller).
- MiniMax M2, 229B, score 61.0% — on the efficiency frontier (best score at its size or smaller).
- Kimi K2 Thinking, 1.1T, score 63.4% — on the efficiency frontier (best score at its size or smaller).
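The frontier rule quoted above ("best score at its size or smaller") can be reproduced from the sized rows of the table. The sketch below assumes the rule means a model is on the frontier when it scores strictly higher than every model of equal or smaller size. The function name `efficiency_frontier` is illustrative and is not taken from the site.

```python
# (name, parameters in billions, score in %) for the nine open models
# that carry a size in the leaderboard table.
OPEN_MODELS = [
    ("Kimi K2 Thinking", 1058.1, 63.4),
    ("MiniMax M2", 228.7, 61.0),
    ("DeepSeek V3.2", 685.4, 60.0),
    ("GLM 4.6", 356.8, 55.4),
    ("Qwen3 Coder 480B A35B Instruct", 480.2, 55.4),
    ("GLM 4.5", 358.3, 54.2),
    ("Kimi K2 Instruct", 1026.5, 43.8),
    ("GPT OSS 120B", 120.4, 26.0),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 9.0),
]


def efficiency_frontier(models):
    """Return the models that beat every model of equal or smaller size."""
    frontier = []
    best_score = float("-inf")
    # Walk from smallest to largest; on a size tie, visit the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, size, score))
            best_score = score
    return frontier


if __name__ == "__main__":
    for name, size, score in efficiency_frontier(OPEN_MODELS):
        print(f"{name}: {size}B, {score}%")
```

Run on the table's data, this returns the same four models listed above. For example, DeepSeek V3.2 (685.4B, 60.0%) is excluded because the smaller MiniMax M2 already scores 61.0%.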
SWE-bench Bash Only: frequently asked questions
- What is the best open LLM on SWE-bench Bash Only?
- Kimi K2 Thinking is the top open model on SWE-bench Bash Only, scoring 63.4%. Among all models tested — including proprietary ones — it ranks #24. The top model overall is Claude 4.5 Opus (high reasoning) (Anthropic) at 76.8%.
- What's the best SWE-bench Bash Only model you can run on a 24 GB GPU?
- Qwen2.5 Coder 32B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 9.0% on SWE-bench Bash Only.
- Can open models match proprietary models on SWE-bench Bash Only?
- Not quite. The strongest proprietary model, Claude 4.5 Opus (high reasoning), scores 76.8%, which is 13.4 points ahead of the best open model, Kimi K2 Thinking, at 63.4%. The open model, however, is one you can run yourself.
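The "about 18 GB" figure in the 24 GB answer above can be approximated with simple arithmetic. The sketch below assumes 4 bits per weight plus roughly 10% overhead for quantization metadata and runtime buffers. That overhead is an assumption chosen to match the quoted figure, not a value published by the site, and real usage also depends on context length and the inference runtime.

```python
# (name, parameters in billions, score in %) for a few open models from the table.
CANDIDATES = [
    ("Kimi K2 Thinking", 1058.1, 63.4),
    ("MiniMax M2", 228.7, 61.0),
    ("GPT OSS 120B", 120.4, 26.0),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 9.0),
]


def estimate_memory_gb(params_billions, bits_per_weight=4, overhead=0.10):
    """Rough memory for the weights in GB (1 GB = 1e9 bytes).

    `overhead` is an assumed fudge factor, not a measured value.
    """
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight * (1 + overhead)


def best_model_that_fits(models, budget_gb):
    """Return the highest-scoring model whose estimate fits the budget, or None."""
    fitting = [m for m in models if estimate_memory_gb(m[1]) <= budget_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


if __name__ == "__main__":
    print(round(estimate_memory_gb(32.8), 1))       # 18.0
    print(best_model_that_fits(CANDIDATES, 24)[0])  # Qwen2.5 Coder 32B Instruct
```

Under the same assumptions GPT OSS 120B needs about 66 GB, which is why the 24 GB answer drops to a 33B-class model despite the much lower score.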
Scores aggregated from swebench. llmrun does not run this benchmark itself. See the source for methodology, or the "about benchmarks" page for what it measures.