What is the best open LLM on SWE-bench Bash Only?

Kimi K2 Thinking is the top open model on SWE-bench Bash Only, scoring 63.4%. Among all models tested, including proprietary ones, it ranks #24. The top model overall is Anthropic's Claude 4.5 Opus (high reasoning) at 76.8%.

What's the best SWE-bench Bash Only model you can run on a 24 GB GPU?

Qwen2.5 Coder 32B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 9.0% on SWE-bench Bash Only.
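The "about 18 GB" figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes exactly 4 bits per weight plus a 10% overhead for quantization metadata and runtime buffers. That overhead factor and the function name are illustrative assumptions, not something the leaderboard states, and real footprints vary with the quantization format and context length.

```python
def estimate_weight_memory_gb(
    params_billion: float,
    bits_per_weight: float = 4.0,
    overhead: float = 1.10,  # assumed 10% overhead; not a figure from the leaderboard
) -> float:
    """Rough memory estimate in GB (10^9 bytes) for a quantized model's weights."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight * overhead


# Parameter counts (in billions) taken from the leaderboard table.
qwen_gb = estimate_weight_memory_gb(32.8)      # Qwen2.5 Coder 32B Instruct
gpt_oss_gb = estimate_weight_memory_gb(120.4)  # GPT OSS 120B

print(f"Qwen2.5 Coder 32B Instruct: {qwen_gb:.1f} GB, fits in 24 GB: {qwen_gb <= 24}")
print(f"GPT OSS 120B: {gpt_oss_gb:.1f} GB, fits in 24 GB: {gpt_oss_gb <= 24}")
```

Under these assumptions the 32.8B model lands at roughly 18 GB, while the next-smallest ranked model, at 120.4B parameters, is far beyond a 24 GB card.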

Can open models match proprietary models on SWE-bench Bash Only?

Not quite: the strongest proprietary model, Claude 4.5 Opus (high reasoning), scores 76.8%, which is 13.4 points ahead of the best open model, Kimi K2 Thinking, at 63.4%. The open model, however, is one you can run yourself.


SWE-bench Bash Only Leaderboard

Name: SWE-bench Bash Only — open LLM scores
Creator: swebench

SWE-bench Bash Only runs the SWE-bench Verified issues through a minimal, single-tool bash agent — no specialised scaffolding — so the score reflects the model's own agentic coding ability rather than the harness around it. A cleaner, harder read on raw software-engineering skill.
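To make "single-tool bash agent" concrete, here is a minimal sketch of such a loop. It is not the benchmark's actual harness: the function names, the `None`-means-done convention, and the scripted stand-in for the model are all hypothetical, chosen only to show the shape of the idea (the model proposes a shell command, the harness runs it, and the output is fed back).

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# One step of history: (command that was run, combined stdout and stderr).
Step = Tuple[str, str]


def run_bash_agent(
    next_command: Callable[[List[Step]], Optional[str]],  # hypothetical model interface
    max_steps: int = 10,
) -> List[Step]:
    """Run a loop in which the only tool available is a bash command."""
    history: List[Step] = []
    for _ in range(max_steps):
        command = next_command(history)
        if command is None:  # assumed convention: None means the agent is finished
            break
        result = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,
            text=True,
            timeout=60,
        )
        history.append((command, result.stdout + result.stderr))
    return history


def scripted_policy(history: List[Step]) -> Optional[str]:
    """Stand-in for a model: issues one fixed command, then stops."""
    return "echo hello" if not history else None


steps = run_bash_agent(scripted_policy)
print(steps)
```

In a real evaluation the policy would be a call to a language model and the commands would run inside an isolated repository checkout; both are omitted here.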

Source: swebench · 9 open models ranked + 39 proprietary · Data through Feb 2026


Open models ranked on SWE-bench Bash Only

# shows rank among open models / rank overall (including proprietary).

#	Model	Score
1 / 24	Kimi K2 Thinking · 1058.1B	63.4%
2 / 25	MiniMax M2 · 228.7B	61.0%
3 / 26	DeepSeek V3.2 · 685.4B	60.0%
4 / 31	GLM 4.6 · 356.8B	55.4%
5 / 32	Qwen3 Coder 480B A35B Instruct · 480.2B	55.4%
6 / 33	GLM 4.5 · 358.3B	54.2%
7 / 38	Kimi K2 Instruct · 1026.5B	43.8%
8 / 42	GPT OSS 120B · 120.4B	26.0%
9 / 48	Qwen2.5 Coder 32B Instruct · 32.8B	9.0%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
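The frontier described above can be reproduced from the table. The sketch below assumes the frontier is a running maximum: models are sorted by parameter count, and a model is kept only if its score beats every smaller model. The function name and tie-breaking rule are illustrative choices, not the site's published method.

```python
from typing import List, Tuple

# (model, parameters in billions, score in %) copied from the leaderboard table.
MODELS: List[Tuple[str, float, float]] = [
    ("Kimi K2 Thinking", 1058.1, 63.4),
    ("MiniMax M2", 228.7, 61.0),
    ("DeepSeek V3.2", 685.4, 60.0),
    ("GLM 4.6", 356.8, 55.4),
    ("Qwen3 Coder 480B A35B Instruct", 480.2, 55.4),
    ("GLM 4.5", 358.3, 54.2),
    ("Kimi K2 Instruct", 1026.5, 43.8),
    ("GPT OSS 120B", 120.4, 26.0),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 9.0),
]


def efficiency_frontier(
    models: List[Tuple[str, float, float]],
) -> List[Tuple[str, float, float]]:
    """Return the models whose score exceeds that of every smaller model."""
    frontier = []
    best_score = float("-inf")
    # Sort by size ascending; on equal size, consider the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, size, score))
            best_score = score
    return frontier


frontier_names = [name for name, _, _ in efficiency_frontier(MODELS)]
print(frontier_names)
# ['Qwen2.5 Coder 32B Instruct', 'GPT OSS 120B', 'MiniMax M2', 'Kimi K2 Thinking']
```

On this data, MiniMax M2 stands out: at 228.7B parameters it outscores every larger open model except Kimi K2 Thinking.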


Scores are aggregated from swebench. llmrun does not run this benchmark; see the source for methodology, or the "about benchmarks" page for what it measures.