What is the best open LLM on SWE-bench Verified?

Kimi K2 Instruct 0905 is the top open model on SWE-bench Verified, scoring 71.2%. Among all models tested, including proprietary ones, it ranks #33. The top model overall is live-SWE-agent + Claude 4.5 Opus medium (20251101) at 79.2%.

What's the best SWE-bench Verified model you can run on a 24 GB GPU?

Qwen3 Coder 30B A3B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 17 GB), scoring 60.4% on SWE-bench Verified.
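
The "about 17 GB" figure is consistent with a simple back-of-the-envelope estimate, sketched below. The function names and the 10% overhead allowance are assumptions made for illustration, not values from the source, and the estimate covers weights only: the KV cache and activations need extra memory on top.

```python
def estimate_weight_memory_gb(params_billion: float,
                              bits_per_weight: float = 4.0,
                              overhead_fraction: float = 0.10) -> float:
    """Rough memory needed to hold quantized weights, in GB (1 GB = 1e9 bytes).

    overhead_fraction is an assumed allowance for quantization scales and
    layers kept at higher precision; it is not a figure from the leaderboard.
    """
    bytes_per_weight = bits_per_weight / 8
    raw_gb = params_billion * bytes_per_weight  # billions of params * bytes each
    return raw_gb * (1 + overhead_fraction)


def fits_on_gpu(params_billion: float, gpu_memory_gb: float) -> bool:
    """True if the estimated weight footprint is below the GPU's memory."""
    return estimate_weight_memory_gb(params_billion) < gpu_memory_gb


# Qwen3 Coder 30B A3B Instruct is listed at 30.5B parameters.
print(f"{estimate_weight_memory_gb(30.5):.1f} GB")  # 16.8 GB
print(fits_on_gpu(30.5, 24.0))                      # True
print(fits_on_gpu(1026.5, 24.0))                    # False (Kimi K2 Instruct 0905)
```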

Can open models match proprietary models on SWE-bench Verified?

Not quite. The strongest proprietary entry (live-SWE-agent + Claude 4.5 Opus medium (20251101)) scores 79.2%, which is 8.0 points ahead of the best open model (Kimi K2 Instruct 0905) at 71.2%. The open model, however, is one you can run yourself.

SWE-bench Verified Leaderboard

Name: SWE-bench Verified — open LLM scores
Creator: swebench

SWE-bench Verified tests whether a model can resolve real GitHub issues from popular open-source Python projects. Scores are taken from the official swebench.com leaderboard and give the percentage of human-validated issues actually fixed. It is the headline measure of practical, agentic software-engineering ability, and open-weight models like Qwen3-Coder, GLM-4.6, Kimi K2 and DeepSWE are now approaching the frontier: the best open model scores 71.2% against 79.2% for the top entry overall.
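
As a rough illustration of how such a score is formed, the sketch below converts a count of resolved issues into a percentage. It assumes the Verified subset's 500 tasks; the function name and the example count of 356 are illustrative, chosen only because 356 of 500 corresponds to 71.2%.

```python
VERIFIED_TASKS = 500  # size of the SWE-bench Verified subset (assumed here)


def resolved_rate(resolved: int, total: int = VERIFIED_TASKS) -> float:
    """Percentage of tasks whose issue was actually fixed."""
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100 * resolved / total


print(round(resolved_rate(356), 1))  # 71.2
```

With 500 tasks, each resolved issue is worth 0.2 percentage points, which is why every score in the table is a multiple of 0.2.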

Source: swebench · 13 open models ranked · +150 proprietary · Data through Feb 2026

Open models ranked on SWE-bench Verified

The # column shows rank among open models / rank overall (including proprietary). The figure after each model name is its total parameter count in billions.

#	Model	Score
1 / 33	Kimi K2 Instruct 0905 · 1026.5B	71.2%
2 / 47	Qwen3 Coder 480B A35B Instruct · 480.2B	69.6%
3 / 49	GLM 4.6 · 356.8B	68.2%
4 / 60	Kimi K2 Instruct · 1026.5B	65.4%
5 / 65	GLM 4.5 · 358.3B	64.2%
6 / 67	Kimi K2 Thinking · 1058.1B	63.4%
7 / 73	MiniMax M2 · 228.7B	61.0%
8 / 75	Qwen3 Coder 30B A3B Instruct · 30.5B	60.4%
9 / 77	DeepSeek V3.2 · 685.4B	60.0%
10 / 79	DeepSWE Preview · 32.8B	58.8%
11 / 105	Qwen2.5 Coder 32B Instruct · 32.8B	47.0%
12 / 117	DeepSeek v3 0324 · 684.5B	42.0%
13 / 143	GPT OSS 120B · 120.4B	26.0%
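
For anyone who wants to work with these rows programmatically, here is a minimal parsing sketch. It assumes the tab-separated layout shown above ("open rank / overall rank", "name · size", "score%"); the `Row` class and `parse_row` function are hypothetical names, not part of any published tool.

```python
from dataclasses import dataclass


@dataclass
class Row:
    open_rank: int
    overall_rank: int
    model: str
    params_billion: float
    score_pct: float


def parse_row(line: str) -> Row:
    """Parse one tab-separated leaderboard row into typed fields."""
    ranks, model_field, score = line.split("\t")
    open_rank, overall_rank = (int(part) for part in ranks.split("/"))
    # The size follows the last "·", so split from the right.
    name, size = model_field.rsplit("·", 1)
    return Row(
        open_rank=open_rank,
        overall_rank=overall_rank,
        model=name.strip(),
        params_billion=float(size.strip().rstrip("B")),
        score_pct=float(score.strip().rstrip("%")),
    )


# Three rows copied from the table, with tabs written as \t.
SAMPLE = [
    "1 / 33\tKimi K2 Instruct 0905 · 1026.5B\t71.2%",
    "8 / 75\tQwen3 Coder 30B A3B Instruct · 30.5B\t60.4%",
    "13 / 143\tGPT OSS 120B · 120.4B\t26.0%",
]

rows = [parse_row(line) for line in SAMPLE]
for row in rows:
    print(row)
```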

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
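
The efficiency frontier described above can be reproduced from the table with a short sketch: sort the models by size, then keep each one that beats every smaller model's score. The data is copied from the ranking table; the function name is illustrative, and the rule used here (strictly higher score than anything at the same size or smaller) is one reasonable reading of the chart's description.

```python
# (model, parameters in billions, score in %) from the ranking table
MODELS = [
    ("Kimi K2 Instruct 0905", 1026.5, 71.2),
    ("Qwen3 Coder 480B A35B Instruct", 480.2, 69.6),
    ("GLM 4.6", 356.8, 68.2),
    ("Kimi K2 Instruct", 1026.5, 65.4),
    ("GLM 4.5", 358.3, 64.2),
    ("Kimi K2 Thinking", 1058.1, 63.4),
    ("MiniMax M2", 228.7, 61.0),
    ("Qwen3 Coder 30B A3B Instruct", 30.5, 60.4),
    ("DeepSeek V3.2", 685.4, 60.0),
    ("DeepSWE Preview", 32.8, 58.8),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 47.0),
    ("DeepSeek v3 0324", 684.5, 42.0),
    ("GPT OSS 120B", 120.4, 26.0),
]


def efficiency_frontier(models):
    """Return the models that set a new best score as size increases."""
    frontier = []
    best_score = float("-inf")
    # Smallest first; at equal size, the higher score is considered first.
    for name, size_b, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, size_b, score))
            best_score = score
    return frontier


for name, size_b, score in efficiency_frontier(MODELS):
    print(f"{name}: {size_b}B, {score}%")
```

On this data the frontier consists of Qwen3 Coder 30B A3B Instruct, MiniMax M2, GLM 4.6, Qwen3 Coder 480B A35B Instruct and Kimi K2 Instruct 0905; every other model is matched or beaten by something no larger.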


Scores are aggregated from swebench. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.