SWE-bench Verified Leaderboard

SWE-bench Verified tests whether a model can resolve real GitHub issues from popular open-source Python projects. It is a human-validated subset focused on realistic software-engineering tasks.

Source: epoch · 2 open models ranked + 28 proprietary · Data through May 2026

All models ranked on SWE-bench Verified

Proprietary / closed models are shown dimmed — you can't run them locally, but they show where the open field stands.

#   Model                               Type / size  Score
1   claude-opus-4-7_max                 proprietary  83.5%
2   gpt-5.5-pre-release_xhigh           proprietary  80.6%
3   gemini-3.5-flash_high               proprietary  79.3%
4   gpt-5.4-2026-03-05_high             proprietary  76.9%
5   claude-opus-4-5-20251101            proprietary  76.6%
6   kimi-k2.6                           proprietary  76.6%
7   qwen3.6-max-preview                 proprietary  76.6%
8   claude-opus-4-6                     proprietary  75.6%
9   gemini-3.1-pro-preview-customtools  proprietary  75.6%
10  gemini-3-flash-preview              proprietary  75.4%
11  claude-sonnet-4-6                   proprietary  75.2%
12  gpt-5.3-codex_high                  proprietary  74.8%
13  GLM 5.1                             753.9B       74.2%
14  gpt-5.2-2025-12-11_high             proprietary  73.8%
15  kimi-k2.5                           proprietary  73.8%
16  gpt-5-2025-08-07_high               proprietary  73.6%
17  claude-opus-4-1-20250805            proprietary  73.4%
18  gemini-3-pro-preview                proprietary  72.9%
19  GLM 5                               753.9B       72.1%
20  gpt-5-2025-08-07_medium             proprietary  71.5%
21  claude-sonnet-4-5-20250929          proprietary  71.3%
22  claude-opus-4-20250514              proprietary  70.7%
23  gpt-5.1-2025-11-13_high             proprietary  65.9%
24  gpt-5-mini-2025-08-07_medium        proprietary  64.7%
25  o3-2025-04-16_medium                proprietary  62.3%
26  claude-3-7-sonnet-20250219          proprietary  61.0%
27  qwen3.6-plus                        proprietary  57.9%
28  gemini-2.5-pro                      proprietary  57.6%
29  gpt-4.1-2025-04-14                  proprietary  48.5%
30  gpt-4o-2024-11-20                   proprietary  31.0%

SWE-bench Verified: frequently asked questions

What is the best open LLM on SWE-bench Verified?
GLM 5.1 is the top open model on SWE-bench Verified, scoring 74.2%. Among all models tested, including proprietary ones, it ranks #13.

Can open models match proprietary models on SWE-bench Verified?
Not quite. The strongest proprietary model (claude-opus-4-7_max) scores 83.5%, which is 9.3 percentage points ahead of the best open model (GLM 5.1) at 74.2%. The open model, however, is one you can run yourself.
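
The two answers above come down to a small piece of arithmetic over the table: find the best open model, find the best proprietary one, and subtract. A minimal sketch of that calculation follows, using a handful of rows copied from the leaderboard. The names Entry, ROWS, and summarize are hypothetical and chosen for illustration only; they are not part of any llmrun or epoch tooling.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    rank: int        # position in the full leaderboard above
    model: str
    score: float     # percent of issues resolved
    is_open: bool    # False for rows marked "proprietary"


# A subset of rows copied from the table; not the full leaderboard.
ROWS = [
    Entry(1, "claude-opus-4-7_max", 83.5, False),
    Entry(2, "gpt-5.5-pre-release_xhigh", 80.6, False),
    Entry(13, "GLM 5.1", 74.2, True),
    Entry(19, "GLM 5", 72.1, True),
    Entry(30, "gpt-4o-2024-11-20", 31.0, False),
]


def summarize(rows):
    """Return (best open entry, best proprietary entry, gap in points)."""
    ranked = sorted(rows, key=lambda e: e.score, reverse=True)
    best_open = next(e for e in ranked if e.is_open)
    best_closed = next(e for e in ranked if not e.is_open)
    # Round to one decimal to match the precision of the published scores.
    gap = round(best_closed.score - best_open.score, 1)
    return best_open, best_closed, gap


if __name__ == "__main__":
    best_open, best_closed, gap = summarize(ROWS)
    print(f"Best open: {best_open.model} (#{best_open.rank}, {best_open.score}%)")
    print(f"Best proprietary: {best_closed.model} ({best_closed.score}%)")
    print(f"Gap: {gap} percentage points")
```

The gap is expressed in percentage points rather than as a relative percentage, since both scores are already percentages of the same task set.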

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "About benchmarks" page for what it measures.