SWE-bench Verified Leaderboard
SWE-bench Verified tests whether a model can resolve real GitHub issues from popular open-source Python projects. It is a human-validated subset of SWE-bench focused on realistic software-engineering tasks.
Source: epoch · 2 open models ranked + 28 proprietary · Data through May 2026
All models ranked on SWE-bench Verified
Proprietary / closed models are labeled "proprietary" in the table. You can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | claude-opus-4-7_max · proprietary | 83.5% |
| 2 | gpt-5.5-pre-release_xhigh · proprietary | 80.6% |
| 3 | gemini-3.5-flash_high · proprietary | 79.3% |
| 4 | gpt-5.4-2026-03-05_high · proprietary | 76.9% |
| 5 | claude-opus-4-5-20251101 · proprietary | 76.6% |
| 6 | kimi-k2.6 · proprietary | 76.6% |
| 7 | qwen3.6-max-preview · proprietary | 76.6% |
| 8 | claude-opus-4-6 · proprietary | 75.6% |
| 9 | gemini-3.1-pro-preview-customtools · proprietary | 75.6% |
| 10 | gemini-3-flash-preview · proprietary | 75.4% |
| 11 | claude-sonnet-4-6 · proprietary | 75.2% |
| 12 | gpt-5.3-codex_high · proprietary | 74.8% |
| 13 | GLM 5.1 · 753.9B | 74.2% |
| 14 | gpt-5.2-2025-12-11_high · proprietary | 73.8% |
| 15 | kimi-k2.5 · proprietary | 73.8% |
| 16 | gpt-5-2025-08-07_high · proprietary | 73.6% |
| 17 | claude-opus-4-1-20250805 · proprietary | 73.4% |
| 18 | gemini-3-pro-preview · proprietary | 72.9% |
| 19 | GLM 5 · 753.9B | 72.1% |
| 20 | gpt-5-2025-08-07_medium · proprietary | 71.5% |
| 21 | claude-sonnet-4-5-20250929 · proprietary | 71.3% |
| 22 | claude-opus-4-20250514 · proprietary | 70.7% |
| 23 | gpt-5.1-2025-11-13_high · proprietary | 65.9% |
| 24 | gpt-5-mini-2025-08-07_medium · proprietary | 64.7% |
| 25 | o3-2025-04-16_medium · proprietary | 62.3% |
| 26 | claude-3-7-sonnet-20250219 · proprietary | 61.0% |
| 27 | qwen3.6-plus · proprietary | 57.9% |
| 28 | gemini-2.5-pro · proprietary | 57.6% |
| 29 | gpt-4.1-2025-04-14 · proprietary | 48.5% |
| 30 | gpt-4o-2024-11-20 · proprietary | 31.0% |
SWE-bench Verified: frequently asked questions
- What is the best open LLM on SWE-bench Verified?
- GLM 5.1 is the top open model on SWE-bench Verified, scoring 74.2%. Among all models tested — including proprietary ones — it ranks #13.
- Can open models match proprietary models on SWE-bench Verified?
- Not quite: the strongest proprietary model (claude-opus-4-7_max) scores 83.5%, which is 9.3 points ahead of the best open model (GLM 5.1) at 74.2%. The open model, however, is one you can run yourself.
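The comparison in the answers above can be reproduced from the table itself. The following is a minimal sketch, not llmrun's or epoch's tooling: it assumes rows keep the `| # | Model · tag | Score |` layout shown on this page, treats any tag other than `proprietary` as marking an open model, and uses only a four-row excerpt of the table. The names `parse_rows` and `open_vs_proprietary` are hypothetical helpers introduced for illustration.

```python
import re

# Excerpt of the leaderboard rows, copied verbatim from the table above.
TABLE = """\
| 1 | claude-opus-4-7_max · proprietary | 83.5% |
| 2 | gpt-5.5-pre-release_xhigh · proprietary | 80.6% |
| 13 | GLM 5.1 · 753.9B | 74.2% |
| 19 | GLM 5 · 753.9B | 72.1% |
"""

# Assumed row layout: | rank | model name · tag | score% |
ROW = re.compile(r"^\|\s*(\d+)\s*\|\s*(.+?)\s*·\s*(.+?)\s*\|\s*([\d.]+)%\s*\|$")


def parse_rows(text):
    """Turn table lines into dicts; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue
        rank, name, tag, score = m.groups()
        rows.append({
            "rank": int(rank),
            "model": name,
            # Assumption: every tag other than "proprietary" means open weights.
            "open": tag != "proprietary",
            "score": float(score),
        })
    return rows


def open_vs_proprietary(rows):
    """Return the best open row, the best proprietary row, and the gap in points."""
    best_open = max((r for r in rows if r["open"]), key=lambda r: r["score"])
    best_closed = max((r for r in rows if not r["open"]), key=lambda r: r["score"])
    gap = round(best_closed["score"] - best_open["score"], 1)
    return best_open, best_closed, gap


rows = parse_rows(TABLE)
best_open, best_closed, gap = open_vs_proprietary(rows)
print(f"{best_open['model']} (#{best_open['rank']}) trails "
      f"{best_closed['model']} by {gap} points")
# prints: GLM 5.1 (#13) trails claude-opus-4-7_max by 9.3 points
```

The rank is read from the `#` column rather than recomputed, so an excerpt still reports the position a model holds in the full table.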
Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.