Aider Polyglot Leaderboard
The Aider Polyglot benchmark measures real-world coding across several programming languages: the model edits code to solve Exercism exercises, and is scored on whether the final solution actually runs and passes the tests.
Source: epoch · 12 open models ranked + 59 proprietary · Data through Dec 2025
Open models ranked on Aider Polyglot
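The # column shows each model's rank among open models / its rank overall (including proprietary models). The figure after each model name is its size in parameters (B = billion).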
| # | Model | Score |
|---|---|---|
| 1 / 19 | DeepSeek R1 0528 · 684.5B | 71.4% |
| 2 / 28 | Qwen3 235B A22B · 235.1B | 59.6% |
| 3 / 29 | Qwen3 235B A22B Instruct 2507 · 235.1B | 59.6% |
| 4 / 30 | Kimi K2 Instruct · 1026.5B | 59.1% |
| 5 / 32 | DeepSeek R1 · 684.5B | 56.9% |
| 6 / 34 | DeepSeek v3 0324 · 684.5B | 55.1% |
| 7 / 48 | Qwen3 32B · 32.8B | 40.0% |
| 8 / 59 | QwQ 32B · 32.8B | 20.9% |
| 9 / 63 | Qwen2.5 Coder 32B Instruct · 32.8B | 16.4% |
| 10 / 66 | C4ai Command A 03 2025 · 111.1B | 12.0% |
| 11 / 68 | Openhands Lm 32B v0.1 · 32.8B | 10.2% |
| 12 / 70 | Gemma 3 27B IT · 27.4B | 4.9% |
Score vs model size
Which models give the most quality for their size? These are the ones worth running locally.
- Gemma 3 27B IT, 27B, score 4.9% — on the efficiency frontier (best score at its size or smaller).
- Qwen3 32B, 33B, score 40.0% — on the efficiency frontier (best score at its size or smaller).
- Qwen3 235B A22B, 235B, score 59.6% — on the efficiency frontier (best score at its size or smaller).
- DeepSeek R1 0528, 685B, score 71.4% — on the efficiency frontier (best score at its size or smaller).
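The frontier rule stated in the list above (best score at its size or smaller) can be sketched in a few lines. This is a minimal illustration using the sizes and scores from the table, not the site's actual code. The function name `efficiency_frontier` and the tie handling (when two models share both size and score, the one listed first wins) are assumptions made for this example.

```python
# (name, size in billions of parameters, Aider Polyglot score in %), copied from the table
MODELS = [
    ("DeepSeek R1 0528", 684.5, 71.4),
    ("Qwen3 235B A22B", 235.1, 59.6),
    ("Qwen3 235B A22B Instruct 2507", 235.1, 59.6),
    ("Kimi K2 Instruct", 1026.5, 59.1),
    ("DeepSeek R1", 684.5, 56.9),
    ("DeepSeek v3 0324", 684.5, 55.1),
    ("Qwen3 32B", 32.8, 40.0),
    ("QwQ 32B", 32.8, 20.9),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 16.4),
    ("C4ai Command A 03 2025", 111.1, 12.0),
    ("Openhands Lm 32B v0.1", 32.8, 10.2),
    ("Gemma 3 27B IT", 27.4, 4.9),
]


def efficiency_frontier(rows):
    """Return names of models whose score beats every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Sort by size ascending, then score descending; Python's sort is stable,
    # so exact ties keep their table order (assumed tie-break).
    for name, size, score in sorted(rows, key=lambda r: (r[1], -r[2])):
        if score > best_so_far:
            frontier.append(name)
            best_so_far = score
    return frontier


print(efficiency_frontier(MODELS))
```

With the table's data this yields the same four models listed above, from smallest to largest.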
Aider Polyglot: frequently asked questions
- What is the best open LLM on Aider Polyglot?
- DeepSeek R1 0528 is the top open model on Aider Polyglot, scoring 71.4%. Among all models tested — including proprietary ones — it ranks #19.
- What's the best Aider Polyglot model you can run on a 24 GB GPU?
- Qwen3 32B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 40.0% on Aider Polyglot.
- Can open models match proprietary models on Aider Polyglot?
- Not quite: the strongest proprietary model (gpt-5-2025-08-07_high) scores 88.0%, 16.6 points ahead of the best open model (DeepSeek R1 0528) at 71.4%. The open model, however, is one you can run yourself.
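The 24 GB answer above rests on a simple size estimate. The sketch below is one rough way to reproduce it, assuming quantized weights dominate memory use. The 10% overhead factor is a hypothetical value chosen for illustration, not a figure from the source, and memory for context (KV cache) is ignored. The function names are likewise made up for this example.

```python
def quantized_size_gb(params_billion, bits=4, overhead=0.10):
    """Approximate memory for quantized weights, in GB (1 GB = 1e9 bytes)."""
    weight_gb = params_billion * bits / 8  # billions of params * bytes per param
    return weight_gb * (1 + overhead)      # assumed overhead, see lead-in


# (name, size in billions of parameters, score in %), a subset of the table
OPEN_MODELS = [
    ("Qwen3 235B A22B", 235.1, 59.6),
    ("Qwen3 32B", 32.8, 40.0),
    ("QwQ 32B", 32.8, 20.9),
    ("Gemma 3 27B IT", 27.4, 4.9),
]


def best_fit(rows, vram_gb=24.0):
    """Highest-scoring model whose estimated size fits in vram_gb, or None."""
    fitting = [r for r in rows if quantized_size_gb(r[1]) <= vram_gb]
    return max(fitting, key=lambda r: r[2])[0] if fitting else None


print(round(quantized_size_gb(32.8), 1))
print(best_fit(OPEN_MODELS))
```

Under these assumptions Qwen3 32B comes out near 18 GB, consistent with the figure quoted above, while the 235B model is far too large for a 24 GB card.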
Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the about benchmarks page for what it measures.