Aider Polyglot Leaderboard
The Aider Polyglot benchmark measures real-world coding across several programming languages: the model edits code to solve Exercism exercises, and is scored on whether the final solution actually runs and passes the tests.
Source: epoch · 12 open models ranked + 59 proprietary · Data through Dec 2025
All models ranked on Aider Polyglot
Proprietary / closed models are labeled "proprietary" in the table. You can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5-2025-08-07_high · proprietary | 88.0% |
| 2 | gpt-5-2025-08-07_medium · proprietary | 86.7% |
| 3 | o3-pro-2025-06-10_high · proprietary | 84.9% |
| 4 | gemini-2.5-pro-preview-06-05_32K · proprietary | 83.1% |
| 5 | gpt-5-2025-08-07_low · proprietary | 81.3% |
| 6 | o3-2025-04-16_high · proprietary | 81.3% |
| 7 | grok-4-0709 · proprietary | 79.6% |
| 8 | grok-4-0709_high · proprietary | 79.6% |
| 9 | gemini-2.5-pro-preview-06-05 · proprietary | 79.1% |
| 10 | gemini-2.5-pro-preview-05-06 · proprietary | 76.9% |
| 11 | o3-2025-04-16_medium · proprietary | 76.9% |
| 12 | o3-2025-04-16_unknown · proprietary | 76.9% |
| 13 | deepseek-reasoner · proprietary | 74.2% |
| 14 | DeepSeek-V3.2-Exp_thinking · proprietary | 74.2% |
| 15 | gemini-2.5-pro-exp-03-25 · proprietary | 72.9% |
| 16 | gemini-2.5-pro-preview-03-25 · proprietary | 72.9% |
| 17 | claude-opus-4-20250514_32K · proprietary | 72.0% |
| 18 | o4-mini-2025-04-16_high · proprietary | 72.0% |
| 19 | DeepSeek R1 0528 · 684.5B | 71.4% |
| 20 | claude-opus-4-20250514 · proprietary | 70.7% |
| 21 | deepseek-chat · proprietary | 70.2% |
| 22 | DeepSeek-V3.2-Exp · proprietary | 70.2% |
| 23 | claude-3-7-sonnet-20250219_32K · proprietary | 64.9% |
| 24 | o1-2024-12-17_high · proprietary | 61.7% |
| 25 | claude-sonnet-4-20250514_32K · proprietary | 61.3% |
| 26 | claude-3-7-sonnet-20250219 · proprietary | 60.4% |
| 27 | o3-mini-2025-01-31_high · proprietary | 60.4% |
| 28 | Qwen3 235B A22B · 235.1B | 59.6% |
| 29 | Qwen3 235B A22B Instruct 2507 · 235.1B | 59.6% |
| 30 | Kimi K2 Instruct · 1026.5B | 59.1% |
| 31 | moonshotai/kimi-k2-0905 · proprietary | 59.1% |
| 32 | DeepSeek R1 · 684.5B | 56.9% |
| 33 | claude-sonnet-4-20250514 · proprietary | 56.4% |
| 34 | DeepSeek v3 0324 · 684.5B | 55.1% |
| 35 | gemini-2.5-flash-preview-05-20_23K · proprietary | 55.1% |
| 36 | o3-mini-2025-01-31_medium · proprietary | 53.8% |
| 37 | grok-3-beta · proprietary | 53.3% |
| 38 | gpt-4.1-2025-04-14 · proprietary | 52.4% |
| 39 | claude-3-5-sonnet-20241022 · proprietary | 51.6% |
| 40 | grok-3-mini-beta_high · proprietary | 49.3% |
| 41 | DeepSeek-V3 · proprietary | 48.4% |
| 42 | gemini-2.5-flash-preview-04-17 · proprietary | 47.1% |
| 43 | chatgpt-4o-03-27-2025 · proprietary | 45.3% |
| 44 | gpt-4.5-preview-2025-02-27 · proprietary | 44.9% |
| 45 | gemini-2.5-flash-preview-05-20 · proprietary | 44.0% |
| 46 | gpt-oss-120b_high · proprietary | 41.8% |
| 47 | openai/gpt-oss-120b_high · proprietary | 41.8% |
| 48 | Qwen3 32B · 32.8B | 40.0% |
| 49 | gemini-exp-1206 · proprietary | 38.2% |
| 50 | gemini-2.0-pro-exp-02-05 · proprietary | 35.6% |
| 51 | grok-3-mini-beta_low · proprietary | 34.7% |
| 52 | o1-mini-2024-09-12_unknown · proprietary | 32.9% |
| 53 | gpt-4.1-mini-2025-04-14 · proprietary | 32.4% |
| 54 | claude-3-5-haiku-20241022 · proprietary | 28.0% |
| 55 | chatgpt-4o-01-29-2025 · proprietary | 27.1% |
| 56 | gpt-4o-2024-08-06 · proprietary | 23.1% |
| 57 | gemini-2.0-flash-exp · proprietary | 22.2% |
| 58 | qwen-max-2025-01-25 · proprietary | 21.8% |
| 59 | QwQ 32B · 32.8B | 20.9% |
| 60 | gemini-2.0-flash-thinking-exp-01-21 · proprietary | 18.2% |
| 61 | gpt-4o-2024-11-20 · proprietary | 18.2% |
| 62 | DeepSeek-V2.5 · proprietary | 17.8% |
| 63 | Qwen2.5 Coder 32B Instruct · 32.8B | 16.4% |
| 64 | Llama-4-Maverick-17B-128E-Instruct · proprietary | 15.6% |
| 65 | yi-lightning · proprietary | 12.9% |
| 66 | C4ai Command A 03 2025 · 111.1B | 12.0% |
| 67 | codestral-2501 · proprietary | 11.1% |
| 68 | Openhands Lm 32B v0.1 · 32.8B | 10.2% |
| 69 | gpt-4.1-nano-2025-04-14 · proprietary | 8.9% |
| 70 | Gemma 3 27B IT · 27.4B | 4.9% |
| 71 | gpt-4o-mini-2024-07-18 · proprietary | 3.6% |
Score vs model size
Which models give the most quality for their size — the ones worth running locally.
- Gemma 3 27B IT, 27B, score 4.9% — on the efficiency frontier (best score at its size or smaller).
- Qwen3 32B, 33B, score 40.0% — on the efficiency frontier (best score at its size or smaller).
- Qwen3 235B A22B, 235B, score 59.6% — on the efficiency frontier (best score at its size or smaller).
- DeepSeek R1 0528, 685B, score 71.4% — on the efficiency frontier (best score at its size or smaller).
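The frontier rule above ("best score at its size or smaller") can be sketched in a few lines. This is a minimal illustration, not the page's actual implementation: the `Model` type and `efficiency_frontier` function are hypothetical names, the sizes and scores are copied from the open-model rows of the table, and the handling of ties (only the first model of an equal size and score is kept) is an assumption.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions, as listed in the table
    score: float     # Aider Polyglot score in percent


# Open models and their scores, copied from the leaderboard table.
OPEN_MODELS = [
    Model("DeepSeek R1 0528", 684.5, 71.4),
    Model("Qwen3 235B A22B", 235.1, 59.6),
    Model("Qwen3 235B A22B Instruct 2507", 235.1, 59.6),
    Model("Kimi K2 Instruct", 1026.5, 59.1),
    Model("DeepSeek R1", 684.5, 56.9),
    Model("DeepSeek v3 0324", 684.5, 55.1),
    Model("Qwen3 32B", 32.8, 40.0),
    Model("QwQ 32B", 32.8, 20.9),
    Model("Qwen2.5 Coder 32B Instruct", 32.8, 16.4),
    Model("C4ai Command A 03 2025", 111.1, 12.0),
    Model("Openhands Lm 32B v0.1", 32.8, 10.2),
    Model("Gemma 3 27B IT", 27.4, 4.9),
]


def efficiency_frontier(models: list[Model]) -> list[Model]:
    """Return models whose score beats every model of equal or smaller size."""
    # Sort by size ascending; within one size, put the highest score first.
    ordered = sorted(models, key=lambda m: (m.params_b, -m.score))
    frontier = []
    best_so_far = float("-inf")
    for m in ordered:
        # Assumption: a tie with the running best does not join the frontier.
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


frontier_names = [m.name for m in efficiency_frontier(OPEN_MODELS)]
print(frontier_names)
```

Under these assumptions the sketch selects the same four models listed above. Kimi K2 Instruct is excluded because the smaller DeepSeek R1 0528 already scores higher.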
Aider Polyglot: frequently asked questions
- What is the best open LLM on Aider Polyglot?
- DeepSeek R1 0528 is the top open model on Aider Polyglot, scoring 71.4%. Among all models tested — including proprietary ones — it ranks #19.
- What's the best Aider Polyglot model you can run on a 24 GB GPU?
- Qwen3 32B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 40.0% on Aider Polyglot.
- Can open models match proprietary models on Aider Polyglot?
- Not quite: the strongest proprietary model (gpt-5-2025-08-07_high) scores 88.0%, 16.6 points ahead of the best open model (DeepSeek R1 0528) at 71.4%. The open model, however, is one you can run yourself.
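The 24 GB answer above rests on simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below is a rough estimate rather than the page's own method. The function names are hypothetical, and the 2 GB allowance for KV cache and runtime overhead is an assumed placeholder. Real 4-bit formats also store scales and often average somewhat more than 4 bits per weight, which may explain why the page quotes about 18 GB rather than the weights-only figure.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes) for a quantized model."""
    # params_billions * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB.
    return params_billions * bits_per_weight / 8.0


def fits_in_vram(params_billions: float, vram_gb: float,
                 bits_per_weight: float = 4.0, overhead_gb: float = 2.0) -> bool:
    """Rough fit check. overhead_gb is an assumed allowance, not a measured value."""
    return weight_memory_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


# Qwen3 32B (32.8B parameters in the table) at 4 bits per weight.
qwen3_32b_gb = weight_memory_gb(32.8)
print(f"Qwen3 32B weights at 4-bit: {qwen3_32b_gb:.1f} GB")
print("Fits in 24 GB:", fits_in_vram(32.8, 24.0))
print("Qwen3 235B A22B fits in 24 GB:", fits_in_vram(235.1, 24.0))
```

The weights-only estimate for Qwen3 32B comes to 16.4 GB, which leaves headroom on a 24 GB card. The 235.1B model needs well over 100 GB at the same precision.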
Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the "about benchmarks" page for what it measures.