Aider Polyglot Leaderboard

The Aider Polyglot benchmark measures real-world coding across several programming languages: the model edits code to solve Exercism exercises, and is scored on whether the final solution actually runs and passes the tests.

Source: epoch · 12 open models ranked + 59 proprietary · Data through Dec 2025

All models ranked on Aider Polyglot

Proprietary / closed models are labeled "proprietary". You can't run them locally, but they show where the open field stands. Open models are listed with their parameter count instead.

# | Model | Score
1 | gpt-5-2025-08-07_high · proprietary | 88.0%
2 | gpt-5-2025-08-07_medium · proprietary | 86.7%
3 | o3-pro-2025-06-10_high · proprietary | 84.9%
4 | gemini-2.5-pro-preview-06-05_32K · proprietary | 83.1%
5 | gpt-5-2025-08-07_low · proprietary | 81.3%
6 | o3-2025-04-16_high · proprietary | 81.3%
7 | grok-4-0709 · proprietary | 79.6%
8 | grok-4-0709_high · proprietary | 79.6%
9 | gemini-2.5-pro-preview-06-05 · proprietary | 79.1%
10 | gemini-2.5-pro-preview-05-06 · proprietary | 76.9%
11 | o3-2025-04-16_medium · proprietary | 76.9%
12 | o3-2025-04-16_unknown · proprietary | 76.9%
13 | deepseek-reasoner · proprietary | 74.2%
14 | DeepSeek-V3.2-Exp_thinking · proprietary | 74.2%
15 | gemini-2.5-pro-exp-03-25 · proprietary | 72.9%
16 | gemini-2.5-pro-preview-03-25 · proprietary | 72.9%
17 | claude-opus-4-20250514_32K · proprietary | 72.0%
18 | o4-mini-2025-04-16_high · proprietary | 72.0%
19 | DeepSeek R1 0528 · 684.5B | 71.4%
20 | claude-opus-4-20250514 · proprietary | 70.7%
21 | deepseek-chat · proprietary | 70.2%
22 | DeepSeek-V3.2-Exp · proprietary | 70.2%
23 | claude-3-7-sonnet-20250219_32K · proprietary | 64.9%
24 | o1-2024-12-17_high · proprietary | 61.7%
25 | claude-sonnet-4-20250514_32K · proprietary | 61.3%
26 | claude-3-7-sonnet-20250219 · proprietary | 60.4%
27 | o3-mini-2025-01-31_high · proprietary | 60.4%
28 | Qwen3 235B A22B · 235.1B | 59.6%
29 | Qwen3 235B A22B Instruct 2507 · 235.1B | 59.6%
30 | Kimi K2 Instruct · 1026.5B | 59.1%
31 | moonshotai/kimi-k2-0905 · proprietary | 59.1%
32 | DeepSeek R1 · 684.5B | 56.9%
33 | claude-sonnet-4-20250514 · proprietary | 56.4%
34 | DeepSeek v3 0324 · 684.5B | 55.1%
35 | gemini-2.5-flash-preview-05-20_23K · proprietary | 55.1%
36 | o3-mini-2025-01-31_medium · proprietary | 53.8%
37 | grok-3-beta · proprietary | 53.3%
38 | gpt-4.1-2025-04-14 · proprietary | 52.4%
39 | claude-3-5-sonnet-20241022 · proprietary | 51.6%
40 | grok-3-mini-beta_high · proprietary | 49.3%
41 | DeepSeek-V3 · proprietary | 48.4%
42 | gemini-2.5-flash-preview-04-17 · proprietary | 47.1%
43 | chatgpt-4o-03-27-2025 · proprietary | 45.3%
44 | gpt-4.5-preview-2025-02-27 · proprietary | 44.9%
45 | gemini-2.5-flash-preview-05-20 · proprietary | 44.0%
46 | gpt-oss-120b_high · proprietary | 41.8%
47 | openai/gpt-oss-120b_high · proprietary | 41.8%
48 | Qwen3 32B · 32.8B | 40.0%
49 | gemini-exp-1206 · proprietary | 38.2%
50 | gemini-2.0-pro-exp-02-05 · proprietary | 35.6%
51 | grok-3-mini-beta_low · proprietary | 34.7%
52 | o1-mini-2024-09-12_unknown · proprietary | 32.9%
53 | gpt-4.1-mini-2025-04-14 · proprietary | 32.4%
54 | claude-3-5-haiku-20241022 · proprietary | 28.0%
55 | chatgpt-4o-01-29-2025 · proprietary | 27.1%
56 | gpt-4o-2024-08-06 · proprietary | 23.1%
57 | gemini-2.0-flash-exp · proprietary | 22.2%
58 | qwen-max-2025-01-25 · proprietary | 21.8%
59 | QwQ 32B · 32.8B | 20.9%
60 | gemini-2.0-flash-thinking-exp-01-21 · proprietary | 18.2%
61 | gpt-4o-2024-11-20 · proprietary | 18.2%
62 | DeepSeek-V2.5 · proprietary | 17.8%
63 | Qwen2.5 Coder 32B Instruct · 32.8B | 16.4%
64 | Llama-4-Maverick-17B-128E-Instruct · proprietary | 15.6%
65 | yi-lightning · proprietary | 12.9%
66 | C4ai Command A 03 2025 · 111.1B | 12.0%
67 | codestral-2501 · proprietary | 11.1%
68 | Openhands Lm 32B v0.1 · 32.8B | 10.2%
69 | gpt-4.1-nano-2025-04-14 · proprietary | 8.9%
70 | Gemma 3 27B IT · 27.4B | 4.9%
71 | gpt-4o-mini-2024-07-18 · proprietary | 3.6%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Figure: scatter plot of Aider Polyglot score (y-axis, 4.9% to 71.4%) against model size (x-axis, log scale, ticks at 100B and 1T) for the 12 open models. Labeled points: Gemma 3 27B IT, Qwen3 32B, Qwen3 235B A22B, DeepSeek R1 0528.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Gemma 3 27B IT, 27B, score 4.9% — on the efficiency frontier (best score at its size or smaller).
  • Qwen3 32B, 33B, score 40.0% — on the efficiency frontier (best score at its size or smaller).
  • Qwen3 235B A22B, 235B, score 59.6% — on the efficiency frontier (best score at its size or smaller).
  • DeepSeek R1 0528, 685B, score 71.4% — on the efficiency frontier (best score at its size or smaller).
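
The frontier rule described above ("the best score you can get at each size or smaller") can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the names `Model`, `OPEN_MODELS`, and `efficiency_frontier` are hypothetical, the sizes and scores are copied from the ranking, and the tie-break (among equal size and score, the first-listed model wins) is an assumption chosen because it reproduces the four frontier models listed above.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    size_b: float  # parameter count in billions, as listed in the ranking
    score: float   # Aider Polyglot score in percent


# The 12 open models from the ranking, in ranking order.
OPEN_MODELS = [
    Model("DeepSeek R1 0528", 684.5, 71.4),
    Model("Qwen3 235B A22B", 235.1, 59.6),
    Model("Qwen3 235B A22B Instruct 2507", 235.1, 59.6),
    Model("Kimi K2 Instruct", 1026.5, 59.1),
    Model("DeepSeek R1", 684.5, 56.9),
    Model("DeepSeek v3 0324", 684.5, 55.1),
    Model("Qwen3 32B", 32.8, 40.0),
    Model("QwQ 32B", 32.8, 20.9),
    Model("Qwen2.5 Coder 32B Instruct", 32.8, 16.4),
    Model("C4ai Command A 03 2025", 111.1, 12.0),
    Model("Openhands Lm 32B v0.1", 32.8, 10.2),
    Model("Gemma 3 27B IT", 27.4, 4.9),
]


def efficiency_frontier(models):
    """Return the models that strictly beat every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; within one size, visit the best score first.
    # sorted() is stable, so exact ties keep their input order (assumed tie-break).
    for m in sorted(models, key=lambda m: (m.size_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


if __name__ == "__main__":
    for m in efficiency_frontier(OPEN_MODELS):
        print(f"{m.name}, {m.size_b}B, score {m.score}%")
```

Under these assumptions the sketch selects Gemma 3 27B IT, Qwen3 32B, Qwen3 235B A22B, and DeepSeek R1 0528, matching the list above. Kimi K2 Instruct is larger than DeepSeek R1 0528 but scores lower, so it falls below the frontier.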

Aider Polyglot: frequently asked questions

What is the best open LLM on Aider Polyglot?
DeepSeek R1 0528 is the top open model on Aider Polyglot, scoring 71.4%. Among all models tested — including proprietary ones — it ranks #19.
What's the best Aider Polyglot model you can run on a 24 GB GPU?
Qwen3 32B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 40.0% on Aider Polyglot.
Can open models match proprietary models on Aider Polyglot?
Not quite. The strongest proprietary model (gpt-5-2025-08-07_high) scores 88.0%, 16.6 points ahead of the best open model (DeepSeek R1 0528) at 71.4%, but you can run the open one yourself.
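
The 24 GB answer above rests on simple arithmetic: parameters times bits per weight, divided by 8, gives the size of the weights in bytes. The following is a rough sketch of that estimate, not the page's own calculation. The function names and the `overhead_gb` default (a 2 GB allowance for KV cache and runtime buffers) are assumptions for illustration. At exactly 4 bits, 32.8B parameters come to 16.4 GB of weights, so the "about 18 GB" figure presumably includes overhead or a slightly higher effective bit width.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8.0


def fits_in_vram(
    params_billions: float,
    vram_gb: float,
    bits_per_weight: float = 4.0,
    overhead_gb: float = 2.0,  # assumed allowance for KV cache and buffers
) -> bool:
    """Rough check: do the quantized weights plus overhead fit in VRAM?"""
    return weight_size_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Sizes (billions of parameters) taken from the ranking.
    for name, size_b in [
        ("Gemma 3 27B IT", 27.4),
        ("Qwen3 32B", 32.8),
        ("Qwen3 235B A22B", 235.1),
    ]:
        gb = weight_size_gb(size_b)
        verdict = "fits" if fits_in_vram(size_b, 24.0) else "does not fit"
        print(f"{name}: ~{gb:.1f} GB of 4-bit weights, {verdict} in 24 GB")
```

Real memory use also depends on context length and on the quantization format, so treat the result as a first filter rather than a guarantee.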

Scores are aggregated from epoch. llmrun does not run this benchmark itself. See the source for methodology, or the "about benchmarks" page for what it measures.