Aider Polyglot Leaderboard

The Aider Polyglot benchmark measures real-world coding across several programming languages: the model edits code to solve Exercism exercises, and is scored on whether the final solution actually runs and passes the tests.

Source: epoch · 12 open models ranked · +59 proprietary · Data through Dec 2025

Open models ranked on Aider Polyglot

The # column shows each model's rank among open models / rank overall (including proprietary models).

#         Model                                         Score
1 / 19    DeepSeek R1 0528 · 684.5B                     71.4%
2 / 28    Qwen3 235B A22B · 235.1B                      59.6%
3 / 29    Qwen3 235B A22B Instruct 2507 · 235.1B        59.6%
4 / 30    Kimi K2 Instruct · 1026.5B                    59.1%
5 / 32    DeepSeek R1 · 684.5B                          56.9%
6 / 34    DeepSeek v3 0324 · 684.5B                     55.1%
7 / 48    Qwen3 32B · 32.8B                             40.0%
8 / 59    QwQ 32B · 32.8B                               20.9%
9 / 63    Qwen2.5 Coder 32B Instruct · 32.8B            16.4%
10 / 66   C4ai Command A 03 2025 · 111.1B               12.0%
11 / 68   Openhands Lm 32B v0.1 · 32.8B                 10.2%
12 / 70   Gemma 3 27B IT · 27.4B                        4.9%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: Aider Polyglot score (y-axis, 4.9% to 71.4%) against model size (x-axis, log scale, ticks at 100B and 1T), with one labeled dot per open model.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Gemma 3 27B IT, 27B, score 4.9% — on the efficiency frontier (best score at its size or smaller).
  • Qwen3 32B, 33B, score 40.0% — on the efficiency frontier (best score at its size or smaller).
  • Qwen3 235B A22B, 235B, score 59.6% — on the efficiency frontier (best score at its size or smaller).
  • DeepSeek R1 0528, 685B, score 71.4% — on the efficiency frontier (best score at its size or smaller).
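The frontier described above can be reproduced from the table with a single pass over the models sorted by size. This is a minimal sketch, not the site's actual code: the function name `efficiency_frontier` and the tie-breaking rule (only a strictly higher score joins the frontier, with table order preserved among exact ties) are assumptions chosen for illustration.

```python
# (model, size in billions of parameters, score in %), copied from the table.
MODELS = [
    ("DeepSeek R1 0528", 684.5, 71.4),
    ("Qwen3 235B A22B", 235.1, 59.6),
    ("Qwen3 235B A22B Instruct 2507", 235.1, 59.6),
    ("Kimi K2 Instruct", 1026.5, 59.1),
    ("DeepSeek R1", 684.5, 56.9),
    ("DeepSeek v3 0324", 684.5, 55.1),
    ("Qwen3 32B", 32.8, 40.0),
    ("QwQ 32B", 32.8, 20.9),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 16.4),
    ("C4ai Command A 03 2025", 111.1, 12.0),
    ("Openhands Lm 32B v0.1", 32.8, 10.2),
    ("Gemma 3 27B IT", 27.4, 4.9),
]


def efficiency_frontier(models):
    """Return names of models that beat every model of equal or smaller size.

    Assumption: a model joins the frontier only if its score is strictly
    higher than the best score seen so far at its size or smaller.
    """
    # Smallest first; within one size, highest score first.
    # sorted() is stable, so exact ties keep their table order.
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier = []
    best_score = float("-inf")
    for name, _size, score in ordered:
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


if __name__ == "__main__":
    for name in efficiency_frontier(MODELS):
        print(name)
```

Under these assumptions the sketch returns the same four models listed above. Kimi K2 Instruct is excluded because the smaller DeepSeek R1 0528 already scores higher, and Qwen3 235B A22B Instruct 2507 is excluded only by the tie-breaking rule.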

Aider Polyglot: frequently asked questions

What is the best open LLM on Aider Polyglot?
DeepSeek R1 0528 is the top open model on Aider Polyglot, scoring 71.4%. Among all models tested — including proprietary ones — it ranks #19.
What's the best Aider Polyglot model you can run on a 24 GB GPU?
Qwen3 32B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 40.0% on Aider Polyglot.
Can open models match proprietary models on Aider Polyglot?
Not yet. The strongest proprietary model (gpt-5-2025-08-07_high) scores 88.0% on Aider Polyglot, 16.6 points ahead of the best open model (DeepSeek R1 0528) at 71.4%. The open model, however, is one you can run yourself.
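The 24 GB answer above rests on simple arithmetic: at 4 bits per weight, each parameter takes half a byte, so Qwen3 32B's 32.8B parameters need about 16.4 GB for weights alone. The sketch below shows that estimate. The 10% overhead factor is an assumption picked to land near the quoted "about 18 GB", not a documented formula, and the estimate ignores the KV cache, which grows with context length. The helper names `quantized_size_gb` and `fits` are hypothetical.

```python
def quantized_size_gb(params_billions, bits_per_weight=4, overhead=1.10):
    """Rough memory footprint in GB for a quantized model.

    Assumption: `overhead` (10% here) stands in for quantization metadata
    and runtime buffers; it is an illustrative guess.
    """
    # 1e9 parameters * (bits / 8) bytes each = that many GB.
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * overhead


def fits(params_billions, vram_gb=24, **kwargs):
    """True if the estimated footprint is within the given VRAM budget."""
    return quantized_size_gb(params_billions, **kwargs) <= vram_gb


if __name__ == "__main__":
    # Sizes (billions of parameters) taken from the leaderboard table.
    for name, size in [("Gemma 3 27B IT", 27.4),
                       ("Qwen3 32B", 32.8),
                       ("C4ai Command A 03 2025", 111.1)]:
        print(f"{name}: ~{quantized_size_gb(size):.1f} GB, fits in 24 GB: {fits(size)}")
```

By this estimate the 27B and 32B models fit in 24 GB at 4-bit, while everything from 111B upward does not, which is consistent with Qwen3 32B being the highest scorer within that budget.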

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.