LiveBench Coding Leaderboard
LiveBench Coding evaluates code generation and completion on fresh, contamination-free programming tasks that are updated regularly.
Source: livebench · 13 open models ranked + 39 proprietary · Data through Nov 2025
All models ranked on LiveBench Coding
Proprietary / closed models are marked "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.
| # | Model · parameter count | Score |
|---|---|---|
| 1 | gemini-2.5-pro-exp-03-25 · proprietary | 85.9 |
| 2 | o3-mini-2025-01-31_high · proprietary | 82.7 |
| 3 | gpt-4.5-preview-2025-02-27 · proprietary | 75.2 |
| 4 | gpt-5.1-2025-11-13_high · proprietary | 72.5 |
| 5 | QwQ 32B · 32.8B | 72.2 |
| 6 | DeepSeek v3 0324 · 684.5B | 70.9 |
| 7 | o1-2024-12-17_high · proprietary | 69.7 |
| 8 | claude-3-7-sonnet-20250219 · proprietary | 67.5 |
| 9 | claude-3-5-sonnet-20241022 · proprietary | 67.1 |
| 10 | DeepSeek R1 · 684.5B | 66.7 |
| 11 | o3-mini-2025-01-31_medium · proprietary | 65.4 |
| 12 | qwen2.5-max · proprietary | 64.4 |
| 13 | gemini-2.0-pro-exp-02-05 · proprietary | 63.5 |
| 14 | gemini-exp-1206 · proprietary | 63.4 |
| 15 | DeepSeek-V3 · proprietary | 61.8 |
| 16 | o3-mini-2025-01-31_low · proprietary | 61.5 |
| 17 | Dracarys2-72B-Instruct · proprietary | 58.9 |
| 18 | Qwen2.5 Coder 32B Instruct · 32.8B | 56.9 |
| 19 | gemini-2.0-flash-exp · proprietary | 54.4 |
| 20 | gemini-2.0-flash-001 · proprietary | 53.9 |
| 21 | gemini-2.0-flash-thinking-exp-01-21 · proprietary | 53.5 |
| 22 | DeepSeek R1 Distill Llama 70B · 70B | 51.6 |
| 23 | gpt-4o-2024-08-06 · proprietary | 51.4 |
| 24 | claude-3-5-haiku-20241022 · proprietary | 51.4 |
| 25 | o1-mini-2024-09-12_medium · proprietary | 48.0 |
| 26 | gemini-2.0-flash-lite · proprietary | 47.1 |
| 27 | mistral-large-2411 · proprietary | 47.1 |
| 28 | learnlm-1.5-pro-experimental · proprietary | 46.9 |
| 29 | grok-2-1212 · proprietary | 46.4 |
| 30 | gpt-4o-2024-11-20 · proprietary | 46.1 |
| 31 | gemini-2.0-flash-lite-preview-02-05 · proprietary | 43.8 |
| 32 | gpt-4o-mini-2024-07-18 · proprietary | 43.1 |
| 33 | Gemma 3 27B IT · 27.4B | 39.9 |
| 34 | claude-3-opus-20240229 · proprietary | 38.6 |
| 35 | amazon.nova-pro-v1:0 · proprietary | 38.1 |
| 36 | QwQ 32B Preview · 32.8B | 37.2 |
| 37 | Llama 3.3 70B Instruct · 70.6B | 36.6 |
| 38 | Dracarys2-Llama-3.1-70B-Instruct · proprietary | 36.3 |
| 39 | mistral-small-2503 · proprietary | 36.2 |
| 40 | Gemma 2 27B IT · 27.2B | 36.0 |
| 41 | mistral-small-2501 · proprietary | 35.3 |
| 42 | sonar · proprietary | 35.1 |
| 43 | DeepSeek R1 Distill Qwen 32B · 32.8B | 33.7 |
| 44 | Phi 4 · 14.7B | 30.7 |
| 45 | amazon.nova-lite-v1:0 · proprietary | 27.5 |
| 46 | Gemma 2 9B IT · 9.2B | 22.5 |
| 47 | Phi-3-small-8k-instruct · proprietary | 20.3 |
| 48 | amazon.nova-micro-v1:0 · proprietary | 20.2 |
| 49 | c4ai-command-r-plus-08-2024 · proprietary | 19.1 |
| 50 | c4ai-command-r-08-2024 · proprietary | 17.9 |
| 51 | Phi 3 Mini 4k Instruct · 3.8B | 15.5 |
| 52 | OLMo-2-1124-13B-Instruct · proprietary | 10.4 |
Score vs model size
These open models give the most quality for their size, which makes them the ones most worth running locally. Each is on the efficiency frontier: no ranked open model of the same size or smaller scores higher.
- Phi 3 Mini 4k Instruct, 3.8B, score 15.5
- Gemma 2 9B IT, 9.2B, score 22.5
- Phi 4, 14.7B, score 30.7
- Gemma 2 27B IT, 27.2B, score 36.0
- Gemma 3 27B IT, 27.4B, score 39.9
- QwQ 32B, 32.8B, score 72.2
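The frontier rule (best score at a given size or smaller) can be reproduced from the ranking table. The following is a minimal sketch, not the site's own code: the `Model` type and the `efficiency_frontier` helper are hypothetical names introduced here, and the data is limited to the 13 open models whose parameter counts appear in the table.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions, as listed in the table
    score: float     # LiveBench Coding score


# The 13 open models from the ranking table.
OPEN_MODELS = [
    Model("QwQ 32B", 32.8, 72.2),
    Model("DeepSeek v3 0324", 684.5, 70.9),
    Model("DeepSeek R1", 684.5, 66.7),
    Model("Qwen2.5 Coder 32B Instruct", 32.8, 56.9),
    Model("DeepSeek R1 Distill Llama 70B", 70.0, 51.6),
    Model("Gemma 3 27B IT", 27.4, 39.9),
    Model("QwQ 32B Preview", 32.8, 37.2),
    Model("Llama 3.3 70B Instruct", 70.6, 36.6),
    Model("Gemma 2 27B IT", 27.2, 36.0),
    Model("DeepSeek R1 Distill Qwen 32B", 32.8, 33.7),
    Model("Phi 4", 14.7, 30.7),
    Model("Gemma 2 9B IT", 9.2, 22.5),
    Model("Phi 3 Mini 4k Instruct", 3.8, 15.5),
]


def efficiency_frontier(models):
    """Keep each model that outscores every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; within one size, highest score first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


if __name__ == "__main__":
    for m in efficiency_frontier(OPEN_MODELS):
        print(f"{m.name}: {m.params_b}B, score {m.score}")
```

Run on the table data, this yields the same six models listed above. Models larger than QwQ 32B, including both 684.5B DeepSeek entries, drop out because none of them beats its 72.2.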
LiveBench Coding: frequently asked questions
- What is the best open LLM on LiveBench Coding?
- QwQ 32B is the top open model on LiveBench Coding, scoring 72.2. Among all models tested — including proprietary ones — it ranks #5.
- What's the best LiveBench Coding model you can run on a 24 GB GPU?
- QwQ 32B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 72.2 on LiveBench Coding.
- What's the best LiveBench Coding model you can run on a 12 GB GPU?
- Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 30.7 on LiveBench Coding.
- Can open models match proprietary models on LiveBench Coding?
- Not quite: the strongest proprietary model (gemini-2.5-pro-exp-03-25) scores 85.9, which is 13.7 points ahead of the best open model (QwQ 32B) at 72.2. The difference is that you can run the open one yourself.
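The page does not say how the "about 18 GB" and "about 8 GB" figures are derived. The sketch below shows one way the GPU-fit answers could be computed. It assumes 0.5 bytes per parameter for 4-bit weights plus a flat 10% runtime overhead. Both constants are assumptions chosen because they roughly reproduce the two quoted figures, and the function names are hypothetical. Real memory use also depends on context length, KV cache size, and the quantization format.

```python
# Assumptions (not from the source): 4-bit weights cost 0.5 bytes per
# parameter, and runtime buffers add a flat 10% on top.
BYTES_PER_PARAM_4BIT = 0.5
OVERHEAD_FACTOR = 1.10

# (name, parameters in billions, LiveBench Coding score) for the open models.
OPEN_MODELS = [
    ("QwQ 32B", 32.8, 72.2),
    ("DeepSeek v3 0324", 684.5, 70.9),
    ("DeepSeek R1", 684.5, 66.7),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 56.9),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 51.6),
    ("Gemma 3 27B IT", 27.4, 39.9),
    ("QwQ 32B Preview", 32.8, 37.2),
    ("Llama 3.3 70B Instruct", 70.6, 36.6),
    ("Gemma 2 27B IT", 27.2, 36.0),
    ("DeepSeek R1 Distill Qwen 32B", 32.8, 33.7),
    ("Phi 4", 14.7, 30.7),
    ("Gemma 2 9B IT", 9.2, 22.5),
    ("Phi 3 Mini 4k Instruct", 3.8, 15.5),
]


def estimated_vram_gb(params_b):
    """Rough 4-bit memory estimate in GB for a model with params_b billion parameters."""
    return params_b * BYTES_PER_PARAM_4BIT * OVERHEAD_FACTOR


def best_fit(models, vram_gb):
    """Return the highest-scoring model whose estimate fits in vram_gb, or None."""
    fitting = [m for m in models if estimated_vram_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)


if __name__ == "__main__":
    for budget in (12, 24):
        print(budget, "GB:", best_fit(OPEN_MODELS, budget))
```

Under these assumptions a 24 GB budget selects QwQ 32B and a 12 GB budget selects Phi 4, matching the answers above. A different overhead factor would shift the cutoffs, so treat the estimate as a first filter and not as a guarantee that a model will load.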
Scores are aggregated from livebench. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.