
LiveBench Coding Leaderboard

LiveBench Coding evaluates code generation and completion on fresh, contamination-free programming tasks that are updated regularly.

Source: livebench · 13 open models ranked + 39 proprietary · Data through Nov 2025

All models ranked on LiveBench Coding

Proprietary / closed models are marked "proprietary"; open models list their parameter count instead. You can't run the proprietary models locally, but they show where the open field stands.

# | Model | Score
1 | gemini-2.5-pro-exp-03-25 · proprietary | 85.9
2 | o3-mini-2025-01-31_high · proprietary | 82.7
3 | gpt-4.5-preview-2025-02-27 · proprietary | 75.2
4 | gpt-5.1-2025-11-13_high · proprietary | 72.5
5 | QwQ 32B · 32.8B | 72.2
6 | DeepSeek v3 0324 · 684.5B | 70.9
7 | o1-2024-12-17_high · proprietary | 69.7
8 | claude-3-7-sonnet-20250219 · proprietary | 67.5
9 | claude-3-5-sonnet-20241022 · proprietary | 67.1
10 | DeepSeek R1 · 684.5B | 66.7
11 | o3-mini-2025-01-31_medium · proprietary | 65.4
12 | qwen2.5-max · proprietary | 64.4
13 | gemini-2.0-pro-exp-02-05 · proprietary | 63.5
14 | gemini-exp-1206 · proprietary | 63.4
15 | DeepSeek-V3 · proprietary | 61.8
16 | o3-mini-2025-01-31_low · proprietary | 61.5
17 | Dracarys2-72B-Instruct · proprietary | 58.9
18 | Qwen2.5 Coder 32B Instruct · 32.8B | 56.9
19 | gemini-2.0-flash-exp · proprietary | 54.4
20 | gemini-2.0-flash-001 · proprietary | 53.9
21 | gemini-2.0-flash-thinking-exp-01-21 · proprietary | 53.5
22 | DeepSeek R1 Distill Llama 70B · 70B | 51.6
23 | gpt-4o-2024-08-06 · proprietary | 51.4
24 | claude-3-5-haiku-20241022 · proprietary | 51.4
25 | o1-mini-2024-09-12_medium · proprietary | 48.0
26 | gemini-2.0-flash-lite · proprietary | 47.1
27 | mistral-large-2411 · proprietary | 47.1
28 | learnlm-1.5-pro-experimental · proprietary | 46.9
29 | grok-2-1212 · proprietary | 46.4
30 | gpt-4o-2024-11-20 · proprietary | 46.1
31 | gemini-2.0-flash-lite-preview-02-05 · proprietary | 43.8
32 | gpt-4o-mini-2024-07-18 · proprietary | 43.1
33 | Gemma 3 27B IT · 27.4B | 39.9
34 | claude-3-opus-20240229 · proprietary | 38.6
35 | amazon.nova-pro-v1:0 · proprietary | 38.1
36 | QwQ 32B Preview · 32.8B | 37.2
37 | Llama 3.3 70B Instruct · 70.6B | 36.6
38 | Dracarys2-Llama-3.1-70B-Instruct · proprietary | 36.3
39 | mistral-small-2503 · proprietary | 36.2
40 | Gemma 2 27B IT · 27.2B | 36.0
41 | mistral-small-2501 · proprietary | 35.3
42 | sonar · proprietary | 35.1
43 | DeepSeek R1 Distill Qwen 32B · 32.8B | 33.7
44 | Phi 4 · 14.7B | 30.7
45 | amazon.nova-lite-v1:0 · proprietary | 27.5
46 | Gemma 2 9B IT · 9.2B | 22.5
47 | Phi-3-small-8k-instruct · proprietary | 20.3
48 | amazon.nova-micro-v1:0 · proprietary | 20.2
49 | c4ai-command-r-plus-08-2024 · proprietary | 19.1
50 | c4ai-command-r-08-2024 · proprietary | 17.9
51 | Phi 3 Mini 4k Instruct · 3.8B | 15.5
52 | OLMo-2-1124-13B-Instruct · proprietary | 10.4

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: LiveBench Coding score (y-axis, 15.5 to 72.2) against model size (x-axis, log scale, ticks at 10B and 100B) for the 13 open models.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Phi 3 Mini 4k Instruct, 4B, score 15.5 — on the efficiency frontier (best score at its size or smaller).
  • Gemma 2 9B IT, 9B, score 22.5 — on the efficiency frontier (best score at its size or smaller).
  • Phi 4, 15B, score 30.7 — on the efficiency frontier (best score at its size or smaller).
  • Gemma 2 27B IT, 27B, score 36.0 — on the efficiency frontier (best score at its size or smaller).
  • Gemma 3 27B IT, 27B, score 39.9 — on the efficiency frontier (best score at its size or smaller).
  • QwQ 32B, 33B, score 72.2 — on the efficiency frontier (best score at its size or smaller).
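
The frontier described above can be reproduced from the leaderboard numbers. The following is a minimal sketch, not llmrun's actual implementation: it assumes the frontier is built by scanning open models from smallest to largest and keeping each one that sets a new best score. The names `Model`, `OPEN_MODELS`, and `efficiency_frontier` are hypothetical; the sizes and scores are copied from the ranking.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    size_b: float  # parameter count, in billions
    score: float   # LiveBench Coding score


# The 13 open models from the ranking (size and score as listed there).
OPEN_MODELS = [
    Model("QwQ 32B", 32.8, 72.2),
    Model("DeepSeek v3 0324", 684.5, 70.9),
    Model("DeepSeek R1", 684.5, 66.7),
    Model("Qwen2.5 Coder 32B Instruct", 32.8, 56.9),
    Model("DeepSeek R1 Distill Llama 70B", 70.0, 51.6),
    Model("Gemma 3 27B IT", 27.4, 39.9),
    Model("QwQ 32B Preview", 32.8, 37.2),
    Model("Llama 3.3 70B Instruct", 70.6, 36.6),
    Model("Gemma 2 27B IT", 27.2, 36.0),
    Model("DeepSeek R1 Distill Qwen 32B", 32.8, 33.7),
    Model("Phi 4", 14.7, 30.7),
    Model("Gemma 2 9B IT", 9.2, 22.5),
    Model("Phi 3 Mini 4k Instruct", 3.8, 15.5),
]


def efficiency_frontier(models):
    """Return the models that set a new best score when scanned by size."""
    # Sort by size ascending; break size ties by score descending so the
    # strongest model at a given size is considered first.
    ordered = sorted(models, key=lambda m: (m.size_b, -m.score))
    frontier = []
    best = float("-inf")
    for m in ordered:
        if m.score > best:  # strictly better than anything this small or smaller
            frontier.append(m)
            best = m.score
    return frontier


if __name__ == "__main__":
    for m in efficiency_frontier(OPEN_MODELS):
        print(f"{m.name}, {m.size_b}B, score {m.score}")
```

Under these assumptions the scan selects the same six models listed above. Nothing larger than QwQ 32B appears because no larger open model beats its 72.2.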

LiveBench Coding: frequently asked questions

What is the best open LLM on LiveBench Coding?
QwQ 32B is the top open model on LiveBench Coding, scoring 72.2. Among all models tested — including proprietary ones — it ranks #5.
What's the best LiveBench Coding model you can run on a 24 GB GPU?
QwQ 32B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 72.2 on LiveBench Coding.
What's the best LiveBench Coding model you can run on a 12 GB GPU?
Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 30.7 on LiveBench Coding.
Can open models match proprietary models on LiveBench Coding?
Not quite. The strongest proprietary model (gemini-2.5-pro-exp-03-25) scores 85.9, which is 13.7 points ahead of the best open model (QwQ 32B) at 72.2, but you can run the open one yourself.
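
The memory figures in the 24 GB and 12 GB answers can be approximated with simple arithmetic. The sketch below is an assumption, not llmrun's documented formula: it takes weight size as parameters × bits / 8 and adds a flat 10% overhead, which happens to land near the quoted "about 18 GB" and "about 8 GB". Real usage also depends on context length, KV cache, and quantization format. The names `estimate_vram_gb` and `best_for_budget` are hypothetical.

```python
# (name, parameters in billions, LiveBench Coding score) for the open models
# in the ranking.
OPEN_MODELS = [
    ("QwQ 32B", 32.8, 72.2),
    ("DeepSeek v3 0324", 684.5, 70.9),
    ("DeepSeek R1", 684.5, 66.7),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 56.9),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 51.6),
    ("Gemma 3 27B IT", 27.4, 39.9),
    ("QwQ 32B Preview", 32.8, 37.2),
    ("Llama 3.3 70B Instruct", 70.6, 36.6),
    ("Gemma 2 27B IT", 27.2, 36.0),
    ("DeepSeek R1 Distill Qwen 32B", 32.8, 33.7),
    ("Phi 4", 14.7, 30.7),
    ("Gemma 2 9B IT", 9.2, 22.5),
    ("Phi 3 Mini 4k Instruct", 3.8, 15.5),
]


def estimate_vram_gb(params_b, bits_per_weight=4, overhead=0.10):
    """Rough memory estimate in GB; the 10% overhead is an assumed figure."""
    weight_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits is ~1 GB
    return weight_gb * (1 + overhead)


def best_for_budget(models, budget_gb, bits_per_weight=4):
    """Return the highest-scoring model whose estimate fits, or None."""
    fitting = [
        m for m in models
        if estimate_vram_gb(m[1], bits_per_weight) <= budget_gb
    ]
    return max(fitting, key=lambda m: m[2], default=None)


if __name__ == "__main__":
    for budget in (24, 12):
        pick = best_for_budget(OPEN_MODELS, budget)
        if pick is not None:
            name, size_b, score = pick
            print(f"{budget} GB: {name} (~{estimate_vram_gb(size_b):.0f} GB), score {score}")
```

With these assumed numbers the picks agree with the answers above: QwQ 32B for 24 GB and Phi 4 for 12 GB.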

Scores aggregated from livebench. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.