LiveBench Reasoning Leaderboard

LiveBench Reasoning measures logical, multi-step reasoning using contamination-free questions that are refreshed regularly, so models cannot have trained on the test set.

Source: livebench · 13 open models ranked + 39 proprietary · Data through Nov 2025

All models ranked on LiveBench Reasoning

Proprietary / closed models are labeled "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.

# · Model · Size · Score
1 · gpt-5.1-2025-11-13_high · proprietary · 95.8
2 · o1-2024-12-17_high · proprietary · 91.6
3 · gemini-2.5-pro-exp-03-25 · proprietary · 89.8
4 · o3-mini-2025-01-31_high · proprietary · 89.6
5 · o3-mini-2025-01-31_medium · proprietary · 86.3
6 · QwQ 32B · 32.8B · 83.5
7 · DeepSeek R1 · 684.5B · 83.2
8 · gemini-2.0-flash-thinking-exp-01-21 · proprietary · 78.2
9 · o1-mini-2024-09-12_medium · proprietary · 72.3
10 · gpt-4.5-preview-2025-02-27 · proprietary · 71.1
11 · o3-mini-2025-01-31_low · proprietary · 69.8
12 · DeepSeek R1 Distill Llama 70B · 70B · 67.6
13 · claude-3-7-sonnet-20250219 · proprietary · 66.0
14 · DeepSeek v3 0324 · 684.5B · 65.8
15 · gemini-2.0-pro-exp-02-05 · proprietary · 60.1
16 · gemini-2.0-flash-exp · proprietary · 59.1
17 · QwQ 32B Preview · 32.8B · 57.7
18 · gemini-exp-1206 · proprietary · 57.0
19 · DeepSeek-V3 · proprietary · 56.8
20 · claude-3-5-sonnet-20241022 · proprietary · 56.7
21 · gpt-4o-2024-11-20 · proprietary · 55.8
22 · gemini-2.0-flash-001 · proprietary · 55.3
23 · grok-2-1212 · proprietary · 54.8
24 · gpt-4o-2024-08-06 · proprietary · 53.9
25 · DeepSeek R1 Distill Qwen 32B · 32.8B · 52.3
26 · qwen2.5-max · proprietary · 51.4
27 · Llama 3.3 70B Instruct · 70.6B · 50.8
28 · gemini-2.0-flash-lite-preview-02-05 · proprietary · 50.1
29 · Phi 4 · 14.7B · 47.8
30 · Dracarys2-72B-Instruct · proprietary · 47.4
31 · sonar · proprietary · 46.3
32 · gemini-2.0-flash-lite · proprietary · 44.9
33 · mistral-small-2503 · proprietary · 44.8
34 · Dracarys2-Llama-3.1-70B-Instruct · proprietary · 44.7
35 · Gemma 3 27B IT · 27.4B · 43.8
36 · mistral-large-2411 · proprietary · 43.5
37 · learnlm-1.5-pro-experimental · proprietary · 43.4
38 · Qwen2.5 Coder 32B Instruct · 32.8B · 42.1
39 · claude-3-opus-20240229 · proprietary · 40.6
40 · amazon.nova-lite-v1:0 · proprietary · 36.7
41 · mistral-small-2501 · proprietary · 36.4
42 · gpt-4o-mini-2024-07-18 · proprietary · 32.8
43 · amazon.nova-pro-v1:0 · proprietary · 32.6
44 · claude-3-5-haiku-20241022 · proprietary · 28.1
45 · Gemma 2 27B IT · 27.2B · 28.1
46 · Phi 3 Mini 4k Instruct · 3.8B · 26.8
47 · amazon.nova-micro-v1:0 · proprietary · 25.1
48 · c4ai-command-r-plus-08-2024 · proprietary · 24.8
49 · c4ai-command-r-08-2024 · proprietary · 21.9
50 · OLMo-2-1124-13B-Instruct · proprietary · 16.3
51 · Phi-3-small-8k-instruct · proprietary · 15.9
52 · Gemma 2 9B IT · 9.2B · 15.2

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: LiveBench Reasoning score (15.2 to 83.5) against model size on a log scale (ticks at 10B and 100B) for the 13 open models.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Phi 3 Mini 4k Instruct, 4B, score 26.8 — on the efficiency frontier (best score at its size or smaller).
  • Phi 4, 15B, score 47.8 — on the efficiency frontier (best score at its size or smaller).
  • QwQ 32B, 33B, score 83.5 — on the efficiency frontier (best score at its size or smaller).
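
The frontier rule described above (a model is on the frontier if no model of the same size or smaller scores higher) can be sketched in a few lines. This is a minimal illustration, not llmrun's actual code: the function name `efficiency_frontier` and the tuple layout are assumptions, and the sizes and scores are copied from the ranking above.

```python
# (name, size in billions of parameters, score) for the 13 open models
# in the ranking. The data layout is an assumption made for this sketch.
OPEN_MODELS = [
    ("QwQ 32B", 32.8, 83.5),
    ("DeepSeek R1", 684.5, 83.2),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 67.6),
    ("DeepSeek v3 0324", 684.5, 65.8),
    ("QwQ 32B Preview", 32.8, 57.7),
    ("DeepSeek R1 Distill Qwen 32B", 32.8, 52.3),
    ("Llama 3.3 70B Instruct", 70.6, 50.8),
    ("Phi 4", 14.7, 47.8),
    ("Gemma 3 27B IT", 27.4, 43.8),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 42.1),
    ("Gemma 2 27B IT", 27.2, 28.1),
    ("Phi 3 Mini 4k Instruct", 3.8, 26.8),
    ("Gemma 2 9B IT", 9.2, 15.2),
]


def efficiency_frontier(models):
    """Return the names of models that beat every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; within one size, try the best score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append(name)
            best_so_far = score
    return frontier


print(efficiency_frontier(OPEN_MODELS))
```

On this data the sketch returns the same three models listed above. Larger models such as DeepSeek R1 (83.2) fall off the frontier because QwQ 32B already scores 83.5 at a smaller size.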

LiveBench Reasoning: frequently asked questions

What is the best open LLM on LiveBench Reasoning?
QwQ 32B is the top open model on LiveBench Reasoning, scoring 83.5. Among all models tested — including proprietary ones — it ranks #6.
What's the best LiveBench Reasoning model you can run on a 24 GB GPU?
QwQ 32B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 83.5 on LiveBench Reasoning.
What's the best LiveBench Reasoning model you can run on a 12 GB GPU?
Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 47.8 on LiveBench Reasoning.
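
The "about 18 GB" and "about 8 GB" figures in the two answers above are consistent with a simple back-of-the-envelope estimate: 4 bits per weight plus a small allowance for runtime overhead. The sketch below is an assumption, not llmrun's stated method. The 10% overhead factor and the names `estimate_vram_gb` and `best_fit` are hypothetical, chosen because they reproduce the page's numbers. Real memory use also depends on context length and the inference runtime.

```python
# (name, size in billions of parameters, score) for the 13 open models
# in the ranking. The data layout is an assumption made for this sketch.
OPEN_MODELS = [
    ("QwQ 32B", 32.8, 83.5),
    ("DeepSeek R1", 684.5, 83.2),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 67.6),
    ("DeepSeek v3 0324", 684.5, 65.8),
    ("QwQ 32B Preview", 32.8, 57.7),
    ("DeepSeek R1 Distill Qwen 32B", 32.8, 52.3),
    ("Llama 3.3 70B Instruct", 70.6, 50.8),
    ("Phi 4", 14.7, 47.8),
    ("Gemma 3 27B IT", 27.4, 43.8),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 42.1),
    ("Gemma 2 27B IT", 27.2, 28.1),
    ("Phi 3 Mini 4k Instruct", 3.8, 26.8),
    ("Gemma 2 9B IT", 9.2, 15.2),
]


def estimate_vram_gb(params_billions, bits_per_weight=4, overhead=1.10):
    """Rough memory estimate in GB: weight storage times an assumed overhead factor."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weight_gb * overhead


def best_fit(models, gpu_gb):
    """Highest-scoring model whose estimated footprint fits in gpu_gb, or None."""
    fitting = [m for m in models if estimate_vram_gb(m[1]) <= gpu_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


for gpu_gb in (24, 12):
    name, size, score = best_fit(OPEN_MODELS, gpu_gb)
    print(f"{gpu_gb} GB: {name} (~{estimate_vram_gb(size):.0f} GB, score {score})")
```

Under these assumptions QwQ 32B comes to roughly 18 GB and Phi 4 to roughly 8 GB, and they are the top-scoring fits for 24 GB and 12 GB respectively.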
Can open models match proprietary models on LiveBench Reasoning?
Not quite: the strongest proprietary model (gpt-5.1-2025-11-13_high) scores 95.8, which is 12.3 points ahead of the best open model (QwQ 32B) at 83.5. The open model, however, is one you can run yourself.

Scores are aggregated from livebench. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.