LiveBench Reasoning Leaderboard

LiveBench Reasoning measures logical, multi-step reasoning using contamination-free questions that are refreshed regularly, so models cannot have trained on the test set.

Source: livebench · 13 open models ranked · +39 proprietary · Data through Nov 2025

Open models ranked on LiveBench Reasoning

# shows rank among open models / rank overall (including proprietary).

#        Model                                  Score
1 / 6    QwQ 32B · 32.8B                        83.5
2 / 7    DeepSeek R1 · 684.5B                   83.2
3 / 12   DeepSeek R1 Distill Llama 70B · 70B    67.6
4 / 14   DeepSeek v3 0324 · 684.5B              65.8
5 / 17   QwQ 32B Preview · 32.8B                57.7
6 / 25   DeepSeek R1 Distill Qwen 32B · 32.8B   52.3
7 / 27   Llama 3.3 70B Instruct · 70.6B         50.8
8 / 29   Phi 4 · 14.7B                          47.8
9 / 35   Gemma 3 27B IT · 27.4B                 43.8
10 / 38  Qwen2.5 Coder 32B Instruct · 32.8B     42.1
11 / 45  Gemma 2 27B IT · 27.2B                 28.1
12 / 46  Phi 3 Mini 4k Instruct · 3.8B          26.8
13 / 52  Gemma 2 9B IT · 9.2B                   15.2

Score vs model size

This chart shows which models deliver the highest score for their size, which makes them the strongest candidates for running locally.

[Scatter plot: LiveBench Reasoning score (15.2 to 83.5) against model size (log scale, axis ticks at 10B and 100B). One dot per open model in the table above; Phi 3 Mini 4k Instruct, Phi 4, and QwQ 32B are labeled.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Phi 3 Mini 4k Instruct, 4B, score 26.8 — on the efficiency frontier (best score at its size or smaller).
  • Phi 4, 15B, score 47.8 — on the efficiency frontier (best score at its size or smaller).
  • QwQ 32B, 33B, score 83.5 — on the efficiency frontier (best score at its size or smaller).
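The frontier rule described above ("the best score you can get at each size or smaller") can be sketched in a few lines of Python. This is an illustrative reconstruction, not the page's own code: the `MODELS` list is copied by hand from the table, and the function name `efficiency_frontier` and the tie-breaking choice (at equal size, only the top scorer qualifies) are assumptions.

```python
# (name, parameters in billions, LiveBench Reasoning score), copied from the table.
MODELS = [
    ("QwQ 32B", 32.8, 83.5),
    ("DeepSeek R1", 684.5, 83.2),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 67.6),
    ("DeepSeek v3 0324", 684.5, 65.8),
    ("QwQ 32B Preview", 32.8, 57.7),
    ("DeepSeek R1 Distill Qwen 32B", 32.8, 52.3),
    ("Llama 3.3 70B Instruct", 70.6, 50.8),
    ("Phi 4", 14.7, 47.8),
    ("Gemma 3 27B IT", 27.4, 43.8),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 42.1),
    ("Gemma 2 27B IT", 27.2, 28.1),
    ("Phi 3 Mini 4k Instruct", 3.8, 26.8),
    ("Gemma 2 9B IT", 9.2, 15.2),
]


def efficiency_frontier(models):
    """Return the models that outscore every model of equal or smaller size."""
    # Sort by size ascending; at equal size, put the higher score first so that
    # only the best model at that size can join the frontier (assumed tie rule).
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier = []
    best_so_far = float("-inf")
    for name, size_b, score in ordered:
        if score > best_so_far:  # strictly better than anything this small
            frontier.append((name, size_b, score))
            best_so_far = score
    return frontier


if __name__ == "__main__":
    for name, size_b, score in efficiency_frontier(MODELS):
        print(f"{name}: {size_b}B, score {score}")
```

Run against the table data, this selects the same three models listed above. Note that DeepSeek R1 (83.2) is excluded because the much smaller QwQ 32B already scores 83.5.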

LiveBench Reasoning: frequently asked questions

What is the best open LLM on LiveBench Reasoning?
QwQ 32B is the top open model on LiveBench Reasoning, scoring 83.5. Among all models tested — including proprietary ones — it ranks #6.
What's the best LiveBench Reasoning model you can run on a 24 GB GPU?
QwQ 32B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 83.5 on LiveBench Reasoning.
What's the best LiveBench Reasoning model you can run on a 12 GB GPU?
Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 47.8 on LiveBench Reasoning.
Can open models match proprietary models on LiveBench Reasoning?
Not quite. The strongest proprietary model (gpt-5.1-2025-11-13_high) scores 95.8, which is 12.3 points ahead of the best open model (QwQ 32B) at 83.5, but the open model is one you can run yourself.
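The two GPU answers above rest on a memory estimate for 4-bit quantized weights. The page does not state its formula, so the sketch below is a guess: it assumes weight memory is parameters × bits / 8, plus a hypothetical 10% overhead factor picked only because it roughly reproduces the quoted "about 18 GB" and "about 8 GB". The function names and the `overhead` parameter are illustrative, and real memory use also depends on context length, KV cache, and the inference runtime.

```python
# (name, parameters in billions, LiveBench Reasoning score), copied from the table.
MODELS = [
    ("QwQ 32B", 32.8, 83.5),
    ("DeepSeek R1", 684.5, 83.2),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 67.6),
    ("DeepSeek v3 0324", 684.5, 65.8),
    ("QwQ 32B Preview", 32.8, 57.7),
    ("DeepSeek R1 Distill Qwen 32B", 32.8, 52.3),
    ("Llama 3.3 70B Instruct", 70.6, 50.8),
    ("Phi 4", 14.7, 47.8),
    ("Gemma 3 27B IT", 27.4, 43.8),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 42.1),
    ("Gemma 2 27B IT", 27.2, 28.1),
    ("Phi 3 Mini 4k Instruct", 3.8, 26.8),
    ("Gemma 2 9B IT", 9.2, 15.2),
]


def estimate_vram_gb(params_billion, bits=4, overhead=1.1):
    """Rough memory estimate in GB for quantized weights.

    overhead=1.1 is an assumed fudge factor, not a figure from the source.
    """
    weights_gb = params_billion * bits / 8  # 1B params at 8 bits is about 1 GB
    return weights_gb * overhead


def best_model_for_budget(models, budget_gb, bits=4):
    """Return the highest-scoring model whose estimate fits the budget, or None."""
    fitting = [m for m in models if estimate_vram_gb(m[1], bits) <= budget_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


if __name__ == "__main__":
    for budget in (12, 24):
        name, size_b, score = best_model_for_budget(MODELS, budget)
        print(f"{budget} GB: {name} (~{estimate_vram_gb(size_b):.0f} GB), score {score}")
```

Under these assumptions the 24 GB budget selects QwQ 32B and the 12 GB budget selects Phi 4, matching the answers above. Treat the estimates as a lower bound when planning for long contexts.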

Scores are aggregated from livebench. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.