LiveBench Math Leaderboard
LiveBench Math measures mathematical problem-solving on contamination-free, regularly refreshed questions, including competition-style problems.
Source: livebench · 13 open models ranked + 39 proprietary · Data through Nov 2025
Open models ranked on LiveBench Math
The # column shows rank among open models / rank overall (including proprietary models).
| # | Model | Score |
|---|---|---|
| 1 / 3 | DeepSeek R1 · 684.5B | 80.7 |
| 2 / 5 | QwQ 32B · 32.8B | 77.8 |
| 3 / 8 | DeepSeek v3 0324 · 684.5B | 73.5 |
| 4 / 19 | DeepSeek R1 Distill Qwen 32B · 32.8B | 59.4 |
| 5 / 21 | QwQ 32B Preview · 32.8B | 58.3 |
| 6 / 22 | DeepSeek R1 Distill Llama 70B · 70B | 58.1 |
| 7 / 26 | Gemma 3 27B IT · 27.4B | 55.4 |
| 8 / 31 | Qwen2.5 Coder 32B Instruct · 32.8B | 46.6 |
| 9 / 35 | Llama 3.3 70B Instruct · 70.6B | 42.2 |
| 10 / 36 | Phi 4 · 14.7B | 42.0 |
| 11 / 46 | Gemma 2 27B IT · 27.2B | 26.5 |
| 12 / 48 | Gemma 2 9B IT · 9.2B | 19.8 |
| 13 / 51 | Phi 3 Mini 4k Instruct · 3.8B | 15.0 |
Score vs model size
These models give the best score for their size, which makes them the ones worth running locally.
- Phi 3 Mini 4k Instruct, 4B, score 15.0 — on the efficiency frontier (best score at its size or smaller).
- Gemma 2 9B IT, 9B, score 19.8 — on the efficiency frontier (best score at its size or smaller).
- Phi 4, 15B, score 42.0 — on the efficiency frontier (best score at its size or smaller).
- Gemma 3 27B IT, 27B, score 55.4 — on the efficiency frontier (best score at its size or smaller).
- QwQ 32B, 33B, score 77.8 — on the efficiency frontier (best score at its size or smaller).
- DeepSeek R1, 685B, score 80.7 — on the efficiency frontier (best score at its size or smaller).
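The frontier rule quoted in the list ("best score at its size or smaller") can be sketched in a few lines of Python. This is an illustrative reconstruction, not the site's own code: the `MODELS` list is copied from the table above, the function name `efficiency_frontier` is hypothetical, and treating a model as on the frontier only when it strictly beats every smaller or equal-sized model is an assumption about how ties are handled.

```python
# (name, parameters in billions, LiveBench Math score), copied from the table above.
MODELS = [
    ("DeepSeek R1", 684.5, 80.7),
    ("QwQ 32B", 32.8, 77.8),
    ("DeepSeek v3 0324", 684.5, 73.5),
    ("DeepSeek R1 Distill Qwen 32B", 32.8, 59.4),
    ("QwQ 32B Preview", 32.8, 58.3),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 58.1),
    ("Gemma 3 27B IT", 27.4, 55.4),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 46.6),
    ("Llama 3.3 70B Instruct", 70.6, 42.2),
    ("Phi 4", 14.7, 42.0),
    ("Gemma 2 27B IT", 27.2, 26.5),
    ("Gemma 2 9B IT", 9.2, 19.8),
    ("Phi 3 Mini 4k Instruct", 3.8, 15.0),
]


def efficiency_frontier(models):
    """Return models whose score beats every model of equal or smaller size.

    Assumption: ties are broken strictly, so a larger model that only
    matches a smaller model's score is not on the frontier.
    """
    frontier = []
    best_score = float("-inf")
    # Walk from smallest to largest; at equal size, visit the best score first.
    for name, size_b, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, size_b, score))
            best_score = score
    return frontier


if __name__ == "__main__":
    for name, size_b, score in efficiency_frontier(MODELS):
        print(f"{name}: {size_b}B, score {score}")
```

Run against the table, this sketch selects the same six models listed above, from Phi 3 Mini 4k Instruct up to DeepSeek R1.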
LiveBench Math: frequently asked questions
- What is the best open LLM on LiveBench Math?
- DeepSeek R1 is the top open model on LiveBench Math, scoring 80.7. Among all models tested — including proprietary ones — it ranks #3.
- What's the best LiveBench Math model you can run on a 24 GB GPU?
- QwQ 32B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 77.8 on LiveBench Math.
- What's the best LiveBench Math model you can run on a 12 GB GPU?
- Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 42.0 on LiveBench Math.
- Can open models match proprietary models on LiveBench Math?
- Not quite. The strongest proprietary model (gpt-5.1-2025-11-13_high) scores 94.5, ahead of the best open model (DeepSeek R1) at 80.7, a gap of 13.8 points. The open model, however, is one you can run yourself.
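The 24 GB and 12 GB answers above rest on a memory estimate at 4-bit quantization. The page does not state its formula, so the following is only a rough sketch: it assumes 4 bits per parameter for the weights plus a hypothetical 10% overhead, which happens to land near the quoted "about 18 GB" and "about 8 GB" figures. The names `estimate_vram_gb` and `best_model_for_gpu` are illustrative. Real memory use also depends on context length and the runtime, which this ignores.

```python
# (name, parameters in billions, LiveBench Math score), copied from the table above.
MODELS = [
    ("DeepSeek R1", 684.5, 80.7),
    ("QwQ 32B", 32.8, 77.8),
    ("DeepSeek v3 0324", 684.5, 73.5),
    ("DeepSeek R1 Distill Qwen 32B", 32.8, 59.4),
    ("QwQ 32B Preview", 32.8, 58.3),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 58.1),
    ("Gemma 3 27B IT", 27.4, 55.4),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 46.6),
    ("Llama 3.3 70B Instruct", 70.6, 42.2),
    ("Phi 4", 14.7, 42.0),
    ("Gemma 2 27B IT", 27.2, 26.5),
    ("Gemma 2 9B IT", 9.2, 19.8),
    ("Phi 3 Mini 4k Instruct", 3.8, 15.0),
]


def estimate_vram_gb(params_b, bits=4, overhead=1.10):
    """Rough memory estimate in GB for a quantized model.

    params_b is the parameter count in billions, so params_b * bits / 8
    is the weight size in GB. The 10% overhead factor is an assumption,
    not a figure taken from the source.
    """
    return params_b * bits / 8 * overhead


def best_model_for_gpu(models, vram_gb):
    """Return the highest-scoring model whose estimate fits in vram_gb, or None."""
    fitting = [m for m in models if estimate_vram_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


if __name__ == "__main__":
    for gpu_gb in (24, 12):
        best = best_model_for_gpu(MODELS, gpu_gb)
        if best is not None:
            name, size_b, score = best
            print(f"{gpu_gb} GB: {name} (~{estimate_vram_gb(size_b):.0f} GB, score {score})")
```

Under these assumptions the sketch picks QwQ 32B for 24 GB and Phi 4 for 12 GB, in line with the answers above.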
Scores are aggregated from livebench. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.