LiveBench Math Leaderboard

LiveBench Math measures mathematical problem-solving on contamination-free, regularly-refreshed questions, including competition-style problems.

Source: livebench · 13 open models ranked + 39 proprietary · Data through Nov 2025

All models ranked on LiveBench Math

Proprietary / closed models are labeled "proprietary". You can't run them locally, but they show where the open field stands.

# | Model | Score
1 | gpt-5.1-2025-11-13_high · proprietary | 94.5
2 | gemini-2.5-pro-exp-03-25 · proprietary | 90.2
3 | DeepSeek R1 · 684.5B | 80.7
4 | o1-2024-12-17_high · proprietary | 80.3
5 | QwQ 32B · 32.8B | 77.8
6 | o3-mini-2025-01-31_high · proprietary | 77.3
7 | gemini-2.0-flash-thinking-exp-01-21 · proprietary | 75.8
8 | DeepSeek v3 0324 · 684.5B | 73.5
9 | o3-mini-2025-01-31_medium · proprietary | 72.4
10 | gemini-exp-1206 · proprietary | 72.4
11 | gemini-2.0-pro-exp-02-05 · proprietary | 71.0
12 | gpt-4.5-preview-2025-02-27 · proprietary | 69.3
13 | gemini-2.0-flash-001 · proprietary | 65.6
14 | claude-3-7-sonnet-20250219 · proprietary | 63.3
15 | o3-mini-2025-01-31_low · proprietary | 63.1
16 | o1-mini-2024-09-12_medium · proprietary | 62.0
17 | DeepSeek-V3 · proprietary | 60.5
18 | gemini-2.0-flash-exp · proprietary | 60.4
19 | DeepSeek R1 Distill Qwen 32B · 32.8B | 59.4
20 | qwen2.5-max · proprietary | 58.4
21 | QwQ 32B Preview · 32.8B | 58.3
22 | DeepSeek R1 Distill Llama 70B · 70B | 58.1
23 | gemini-2.0-flash-lite · proprietary | 58.1
24 | learnlm-1.5-pro-experimental · proprietary | 57.8
25 | gemini-2.0-flash-lite-preview-02-05 · proprietary | 55.5
26 | Gemma 3 27B IT · 27.4B | 55.4
27 | grok-2-1212 · proprietary | 54.9
28 | Dracarys2-72B-Instruct · proprietary | 54.7
29 | claude-3-5-sonnet-20241022 · proprietary | 52.3
30 | gpt-4o-2024-08-06 · proprietary | 49.5
31 | Qwen2.5 Coder 32B Instruct · 32.8B | 46.6
32 | claude-3-opus-20240229 · proprietary | 43.6
33 | gpt-4o-2024-11-20 · proprietary | 42.9
34 | mistral-large-2411 · proprietary | 42.5
35 | Llama 3.3 70B Instruct · 70.6B | 42.2
36 | Phi 4 · 14.7B | 42.0
37 | sonar · proprietary | 41.6
38 | Dracarys2-Llama-3.1-70B-Instruct · proprietary | 40.3
39 | mistral-small-2501 · proprietary | 39.9
40 | mistral-small-2503 · proprietary | 39.4
41 | amazon.nova-pro-v1:0 · proprietary | 38.0
42 | amazon.nova-lite-v1:0 · proprietary | 36.7
43 | gpt-4o-mini-2024-07-18 · proprietary | 36.3
44 | claude-3-5-haiku-20241022 · proprietary | 35.5
45 | amazon.nova-micro-v1:0 · proprietary | 34.5
46 | Gemma 2 27B IT · 27.2B | 26.5
47 | c4ai-command-r-plus-08-2024 · proprietary | 21.3
48 | Gemma 2 9B IT · 9.2B | 19.8
49 | c4ai-command-r-08-2024 · proprietary | 19.4
50 | Phi-3-small-8k-instruct · proprietary | 17.6
51 | Phi 3 Mini 4k Instruct · 3.8B | 15.0
52 | OLMo-2-1124-13B-Instruct · proprietary | 13.6

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Figure: scatter plot of LiveBench Math score (y-axis, 15.0 to 80.7) against model size (x-axis, log scale) for the 13 open models.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Phi 3 Mini 4k Instruct, 4B, score 15.0 — on the efficiency frontier (best score at its size or smaller).
  • Gemma 2 9B IT, 9B, score 19.8 — on the efficiency frontier (best score at its size or smaller).
  • Phi 4, 15B, score 42.0 — on the efficiency frontier (best score at its size or smaller).
  • Gemma 3 27B IT, 27B, score 55.4 — on the efficiency frontier (best score at its size or smaller).
  • QwQ 32B, 33B, score 77.8 — on the efficiency frontier (best score at its size or smaller).
  • DeepSeek R1, 685B, score 80.7 — on the efficiency frontier (best score at its size or smaller).
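
The frontier rule stated above (best score at each size or smaller) can be sketched in a few lines of Python. This is a minimal illustration, not the site's own code: the function name `efficiency_frontier` and the tuple layout are assumptions, and the sizes and scores are copied from the ranking on this page.

```python
# (name, parameters in billions, LiveBench Math score) for the 13 open models,
# copied from the leaderboard on this page.
OPEN_MODELS = [
    ("DeepSeek R1", 684.5, 80.7),
    ("QwQ 32B", 32.8, 77.8),
    ("DeepSeek v3 0324", 684.5, 73.5),
    ("DeepSeek R1 Distill Qwen 32B", 32.8, 59.4),
    ("QwQ 32B Preview", 32.8, 58.3),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 58.1),
    ("Gemma 3 27B IT", 27.4, 55.4),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 46.6),
    ("Llama 3.3 70B Instruct", 70.6, 42.2),
    ("Phi 4", 14.7, 42.0),
    ("Gemma 2 27B IT", 27.2, 26.5),
    ("Gemma 2 9B IT", 9.2, 19.8),
    ("Phi 3 Mini 4k Instruct", 3.8, 15.0),
]


def efficiency_frontier(models):
    """Return models that beat every model of equal or smaller size.

    Hypothetical helper: walks the models from smallest to largest and keeps
    each one whose score exceeds the best score seen so far.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by size ascending; among equal sizes, try the highest score first.
    for name, size_b, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, size_b, score))
            best_score = score
    return frontier


if __name__ == "__main__":
    for name, size_b, score in efficiency_frontier(OPEN_MODELS):
        print(f"{name}: {size_b}B, score {score}")
```

Run on the data above, this keeps the same six models listed in the bullets; every other open model is matched or beaten by something no larger.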

LiveBench Math: frequently asked questions

What is the best open LLM on LiveBench Math?
DeepSeek R1 is the top open model on LiveBench Math, scoring 80.7. Among all models tested — including proprietary ones — it ranks #3.
What's the best LiveBench Math model you can run on a 24 GB GPU?
QwQ 32B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 77.8 on LiveBench Math.
What's the best LiveBench Math model you can run on a 12 GB GPU?
Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 42.0 on LiveBench Math.
Can open models match proprietary models on LiveBench Math?
Not quite: the strongest proprietary model (gpt-5.1-2025-11-13_high) scores 94.5, which is 13.8 points ahead of the best open model (DeepSeek R1) at 80.7. The open one, however, can be run yourself.
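
The 24 GB and 12 GB answers in this FAQ can be approximated with a rough memory estimate. The sketch below assumes 0.5 bytes per parameter at 4-bit quantization plus a 10% overhead factor. That overhead is an assumption chosen because it reproduces the "about 18 GB" and "about 8 GB" figures quoted here; the page does not state its actual formula, and real usage also depends on context length and quantization format. The helper names `est_4bit_gb` and `best_fit` are hypothetical.

```python
# (name, parameters in billions, LiveBench Math score) for the 13 open models,
# copied from the leaderboard on this page.
OPEN_MODELS = [
    ("DeepSeek R1", 684.5, 80.7),
    ("QwQ 32B", 32.8, 77.8),
    ("DeepSeek v3 0324", 684.5, 73.5),
    ("DeepSeek R1 Distill Qwen 32B", 32.8, 59.4),
    ("QwQ 32B Preview", 32.8, 58.3),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 58.1),
    ("Gemma 3 27B IT", 27.4, 55.4),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 46.6),
    ("Llama 3.3 70B Instruct", 70.6, 42.2),
    ("Phi 4", 14.7, 42.0),
    ("Gemma 2 27B IT", 27.2, 26.5),
    ("Gemma 2 9B IT", 9.2, 19.8),
    ("Phi 3 Mini 4k Instruct", 3.8, 15.0),
]


def est_4bit_gb(params_b, overhead=1.10):
    """Rough 4-bit footprint in GB: 0.5 bytes per parameter times an
    assumed overhead factor (1.10 is a guess, not the site's formula)."""
    return params_b * 0.5 * overhead


def best_fit(models, vram_gb):
    """Highest-scoring model whose estimated footprint fits in vram_gb,
    or None if nothing fits."""
    fitting = [m for m in models if est_4bit_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


if __name__ == "__main__":
    for vram in (24, 12):
        name, size_b, score = best_fit(OPEN_MODELS, vram)
        print(f"{vram} GB: {name} (~{est_4bit_gb(size_b):.0f} GB), score {score}")
```

Under these assumptions the 24 GB pick is QwQ 32B and the 12 GB pick is Phi 4, matching the answers above. A larger overhead factor, or headroom reserved for long contexts, would tighten the cutoff.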

Scores aggregated from livebench. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.