What is the best open LLM on MATH Level 5?

DeepSeek R1 0528 is the top open model on MATH Level 5, scoring 96.6%. Among all models tested, including proprietary ones, it ranks #9. The top model overall is OpenAI's GPT 5 (Aug 07, 2025, high) at 98.1%.

What's the best MATH Level 5 model you can run on a 24 GB GPU?

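DeepSeek R1 Distill Qwen 14B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 87.1% on MATH Level 5. The only open models that score higher have 70.6B or more parameters, which means at least about 35 GB of weights at 4 bits, so they do not fit.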

What's the best MATH Level 5 model you can run on a 12 GB GPU?

DeepSeek R1 Distill Qwen 14B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 87.1% on MATH Level 5.
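
The two hardware answers above can be reproduced with a rough back-of-the-envelope estimate. The sketch below is illustrative only: the function names and the 1.1 overhead factor are assumptions, not the formula this page uses, and the estimate ignores KV cache and activation memory. The rows are copied from the leaderboard table.

```python
# (model, parameters in billions, MATH Level 5 score in %), copied from the table
MODELS = [
    ("DeepSeek R1 0528", 684.5, 96.6),
    ("DeepSeek R1 Distill Llama 70B", 70.6, 89.9),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 87.1),
    ("Gemma 3 27B IT", 27.4, 74.0),
    ("Phi 4", 14.7, 64.9),
    ("Llama 3.1 8B Instruct", 8.0, 22.9),
]


def weight_gb(params_b, bits=4, overhead=1.1):
    """Approximate size of quantized weights in GB.

    params_b * bits / 8 is the raw weight size; `overhead` is an assumed
    fudge factor for quantization metadata and runtime buffers.
    """
    return params_b * bits / 8 * overhead


def best_fit(models, budget_gb, bits=4):
    """Return the highest-scoring model whose estimated weights fit the budget."""
    fitting = [m for m in models if weight_gb(m[1], bits) <= budget_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


for budget in (24, 12):
    print(budget, "GB:", best_fit(MODELS, budget))
```

Under these assumptions the 14.8B model comes out at roughly 8 GB, and the same model wins at both 24 GB and 12 GB because the next-better open model is far too large for either budget.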

Can open models match proprietary models on MATH Level 5?

Not quite: the strongest proprietary model, GPT 5 (Aug 07, 2025, high), scores 98.1%, 1.5 points ahead of the best open model, DeepSeek R1 0528, at 96.6%. The open model, however, is one you can run yourself.

MATH Level 5 Leaderboard

Name: MATH Level 5 — open LLM scores
Creator: epoch

MATH Level 5 covers the hardest tier of competition-mathematics problems, testing multi-step symbolic and quantitative reasoning.

Source: epoch · 32 open models ranked + 76 proprietary · Data through Oct 2025

Open models ranked on MATH Level 5

# shows rank among open models / rank overall (including proprietary).

#	Model	Score
1 / 9	DeepSeek R1 0528 · 684.5B	96.6%
2 / 19	DeepSeek R1 · 684.5B	93.0%
3 / 23	DeepSeek R1 Distill Llama 70B · 70.6B	89.9%
4 / 28	DeepSeek R1 Distill Qwen 14B · 14.8B	87.1%
5 / 40	DeepSeek v3 0324 · 684.5B	75.5%
6 / 41	Gemma 3 27B IT · 27.4B	74.0%
7 / 42	Llama 4 Maverick 17B 128E Instruct · 401.6B	73.0%
8 / 45	Qwen3 235B A22B · 235.1B	68.9%
9 / 49	Phi 4 · 14.7B	64.9%
10 / 50	DeepSeek v3 · 684.5B	64.8%
11 / 52	Qwen2.5 72B Instruct · 72.7B	63.2%
12 / 53	Llama 4 Scout 17B 16E Instruct · 108.6B	62.3%
13 / 57	Qwen2.5 32B Instruct · 32.8B	56.1%
14 / 64	Llama 3.1 405B Instruct · 405.9B	49.8%
15 / 70	Llama 3.1 Tulu 3 70B DPO · 70.6B	42.7%
16 / 71	Llama 3.3 70B Instruct · 70.6B	41.6%
17 / 74	Llama 3.2 90B Vision Instruct · 88.6B	39.4%
18 / 75	Qwen2 72B Instruct · 72.7B	39.1%
19 / 77	Llama 3.1 70B Instruct · 70.6B	36.7%
20 / 79	Gemma 2 27B IT · 27.2B	27.9%
21 / 80	WizardLM 2 8x22B · 140.6B	25.7%
22 / 81	Yi 1.5 34B Chat · 34.4B	25.5%
23 / 86	Llama 3.1 8B Instruct · 8.0B	22.9%
24 / 87	Hermes 2 Theta Llama 3 70B · 70.6B	22.7%
25 / 88	Meta Llama 3 70B Instruct · 70.6B	22.6%
26 / 89	Gemma 2 9B IT · 9.2B	21.0%
27 / 102	Mixtral 8x7B Instruct v0.1 · 46.7B	9.3%
28 / 103	DeepSeek LLM 67B Chat · 67B	6.4%
29 / 104	Meta Llama 3 8B Instruct · 8.0B	6.1%
30 / 105	Yi 34B Chat · 34.4B	5.1%
31 / 107	Mistral 7B Instruct v0.3 · 7.2B	3.6%
32 / 108	Llama 2 70B Chat HF · 69.0B	3.3%
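
For readers who want to load the table programmatically, here is a minimal parsing sketch. It assumes each row is tab-separated exactly as shown above, with the parameter count following the last " · " in the model column. `Row`, `parse_row`, and `proprietary_ahead` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Row(NamedTuple):
    open_rank: int      # rank among open models
    overall_rank: int   # rank among all models, including proprietary
    model: str
    params_b: float     # parameters, in billions
    score: float        # MATH Level 5 score, in percent


def parse_row(line: str) -> Row:
    """Parse one leaderboard row such as '4 / 28<TAB>Name · 14.8B<TAB>87.1%'."""
    ranks, model_field, score = line.split("\t")
    open_rank, overall_rank = (int(r) for r in ranks.split("/"))
    model, size = model_field.rsplit(" · ", 1)
    return Row(open_rank, overall_rank, model,
               float(size.rstrip("B")), float(score.rstrip("%")))


def proprietary_ahead(row: Row) -> int:
    """Number of proprietary models ranked above this one (assumes no ties)."""
    return row.overall_rank - row.open_rank


row = parse_row("1 / 9\tDeepSeek R1 0528 · 684.5B\t96.6%")
print(row.model, proprietary_ahead(row))
```

The difference between the two ranks is the number of proprietary models placed above a given open model, which is how the "1 / 9" entry for DeepSeek R1 0528 should be read: eight proprietary models score higher.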

Score vs model size

This chart shows which models deliver the highest score for their size, and so are the most practical to run locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
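
The frontier described above can be recomputed from the table as a running maximum over models sorted by size. This is a sketch under the assumption that the dashed line is defined exactly as the caption says (best score at each size or smaller); `efficiency_frontier` is a hypothetical helper, and the rows are copied from the leaderboard table.

```python
# (model, parameters in billions, score in %), copied from the leaderboard table
MODELS = [
    ("DeepSeek R1 0528", 684.5, 96.6), ("DeepSeek R1", 684.5, 93.0),
    ("DeepSeek R1 Distill Llama 70B", 70.6, 89.9),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 87.1),
    ("DeepSeek v3 0324", 684.5, 75.5), ("Gemma 3 27B IT", 27.4, 74.0),
    ("Llama 4 Maverick 17B 128E Instruct", 401.6, 73.0),
    ("Qwen3 235B A22B", 235.1, 68.9), ("Phi 4", 14.7, 64.9),
    ("DeepSeek v3", 684.5, 64.8), ("Qwen2.5 72B Instruct", 72.7, 63.2),
    ("Llama 4 Scout 17B 16E Instruct", 108.6, 62.3),
    ("Qwen2.5 32B Instruct", 32.8, 56.1),
    ("Llama 3.1 405B Instruct", 405.9, 49.8),
    ("Llama 3.1 Tulu 3 70B DPO", 70.6, 42.7),
    ("Llama 3.3 70B Instruct", 70.6, 41.6),
    ("Llama 3.2 90B Vision Instruct", 88.6, 39.4),
    ("Qwen2 72B Instruct", 72.7, 39.1), ("Llama 3.1 70B Instruct", 70.6, 36.7),
    ("Gemma 2 27B IT", 27.2, 27.9), ("WizardLM 2 8x22B", 140.6, 25.7),
    ("Yi 1.5 34B Chat", 34.4, 25.5), ("Llama 3.1 8B Instruct", 8.0, 22.9),
    ("Hermes 2 Theta Llama 3 70B", 70.6, 22.7),
    ("Meta Llama 3 70B Instruct", 70.6, 22.6), ("Gemma 2 9B IT", 9.2, 21.0),
    ("Mixtral 8x7B Instruct v0.1", 46.7, 9.3),
    ("Deepseek Llm 67B Chat", 67.0, 6.4),
    ("Meta Llama 3 8B Instruct", 8.0, 6.1), ("Yi 34B Chat", 34.4, 5.1),
    ("Mistral 7B Instruct v0.3", 7.2, 3.6), ("Llama 2 70B Chat HF", 69.0, 3.3),
]


def efficiency_frontier(models):
    """Return the models that beat every model of equal or smaller size."""
    frontier = []
    best = float("-inf")
    # Walk from smallest to largest; for equal sizes, visit the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best:
            frontier.append((name, size, score))
            best = score
    return frontier


for name, size, score in efficiency_frontier(MODELS):
    print(f"{size:>6.1f}B  {score:>5.1f}%  {name}")
```

Every model that is not on this list is matched or beaten by something no larger than itself, which is why the frontier is a useful shortlist when choosing what to run locally.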

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.