What is the best open LLM on MATH Level 5?

DeepSeek R1 0528 is the top open model on MATH Level 5, scoring 96.6%. Among all models tested, including proprietary ones, it ranks #9. The top model overall is OpenAI's GPT 5 (Aug 07, 2025, high) at 98.1%.

What's the best MATH Level 5 model you can run on a 24 GB GPU?

DeepSeek R1 Distill Qwen 14B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 87.1% on MATH Level 5.

What's the best MATH Level 5 model you can run on a 12 GB GPU?

DeepSeek R1 Distill Qwen 14B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 87.1% on MATH Level 5.
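The "about 8 GB" figure can be sanity-checked with simple arithmetic. The sketch below is a rough estimate, not the method the leaderboard uses: it assumes the weights take bits/8 bytes per parameter and adds a fixed allowance for KV cache and runtime buffers. The function names and the 1 GB overhead are illustrative assumptions.

```python
def quantized_weight_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bits_per_weight / 8


def fits_on_gpu(params_billions: float, vram_gb: float,
                bits_per_weight: float = 4.0, overhead_gb: float = 1.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in vram_gb.

    overhead_gb is a hypothetical allowance (KV cache, activations,
    runtime buffers); real usage depends on context length and runtime.
    """
    return quantized_weight_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Parameter counts (in billions) are taken from the leaderboard table.
    for name, size_b in [("DeepSeek R1 Distill Qwen 14B", 14.8),
                         ("Gemma 3 27B IT", 27.4),
                         ("DeepSeek R1 Distill Llama 70B", 70.6)]:
        print(name, round(quantized_weight_gb(size_b), 1),
              fits_on_gpu(size_b, 12), fits_on_gpu(size_b, 24))
```

Under these assumptions the 14.8B model needs roughly 7.4 GB for weights, which is consistent with "about 8 GB" once some overhead is added. The 70.6B distill needs roughly 35 GB and so does not fit in 24 GB, while Gemma 3 27B IT fits in 24 GB but scores lower (74.0%). That is why the same 14B model is the answer for both GPU sizes.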

Can open models match proprietary models on MATH Level 5?

Not quite: the strongest proprietary model, GPT 5 (Aug 07, 2025, high), scores 98.1%, which is 1.5 percentage points ahead of the best open model, DeepSeek R1 0528, at 96.6%. The open model, however, is one you can run yourself.

MATH Level 5 Leaderboard

Name: MATH Level 5 — open LLM scores
Creator: epoch

MATH Level 5 covers the hardest tier of competition-mathematics problems, testing multi-step symbolic and quantitative reasoning.

Source: epoch · 32 open models ranked + 76 proprietary · Data through Oct 2025

All models ranked on MATH Level 5

Proprietary / closed models are labeled "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.

#	Model · parameter count	Score
1	GPT 5 (Aug 07, 2025, high) · proprietary	98.1%
2	GPT 5 (Aug 07, 2025, medium) · proprietary	97.9%
3	GPT 5 Mini (Aug 07, 2025, high) · proprietary	97.9%
4	O4 Mini (Apr 16, 2025, high) · proprietary	97.8%
5	O3 (Apr 16, 2025, high) · proprietary	97.8%
6	Claude Sonnet 4.5 (Sep 29, 2025, 32K) · proprietary	97.7%
7	Qwen3 Max (Sep 23, 2025) · proprietary	97.1%
8	GPT 5 Mini (Aug 07, 2025, medium) · proprietary	96.8%
9	DeepSeek R1 0528 · 684.5B	96.6%
10	O3 Mini (Jan 31, 2025, high) · proprietary	96.5%
11	Claude Haiku 4.5 (Oct 01, 2025, 32K) · proprietary	96.4%
12	Gemini 2.5 Pro Preview (May 06) · proprietary	95.9%
13	Gemini 2.5 Pro Preview (Mar 25) · proprietary	95.6%
14	GPT 5 Nano (Aug 07, 2025, medium) · proprietary	95.2%
15	O3 Mini (Jan 31, 2025, medium) · proprietary	95.2%
16	GPT 5 Nano (Aug 07, 2025, high) · proprietary	94.9%
17	O1 (Dec 17, 2024, high) · proprietary	94.7%
18	O1 (Dec 17, 2024, medium) · proprietary	94.4%
19	DeepSeek R1 · 684.5B	93.0%
20	Claude 3.7 Sonnet (Feb 19, 2025, 64K) · proprietary	91.2%
21	Grok 3 Mini Beta (low) · proprietary	90.9%
22	Claude 3.7 Sonnet (Feb 19, 2025, 32K) · proprietary	90.0%
23	DeepSeek R1 Distill Llama 70B · 70.6B	89.9%
24	O1 Mini (Sep 12, 2024, high) · proprietary	89.2%
25	Grok 3 Beta · proprietary	88.8%
26	Grok 3 Mini Beta (high) · proprietary	88.1%
27	GPT 4.1 Mini (Apr 14, 2025) · proprietary	87.3%
28	DeepSeek R1 Distill Qwen 14B · 14.8B	87.1%
29	Claude Haiku 4.5 (Oct 01, 2025) · proprietary	86.9%
30	Claude 3.7 Sonnet (Feb 19, 2025, 16K) · proprietary	86.3%
31	Claude Opus 4 (May 14, 2025) · proprietary	85.0%
32	Claude Sonnet 4 (May 14, 2025) · proprietary	84.4%
33	O1 Mini (Sep 12, 2024, medium) · proprietary	84.3%
34	Gemini 2.0 Pro Exp (Feb 05) · proprietary	83.5%
35	GPT 4.1 (Apr 14, 2025) · proprietary	83.0%
36	Gemini 2.0 Flash 001 · proprietary	82.2%
37	O1 Preview (Sep 12, 2024) · proprietary	81.7%
38	Mistral Medium 2505 · proprietary	81.6%
39	GPT 4.5 Preview (Feb 27, 2025) · proprietary	78.6%
40	DeepSeek v3 0324 · 684.5B	75.5%
41	Gemma 3 27B IT · 27.4B	74.0%
42	Llama 4 Maverick 17B 128E Instruct · 401.6B	73.0%
43	Gemini 1.5 Pro 002 · proprietary	70.4%
44	GPT 4.1 Nano (Apr 14, 2025) · proprietary	70.0%
45	Qwen3 235B A22B · 235.1B	68.9%
46	Claude 3.7 Sonnet (Feb 19, 2025) · proprietary	68.2%
47	Qwen Max (Jan 25, 2025) · proprietary	67.2%
48	Qwen Plus (Jan 25, 2025) · proprietary	65.3%
49	Phi 4 · 14.7B	64.9%
50	DeepSeek v3 · 684.5B	64.8%
51	Grok 2 (Dec 12) · proprietary	63.5%
52	Qwen2.5 72B Instruct · 72.7B	63.2%
53	Llama 4 Scout 17B 16E Instruct · 108.6B	62.3%
54	Gemini 1.5 Flash 002 · proprietary	61.9%
55	Claude 3.5 Sonnet (Oct 22, 2024) · proprietary	57.0%
56	Qwen Turbo (Nov 01, 2024) · proprietary	56.2%
57	Qwen2.5 32B Instruct · 32.8B	56.1%
58	GPT 4o (Aug 06, 2024) · proprietary	53.3%
59	GPT 4o Mini (Jul 18, 2024) · proprietary	52.6%
60	Claude 3.5 Sonnet (Jun 20, 2024) · proprietary	51.7%
61	GPT 4o (May 13, 2024) · proprietary	51.0%
62	Mistral Large 2411 · proprietary	50.3%
63	GPT 4o (Nov 20, 2024) · proprietary	49.8%
64	Llama 3.1 405B Instruct · 405.9B	49.8%
65	Mistral Small 2503 · proprietary	46.8%
66	GPT 4 Turbo (Apr 09, 2024) · proprietary	46.7%
67	Claude 3.5 Haiku (Oct 22, 2024) · proprietary	46.4%
68	Mistral Large 2407 · proprietary	44.8%
69	Mistral Small 2501 · proprietary	44.8%
70	Llama 3.1 Tulu 3 70B DPO · 70.6B	42.7%
71	Llama 3.3 70B Instruct · 70.6B	41.6%
72	Gemini 1.5 Pro 001 · proprietary	40.8%
73	GPT 4 1106 Preview · proprietary	40.0%
74	Llama 3.2 90B Vision Instruct · 88.6B	39.4%
75	Qwen2 72B Instruct · 72.7B	39.1%
76	Claude 3 Opus (Feb 29, 2024) · proprietary	37.5%
77	Llama 3.1 70B Instruct · 70.6B	36.7%
78	GPT 4 0125 Preview · proprietary	35.4%
79	Gemma 2 27B IT · 27.2B	27.9%
80	WizardLM 2 8x22B · 140.6B	25.7%
81	Yi 1.5 34B Chat · 34.4B	25.5%
82	Gemini 1.5 Flash 001 · proprietary	25.1%
83	Mistral Large 2402 · proprietary	24.5%
84	Open Mixtral 8x22b · proprietary	24.2%
85	GPT 4 (Jun 13) · proprietary	23.0%
86	Llama 3.1 8B Instruct · 8.0B	22.9%
87	Hermes 2 Theta Llama 3 70B · 70.6B	22.7%
88	Meta Llama 3 70B Instruct · 70.6B	22.6%
89	Gemma 2 9B IT · 9.2B	21.0%
90	Claude 3 Sonnet (Feb 29, 2024) · proprietary	18.2%
91	Phi 3 Medium 128K Instruct · proprietary	17.6%
92	GPT 3.5 Turbo (Nov 06) · proprietary	15.9%
93	Ministral 8B 2410 · proprietary	14.9%
94	Claude 3 Haiku (Mar 07, 2024) · proprietary	14.9%
95	Ministral 3B 2410 · proprietary	14.4%
96	Claude 2.0 · proprietary	11.7%
97	Dbrx Instruct · proprietary	11.7%
98	GPT 3.5 Turbo (Jan 25) · proprietary	11.6%
99	Gemini 1.0 Pro 001 · proprietary	11.2%
100	Open Mistral Nemo 2407 · proprietary	10.8%
101	Open Mixtral 8x7b · proprietary	10.0%
102	Mixtral 8x7B Instruct v0.1 · 46.7B	9.3%
103	Deepseek Llm 67B Chat · 67B	6.4%
104	Meta Llama 3 8B Instruct · 8.0B	6.1%
105	Yi 34B Chat · 34.4B	5.1%
106	Open Mistral 7B · proprietary	3.7%
107	Mistral 7B Instruct v0.3 · 7.2B	3.6%
108	Llama 2 70B Chat HF · 69.0B	3.3%
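
The rank of the best open model and its gap to the top score can be read off the table mechanically. The following is a minimal sketch, assuming the rows are tab-separated as above and that the text after " · " is either "proprietary" or a parameter count ending in "B". The `parse_rows` helper and the four-row sample are illustrative only.

```python
# Four rows copied from the table above (tab-separated), as a small sample.
SAMPLE_ROWS = (
    "1\tGPT 5 (Aug 07, 2025, high) · proprietary\t98.1%\n"
    "8\tGPT 5 Mini (Aug 07, 2025, medium) · proprietary\t96.8%\n"
    "9\tDeepSeek R1 0528 · 684.5B\t96.6%\n"
    "10\tO3 Mini (Jan 31, 2025, high) · proprietary\t96.5%\n"
)


def parse_rows(text: str) -> list[dict]:
    """Parse 'rank<TAB>name · tag<TAB>score%' lines into dicts."""
    rows = []
    for line in text.strip().splitlines():
        rank, model, score = line.split("\t")
        # Split on the last ' · ' so the tag is separated from the name.
        name, _, tag = model.rpartition(" · ")
        is_open = tag != "proprietary"
        rows.append({
            "rank": int(rank),
            "name": name,
            "open": is_open,
            # Parameter count in billions, only known for open models.
            "params_b": float(tag.rstrip("B")) if is_open else None,
            "score": float(score.rstrip("%")),
        })
    return rows


rows = parse_rows(SAMPLE_ROWS)
top = max(rows, key=lambda r: r["score"])
best_open = max((r for r in rows if r["open"]), key=lambda r: r["score"])
gap_points = round(top["score"] - best_open["score"], 1)
print(best_open["name"], best_open["rank"], gap_points)
```

On this sample the best open model is DeepSeek R1 0528 at rank 9, 1.5 percentage points behind the top score.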

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: score vs. model size.] Each dot is a model. Higher means a better score; further left means a smaller model (easier to run locally). The dashed line marks the efficiency frontier: the best score you can get at each size or smaller.
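
The efficiency frontier described above can be computed with a single pass over the models sorted by size. This is a minimal sketch of that idea, not the site's own code: the `Model` type and `efficiency_frontier` function are hypothetical names, and the list is a nine-model subset of the open models in the table.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions, from the table
    score: float     # MATH Level 5 score in percent, from the table


# A subset of the open models listed in the leaderboard above.
OPEN_MODELS = [
    Model("DeepSeek R1 0528", 684.5, 96.6),
    Model("DeepSeek R1", 684.5, 93.0),
    Model("DeepSeek R1 Distill Llama 70B", 70.6, 89.9),
    Model("DeepSeek R1 Distill Qwen 14B", 14.8, 87.1),
    Model("Gemma 3 27B IT", 27.4, 74.0),
    Model("Phi 4", 14.7, 64.9),
    Model("Llama 3.1 8B Instruct", 8.0, 22.9),
    Model("Gemma 2 9B IT", 9.2, 21.0),
    Model("Mistral 7B Instruct v0.3", 7.2, 3.6),
]


def efficiency_frontier(models: list[Model]) -> list[Model]:
    """Return models that beat every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Smallest first; among equal sizes, highest score first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


frontier_names = [m.name for m in efficiency_frontier(OPEN_MODELS)]
print(frontier_names)
```

For this subset the frontier runs from Mistral 7B Instruct v0.3 through Llama 3.1 8B Instruct, Phi 4, DeepSeek R1 Distill Qwen 14B and DeepSeek R1 Distill Llama 70B up to DeepSeek R1 0528. Gemma 3 27B IT is off the frontier because the smaller 14.8B distill already scores higher.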

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.