
MATH Level 5 Leaderboard

MATH Level 5 covers the hardest tier of competition-mathematics problems, testing multi-step symbolic and quantitative reasoning.

Source: epoch · 23 open models ranked + 85 proprietary · Data through Oct 2025

All models ranked on MATH Level 5

Proprietary / closed models are marked "proprietary"; open models are listed with their parameter count. You can't run the proprietary models locally, but they show where the open field stands.

# | Model | Score
1 | gpt-5-2025-08-07_high · proprietary | 98.1%
2 | gpt-5-2025-08-07_medium · proprietary | 97.9%
3 | gpt-5-mini-2025-08-07_high · proprietary | 97.9%
4 | o4-mini-2025-04-16_high · proprietary | 97.8%
5 | o3-2025-04-16_high · proprietary | 97.8%
6 | claude-sonnet-4-5-20250929_32K · proprietary | 97.7%
7 | qwen3-max-2025-09-23 · proprietary | 97.1%
8 | gpt-5-mini-2025-08-07_medium · proprietary | 96.8%
9 | DeepSeek R1 0528 · 684.5B | 96.6%
10 | o3-mini-2025-01-31_high · proprietary | 96.5%
11 | claude-haiku-4-5-20251001_32K · proprietary | 96.4%
12 | gemini-2.5-pro-preview-05-06 · proprietary | 95.9%
13 | gemini-2.5-pro-preview-03-25 · proprietary | 95.6%
14 | gpt-5-nano-2025-08-07_medium · proprietary | 95.2%
15 | o3-mini-2025-01-31_medium · proprietary | 95.2%
16 | gpt-5-nano-2025-08-07_high · proprietary | 94.9%
17 | o1-2024-12-17_high · proprietary | 94.7%
18 | o1-2024-12-17_medium · proprietary | 94.4%
19 | DeepSeek R1 · 684.5B | 93.0%
20 | claude-3-7-sonnet-20250219_64K · proprietary | 91.2%
21 | grok-3-mini-beta_low · proprietary | 90.9%
22 | claude-3-7-sonnet-20250219_32K · proprietary | 90.0%
23 | DeepSeek R1 Distill Llama 70B · 70B | 89.9%
24 | o1-mini-2024-09-12_high · proprietary | 89.2%
25 | grok-3-beta · proprietary | 88.8%
26 | grok-3-mini-beta_high · proprietary | 88.1%
27 | gpt-4.1-mini-2025-04-14 · proprietary | 87.3%
28 | DeepSeek R1 Distill Qwen 14B · 14.8B | 87.1%
29 | claude-haiku-4-5-20251001 · proprietary | 86.9%
30 | claude-3-7-sonnet-20250219_16K · proprietary | 86.3%
31 | claude-opus-4-20250514 · proprietary | 85.0%
32 | claude-sonnet-4-20250514 · proprietary | 84.4%
33 | o1-mini-2024-09-12_medium · proprietary | 84.3%
34 | gemini-2.0-pro-exp-02-05 · proprietary | 83.5%
35 | gpt-4.1-2025-04-14 · proprietary | 83.0%
36 | gemini-2.0-flash-001 · proprietary | 82.2%
37 | o1-preview-2024-09-12 · proprietary | 81.7%
38 | mistral-medium-2505 · proprietary | 81.6%
39 | gpt-4.5-preview-2025-02-27 · proprietary | 78.6%
40 | DeepSeek v3 0324 · 684.5B | 75.5%
41 | Gemma 3 27B IT · 27.4B | 74.0%
42 | Llama-4-Maverick-17B-128E-Instruct-FP8 · proprietary | 73.0%
43 | gemini-1.5-pro-002 · proprietary | 70.4%
44 | gpt-4.1-nano-2025-04-14 · proprietary | 70.0%
45 | Qwen3 235B A22B · 235.1B | 68.9%
46 | claude-3-7-sonnet-20250219 · proprietary | 68.2%
47 | qwen-max-2025-01-25 · proprietary | 67.2%
48 | qwen-plus-2025-01-25 · proprietary | 65.3%
49 | Phi 4 · 14.7B | 64.9%
50 | DeepSeek-V3 · proprietary | 64.8%
51 | grok-2-1212 · proprietary | 63.5%
52 | Qwen2.5 72B Instruct · 72.7B | 63.2%
53 | Llama 4 Scout 17B 16E Instruct · 108.6B | 62.3%
54 | gemini-1.5-flash-002 · proprietary | 61.9%
55 | claude-3-5-sonnet-20241022 · proprietary | 57.0%
56 | qwen-turbo-2024-11-01 · proprietary | 56.2%
57 | Qwen2.5 32B Instruct · 32B | 56.1%
58 | gpt-4o-2024-08-06 · proprietary | 53.3%
59 | gpt-4o-mini-2024-07-18 · proprietary | 52.6%
60 | claude-3-5-sonnet-20240620 · proprietary | 51.7%
61 | gpt-4o-2024-05-13 · proprietary | 51.0%
62 | mistral-large-2411 · proprietary | 50.3%
63 | gpt-4o-2024-11-20 · proprietary | 49.8%
64 | Llama-3.1-405B-Instruct · proprietary | 49.8%
65 | mistral-small-2503 · proprietary | 46.8%
66 | gpt-4-turbo-2024-04-09 · proprietary | 46.7%
67 | claude-3-5-haiku-20241022 · proprietary | 46.4%
68 | mistral-large-2407 · proprietary | 44.8%
69 | mistral-small-2501 · proprietary | 44.8%
70 | Llama-3.1-Tulu-3-70B-DPO · proprietary | 42.7%
71 | Llama 3.3 70B Instruct · 70.6B | 41.6%
72 | gemini-1.5-pro-001 · proprietary | 40.8%
73 | gpt-4-1106-preview · proprietary | 40.0%
74 | Llama-3.2-90B-Vision-Instruct · proprietary | 39.4%
75 | qwen2-72b-instruct · proprietary | 39.1%
76 | claude-3-opus-20240229 · proprietary | 37.5%
77 | Llama 3.1 70B Instruct · 70.6B | 36.7%
78 | gpt-4-0125-preview · proprietary | 35.4%
79 | Gemma 2 27B IT · 27.2B | 27.9%
80 | WizardLM-2-8x22B · proprietary | 25.7%
81 | Yi 1.5 34B Chat · 34.4B | 25.5%
82 | gemini-1.5-flash-001 · proprietary | 25.1%
83 | mistral-large-2402 · proprietary | 24.5%
84 | open-mixtral-8x22b · proprietary | 24.2%
85 | gpt-4-0613 · proprietary | 23.0%
86 | Llama 3.1 8B Instruct · 8.0B | 22.9%
87 | Hermes-2-Theta-Llama-3-70B · proprietary | 22.7%
88 | Meta Llama 3 70B Instruct · 70.6B | 22.6%
89 | Gemma 2 9B IT · 9.2B | 21.0%
90 | claude-3-sonnet-20240229 · proprietary | 18.2%
91 | Phi-3-medium-128k-instruct · proprietary | 17.6%
92 | gpt-3.5-turbo-1106 · proprietary | 15.9%
93 | ministral-8b-2410 · proprietary | 14.9%
94 | claude-3-haiku-20240307 · proprietary | 14.9%
95 | ministral-3b-2410 · proprietary | 14.4%
96 | claude-2.0 · proprietary | 11.7%
97 | dbrx-instruct · proprietary | 11.7%
98 | gpt-3.5-turbo-0125 · proprietary | 11.6%
99 | gemini-1.0-pro-001 · proprietary | 11.2%
100 | open-mistral-nemo-2407 · proprietary | 10.8%
101 | open-mixtral-8x7b · proprietary | 10.0%
102 | Mixtral 8x7B Instruct v0.1 · 46.7B | 9.3%
103 | Deepseek Llm 67B Chat · 67B | 6.4%
104 | Meta Llama 3 8B Instruct · 8.0B | 6.1%
105 | Yi-34B-Chat · proprietary | 5.1%
106 | open-mistral-7b · proprietary | 3.7%
107 | Mistral 7B Instruct v0.3 · 7.2B | 3.6%
108 | Llama 2 70B Chat HF · 69.0B | 3.3%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: MATH Level 5 score (3.3% to 96.6%) against model size (log scale, ticks at 10B and 100B) for the 23 open models.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
Models on the efficiency frontier (best score at their size or smaller):
  • Mistral 7B Instruct v0.3, 7.2B, score 3.6%
  • Llama 3.1 8B Instruct, 8.0B, score 22.9%
  • Phi 4, 14.7B, score 64.9%
  • DeepSeek R1 Distill Qwen 14B, 14.8B, score 87.1%
  • DeepSeek R1 Distill Llama 70B, 70B, score 89.9%
  • DeepSeek R1 0528, 685B, score 96.6%
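The frontier rule described above (keep a model only if it beats every model of the same size or smaller) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the site's own code: the `Model` type and `efficiency_frontier` function are hypothetical names, and the data is a subset of the open models from the leaderboard, with sizes taken from the ranking table.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    size_b: float  # parameter count in billions, as listed in the ranking
    score: float   # MATH Level 5 score in percent


def efficiency_frontier(models: list[Model]) -> list[Model]:
    """Return the models that score higher than every model of equal or smaller size."""
    # Sort by size ascending; among equal sizes, visit the highest score first.
    ordered = sorted(models, key=lambda m: (m.size_b, -m.score))
    frontier: list[Model] = []
    best_so_far = float("-inf")
    for m in ordered:
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


# Subset of the open models on this page (hypothetical variable name).
MODELS = [
    Model("Mistral 7B Instruct v0.3", 7.2, 3.6),
    Model("Meta Llama 3 8B Instruct", 8.0, 6.1),
    Model("Llama 3.1 8B Instruct", 8.0, 22.9),
    Model("Gemma 2 9B IT", 9.2, 21.0),
    Model("Phi 4", 14.7, 64.9),
    Model("DeepSeek R1 Distill Qwen 14B", 14.8, 87.1),
    Model("Gemma 3 27B IT", 27.4, 74.0),
    Model("DeepSeek R1 Distill Llama 70B", 70.0, 89.9),
    Model("Qwen3 235B A22B", 235.1, 68.9),
    Model("DeepSeek R1", 684.5, 93.0),
    Model("DeepSeek R1 0528", 684.5, 96.6),
]

for m in efficiency_frontier(MODELS):
    print(f"{m.name}: {m.size_b}B, {m.score}%")
```

Using the unrounded sizes matters here: Phi 4 (14.7B) and DeepSeek R1 Distill Qwen 14B (14.8B) both round to 15B, but only the exact values show that each improves on everything smaller.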

MATH Level 5: frequently asked questions

What is the best open LLM on MATH Level 5?
DeepSeek R1 0528 is the top open model on MATH Level 5, scoring 96.6%. Among all models tested — including proprietary ones — it ranks #9.
What's the best MATH Level 5 model you can run on a 24 GB GPU?
DeepSeek R1 Distill Qwen 14B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 87.1% on MATH Level 5.
What's the best MATH Level 5 model you can run on a 12 GB GPU?
DeepSeek R1 Distill Qwen 14B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 87.1% on MATH Level 5.
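The two GPU answers above rest on a size estimate: at 4-bit quantization each parameter takes half a byte, so a 14.8B model needs roughly 7.4 GB for weights alone. The sketch below reproduces that arithmetic and picks the best-scoring open model for a given memory budget. The 10% overhead factor is an assumption made for illustration (actual usage depends on context length, KV cache, and runtime), and the function names are hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, vram_gb: float,
         bits_per_weight: float = 4.0, overhead: float = 1.1) -> bool:
    """True if the weights plus an ASSUMED overhead factor fit in vram_gb."""
    return weight_memory_gb(params_billion, bits_per_weight) * overhead <= vram_gb


def best_for_budget(models, vram_gb: float):
    """Highest-scoring (name, size_b, score) tuple that fits, or None."""
    candidates = [m for m in models if fits(m[1], vram_gb)]
    return max(candidates, key=lambda m: m[2], default=None)


# (name, size in billions of parameters, MATH Level 5 score in percent)
OPEN_MODELS = [
    ("Llama 3.1 8B Instruct", 8.0, 22.9),
    ("Phi 4", 14.7, 64.9),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 87.1),
    ("Gemma 3 27B IT", 27.4, 74.0),
    ("Qwen2.5 32B Instruct", 32.0, 56.1),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 89.9),
]

for budget in (12, 24):
    name, size_b, score = best_for_budget(OPEN_MODELS, budget)
    print(f"{budget} GB: {name} ({weight_memory_gb(size_b):.1f} GB of weights, {score}%)")
```

Under these assumptions the same 14.8B model wins at both budgets: the next model up the score ranking, the 70B distill, needs about 35 GB for weights alone.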
Can open models match proprietary models on MATH Level 5?
Not quite: the strongest proprietary model (gpt-5-2025-08-07_high) scores 98.1%, 1.5 points ahead of the best open model (DeepSeek R1 0528) at 96.6%. The open model, however, is one you can run yourself.

Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.