
MATH Level 5 Leaderboard

MATH Level 5 covers the hardest tier of competition-mathematics problems, testing multi-step symbolic and quantitative reasoning.

Source: epoch · 23 open models ranked + 85 proprietary · Data through Oct 2025

Open models ranked on MATH Level 5

The # column shows rank among open models / rank overall (including proprietary models).

#         Model                           Size    Score
1 / 9     DeepSeek R1 0528                684.5B  96.6%
2 / 19    DeepSeek R1                     684.5B  93.0%
3 / 23    DeepSeek R1 Distill Llama 70B   70B     89.9%
4 / 28    DeepSeek R1 Distill Qwen 14B    14.8B   87.1%
5 / 40    DeepSeek v3 0324                684.5B  75.5%
6 / 41    Gemma 3 27B IT                  27.4B   74.0%
7 / 45    Qwen3 235B A22B                 235.1B  68.9%
8 / 49    Phi 4                           14.7B   64.9%
9 / 52    Qwen2.5 72B Instruct            72.7B   63.2%
10 / 53   Llama 4 Scout 17B 16E Instruct  108.6B  62.3%
11 / 57   Qwen2.5 32B Instruct            32B     56.1%
12 / 71   Llama 3.3 70B Instruct          70.6B   41.6%
13 / 77   Llama 3.1 70B Instruct          70.6B   36.7%
14 / 79   Gemma 2 27B IT                  27.2B   27.9%
15 / 81   Yi 1.5 34B Chat                 34.4B   25.5%
16 / 86   Llama 3.1 8B Instruct           8.0B    22.9%
17 / 88   Meta Llama 3 70B Instruct       70.6B   22.6%
18 / 89   Gemma 2 9B IT                   9.2B    21.0%
19 / 102  Mixtral 8x7B Instruct v0.1      46.7B   9.3%
20 / 103  Deepseek Llm 67B Chat           67B     6.4%
21 / 104  Meta Llama 3 8B Instruct        8.0B    6.1%
22 / 107  Mistral 7B Instruct v0.3        7.2B    3.6%
23 / 108  Llama 2 70B Chat HF             69.0B   3.3%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: MATH Level 5 score (3.3% to 96.6%) against model size on a log scale (axis ticks at 10B and 100B), with one dot per open model on the leaderboard.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Mistral 7B Instruct v0.3, 7B, score 3.6% — on the efficiency frontier (best score at its size or smaller).
  • Llama 3.1 8B Instruct, 8B, score 22.9% — on the efficiency frontier (best score at its size or smaller).
  • Phi 4, 15B, score 64.9% — on the efficiency frontier (best score at its size or smaller).
  • DeepSeek R1 Distill Qwen 14B, 15B, score 87.1% — on the efficiency frontier (best score at its size or smaller).
  • DeepSeek R1 Distill Llama 70B, 70B, score 89.9% — on the efficiency frontier (best score at its size or smaller).
  • DeepSeek R1 0528, 685B, score 96.6% — on the efficiency frontier (best score at its size or smaller).
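
The frontier rule described above (a model is on the frontier when no model of equal or smaller size scores higher) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the code behind the chart: the function name `efficiency_frontier` and the tuple layout are assumptions, and the sizes and scores are copied from the leaderboard.

```python
# (name, size in billions of parameters, MATH Level 5 score in %),
# copied from the leaderboard on this page.
MODELS = [
    ("DeepSeek R1 0528", 684.5, 96.6),
    ("DeepSeek R1", 684.5, 93.0),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 89.9),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 87.1),
    ("DeepSeek v3 0324", 684.5, 75.5),
    ("Gemma 3 27B IT", 27.4, 74.0),
    ("Qwen3 235B A22B", 235.1, 68.9),
    ("Phi 4", 14.7, 64.9),
    ("Qwen2.5 72B Instruct", 72.7, 63.2),
    ("Llama 4 Scout 17B 16E Instruct", 108.6, 62.3),
    ("Qwen2.5 32B Instruct", 32.0, 56.1),
    ("Llama 3.3 70B Instruct", 70.6, 41.6),
    ("Llama 3.1 70B Instruct", 70.6, 36.7),
    ("Gemma 2 27B IT", 27.2, 27.9),
    ("Yi 1.5 34B Chat", 34.4, 25.5),
    ("Llama 3.1 8B Instruct", 8.0, 22.9),
    ("Meta Llama 3 70B Instruct", 70.6, 22.6),
    ("Gemma 2 9B IT", 9.2, 21.0),
    ("Mixtral 8x7B Instruct v0.1", 46.7, 9.3),
    ("Deepseek Llm 67B Chat", 67.0, 6.4),
    ("Meta Llama 3 8B Instruct", 8.0, 6.1),
    ("Mistral 7B Instruct v0.3", 7.2, 3.6),
    ("Llama 2 70B Chat HF", 69.0, 3.3),
]


def efficiency_frontier(models):
    """Return models that outscore every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; within one size, visit the best score
    # first so that a same-size runner-up is not counted.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append((name, size, score))
            best_so_far = score
    return frontier


if __name__ == "__main__":
    for name, size, score in efficiency_frontier(MODELS):
        print(f"{name}: {size}B, {score}%")
```

Run on the table's data, the sketch selects the same six models as the list above. Phi 4 (14.7B) and DeepSeek R1 Distill Qwen 14B (14.8B) both qualify because the chart's 15B labels are rounded and Phi 4 is slightly smaller.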

MATH Level 5: frequently asked questions

What is the best open LLM on MATH Level 5?
DeepSeek R1 0528 is the top open model on MATH Level 5, scoring 96.6%. Among all models tested — including proprietary ones — it ranks #9.
What's the best MATH Level 5 model you can run on a 24 GB GPU?
DeepSeek R1 Distill Qwen 14B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 87.1% on MATH Level 5.
What's the best MATH Level 5 model you can run on a 12 GB GPU?
DeepSeek R1 Distill Qwen 14B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 87.1% on MATH Level 5.
Can open models match proprietary models on MATH Level 5?
Not quite: the strongest proprietary model (gpt-5-2025-08-07_high) scores 98.1%, 1.5 points ahead of the best open model (DeepSeek R1 0528) at 96.6%. The open model, however, is one you can run yourself.
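
The GPU-fit answers above depend on a memory estimate for 4-bit quantized weights. The page does not state its exact sizing formula, so the sketch below assumes the simplest one: weight storage only (parameters × bits per weight / 8, with 1 GB = 10^9 bytes), ignoring KV cache, activations, and runtime overhead. The helper names `weight_gb` and `best_fit` are hypothetical, and the model list is a subset of the leaderboard rows.

```python
# Assumption: memory is approximated by weight storage alone.
# Real deployments also need room for the KV cache and runtime overhead.

# (name, parameters in billions, MATH Level 5 score in %),
# a subset of the leaderboard rows on this page.
MODELS = [
    ("DeepSeek R1 0528", 684.5, 96.6),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 89.9),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 87.1),
    ("Gemma 3 27B IT", 27.4, 74.0),
    ("Phi 4", 14.7, 64.9),
    ("Llama 3.1 8B Instruct", 8.0, 22.9),
]


def weight_gb(params_billion, bits=4):
    """Approximate weight size in GB: billions of params x bits / 8."""
    return params_billion * bits / 8


def best_fit(models, vram_gb, bits=4):
    """Return the highest-scoring model whose weights fit in vram_gb, or None."""
    fitting = [m for m in models if weight_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


if __name__ == "__main__":
    for vram in (12, 24):
        name, size, score = best_fit(MODELS, vram)
        print(f"{vram} GB: {name} ({weight_gb(size):.1f} GB of weights, {score}%)")
```

Under this weights-only assumption, DeepSeek R1 Distill Qwen 14B comes to roughly 7.4 GB, which is consistent with the "about 8 GB" figure quoted above once some headroom is allowed, and it is the pick for both the 12 GB and 24 GB budgets.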

Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.