MATH Level 5 Leaderboard
MATH Level 5 covers the hardest tier of competition-mathematics problems, testing multi-step symbolic and quantitative reasoning.
Source: Epoch · 23 open models ranked + 85 proprietary · Data through Oct 2025
Open models ranked on MATH Level 5
The # column gives each model's rank among open models, then its rank among all models (including proprietary ones).
| # (open / overall) | Model · parameters | Score |
|---|---|---|
| 1 / 9 | DeepSeek R1 0528 · 684.5B | 96.6% |
| 2 / 19 | DeepSeek R1 · 684.5B | 93.0% |
| 3 / 23 | DeepSeek R1 Distill Llama 70B · 70B | 89.9% |
| 4 / 28 | DeepSeek R1 Distill Qwen 14B · 14.8B | 87.1% |
| 5 / 40 | DeepSeek v3 0324 · 684.5B | 75.5% |
| 6 / 41 | Gemma 3 27B IT · 27.4B | 74.0% |
| 7 / 45 | Qwen3 235B A22B · 235.1B | 68.9% |
| 8 / 49 | Phi 4 · 14.7B | 64.9% |
| 9 / 52 | Qwen2.5 72B Instruct · 72.7B | 63.2% |
| 10 / 53 | Llama 4 Scout 17B 16E Instruct · 108.6B | 62.3% |
| 11 / 57 | Qwen2.5 32B Instruct · 32B | 56.1% |
| 12 / 71 | Llama 3.3 70B Instruct · 70.6B | 41.6% |
| 13 / 77 | Llama 3.1 70B Instruct · 70.6B | 36.7% |
| 14 / 79 | Gemma 2 27B IT · 27.2B | 27.9% |
| 15 / 81 | Yi 1.5 34B Chat · 34.4B | 25.5% |
| 16 / 86 | Llama 3.1 8B Instruct · 8.0B | 22.9% |
| 17 / 88 | Meta Llama 3 70B Instruct · 70.6B | 22.6% |
| 18 / 89 | Gemma 2 9B IT · 9.2B | 21.0% |
| 19 / 102 | Mixtral 8x7B Instruct v0.1 · 46.7B | 9.3% |
| 20 / 103 | Deepseek Llm 67B Chat · 67B | 6.4% |
| 21 / 104 | Meta Llama 3 8B Instruct · 8.0B | 6.1% |
| 22 / 107 | Mistral 7B Instruct v0.3 · 7.2B | 3.6% |
| 23 / 108 | Llama 2 70B Chat HF · 69.0B | 3.3% |
Score vs model size
These models sit on the efficiency frontier: each has the best score among models of its size or smaller, which makes them the ones most worth running locally.
- Mistral 7B Instruct v0.3, 7.2B, score 3.6%
- Llama 3.1 8B Instruct, 8.0B, score 22.9%
- Phi 4, 14.7B, score 64.9%
- DeepSeek R1 Distill Qwen 14B, 14.8B, score 87.1%
- DeepSeek R1 Distill Llama 70B, 70B, score 89.9%
- DeepSeek R1 0528, 684.5B, score 96.6%
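The frontier rule ("best score at its size or smaller") can be sketched as a single pass over models sorted by size. This is a minimal illustration, not the page's actual method, which is not documented here. The `Model` type and `efficiency_frontier` function are hypothetical names, the data is a subset of rows copied from the table above, and ties on size are assumed to go to the higher score.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions, as listed in the table
    score: float     # MATH Level 5 score in percent


# Subset of rows copied from the leaderboard table (not the full list).
MODELS = [
    Model("Mistral 7B Instruct v0.3", 7.2, 3.6),
    Model("Llama 3.1 8B Instruct", 8.0, 22.9),
    Model("Meta Llama 3 8B Instruct", 8.0, 6.1),
    Model("Gemma 2 9B IT", 9.2, 21.0),
    Model("Phi 4", 14.7, 64.9),
    Model("DeepSeek R1 Distill Qwen 14B", 14.8, 87.1),
    Model("Gemma 3 27B IT", 27.4, 74.0),
    Model("DeepSeek R1 Distill Llama 70B", 70.0, 89.9),
    Model("Qwen2.5 72B Instruct", 72.7, 63.2),
    Model("DeepSeek R1 0528", 684.5, 96.6),
]


def efficiency_frontier(models):
    """Return models that outscore every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Sort by size ascending; within one size, visit the higher score first
    # so that only the best model of that size can join the frontier.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


if __name__ == "__main__":
    for m in efficiency_frontier(MODELS):
        print(f"{m.name}: {m.params_b}B, {m.score}%")
```

On this subset the function returns the same six models listed above. Gemma 3 27B IT drops out, for example, because the smaller DeepSeek R1 Distill Qwen 14B already scores higher.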
MATH Level 5: frequently asked questions
- What is the best open LLM on MATH Level 5?
- DeepSeek R1 0528 is the top open model on MATH Level 5, scoring 96.6%. Among all models tested — including proprietary ones — it ranks #9.
- What's the best MATH Level 5 model you can run on a 24 GB GPU?
- DeepSeek R1 Distill Qwen 14B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 87.1% on MATH Level 5.
- What's the best MATH Level 5 model you can run on a 12 GB GPU?
- DeepSeek R1 Distill Qwen 14B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 87.1% on MATH Level 5.
- Can open models match proprietary models on MATH Level 5?
- Not quite. The strongest proprietary model (gpt-5-2025-08-07_high) scores 98.1%, ahead of the best open model (DeepSeek R1 0528) at 96.6%, but you can run the open one yourself.
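The GPU answers above rest on a memory estimate for 4-bit weights. A rough sketch of that arithmetic follows, assuming the weight footprint is parameters × bits / 8 and adding a 20% headroom factor for runtime overhead. Both the formula and the headroom value are assumptions for illustration, not the page's documented method, and real usage also depends on context length and the inference runtime. The function names are hypothetical, and the data is a subset of the table.

```python
# (name, parameters in billions, MATH Level 5 score in %), copied from the table.
MODELS = [
    ("DeepSeek R1 0528", 684.5, 96.6),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 89.9),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 87.1),
    ("Gemma 3 27B IT", 27.4, 74.0),
    ("Phi 4", 14.7, 64.9),
    ("Qwen2.5 32B Instruct", 32.0, 56.1),
    ("Llama 3.1 8B Instruct", 8.0, 22.9),
]


def weights_gb(params_b, bits=4):
    """Weights-only footprint in GB: billions of parameters * bits / 8."""
    return params_b * bits / 8


def best_model_for_vram(models, vram_gb, bits=4, headroom=1.2):
    """Pick the highest-scoring model whose estimated footprint fits in VRAM.

    `headroom` is an assumed multiplier for runtime overhead, not a measured value.
    Returns None when nothing fits.
    """
    fitting = [m for m in models if weights_gb(m[1], bits) * headroom <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


if __name__ == "__main__":
    for vram in (12, 24):
        print(vram, "GB:", best_model_for_vram(MODELS, vram))
```

Under these assumptions, DeepSeek R1 Distill Qwen 14B needs about 7.4 GB for weights alone, in line with the "about 8 GB" quoted above. It is the pick for both 12 GB and 24 GB, because the larger models that fit in 24 GB score lower.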
Scores are aggregated from Epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.