MATH Level 5 Leaderboard
MATH Level 5 covers the hardest tier of competition-mathematics problems, testing multi-step symbolic and quantitative reasoning.
Source: epoch. 23 open models ranked, plus 85 proprietary. Data through Oct 2025.
All models ranked on MATH Level 5
Proprietary / closed models are labeled "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5-2025-08-07_high · proprietary | 98.1% |
| 2 | gpt-5-2025-08-07_medium · proprietary | 97.9% |
| 3 | gpt-5-mini-2025-08-07_high · proprietary | 97.9% |
| 4 | o4-mini-2025-04-16_high · proprietary | 97.8% |
| 5 | o3-2025-04-16_high · proprietary | 97.8% |
| 6 | claude-sonnet-4-5-20250929_32K · proprietary | 97.7% |
| 7 | qwen3-max-2025-09-23 · proprietary | 97.1% |
| 8 | gpt-5-mini-2025-08-07_medium · proprietary | 96.8% |
| 9 | DeepSeek R1 0528 · 684.5B | 96.6% |
| 10 | o3-mini-2025-01-31_high · proprietary | 96.5% |
| 11 | claude-haiku-4-5-20251001_32K · proprietary | 96.4% |
| 12 | gemini-2.5-pro-preview-05-06 · proprietary | 95.9% |
| 13 | gemini-2.5-pro-preview-03-25 · proprietary | 95.6% |
| 14 | gpt-5-nano-2025-08-07_medium · proprietary | 95.2% |
| 15 | o3-mini-2025-01-31_medium · proprietary | 95.2% |
| 16 | gpt-5-nano-2025-08-07_high · proprietary | 94.9% |
| 17 | o1-2024-12-17_high · proprietary | 94.7% |
| 18 | o1-2024-12-17_medium · proprietary | 94.4% |
| 19 | DeepSeek R1 · 684.5B | 93.0% |
| 20 | claude-3-7-sonnet-20250219_64K · proprietary | 91.2% |
| 21 | grok-3-mini-beta_low · proprietary | 90.9% |
| 22 | claude-3-7-sonnet-20250219_32K · proprietary | 90.0% |
| 23 | DeepSeek R1 Distill Llama 70B · 70B | 89.9% |
| 24 | o1-mini-2024-09-12_high · proprietary | 89.2% |
| 25 | grok-3-beta · proprietary | 88.8% |
| 26 | grok-3-mini-beta_high · proprietary | 88.1% |
| 27 | gpt-4.1-mini-2025-04-14 · proprietary | 87.3% |
| 28 | DeepSeek R1 Distill Qwen 14B · 14.8B | 87.1% |
| 29 | claude-haiku-4-5-20251001 · proprietary | 86.9% |
| 30 | claude-3-7-sonnet-20250219_16K · proprietary | 86.3% |
| 31 | claude-opus-4-20250514 · proprietary | 85.0% |
| 32 | claude-sonnet-4-20250514 · proprietary | 84.4% |
| 33 | o1-mini-2024-09-12_medium · proprietary | 84.3% |
| 34 | gemini-2.0-pro-exp-02-05 · proprietary | 83.5% |
| 35 | gpt-4.1-2025-04-14 · proprietary | 83.0% |
| 36 | gemini-2.0-flash-001 · proprietary | 82.2% |
| 37 | o1-preview-2024-09-12 · proprietary | 81.7% |
| 38 | mistral-medium-2505 · proprietary | 81.6% |
| 39 | gpt-4.5-preview-2025-02-27 · proprietary | 78.6% |
| 40 | DeepSeek v3 0324 · 684.5B | 75.5% |
| 41 | Gemma 3 27B IT · 27.4B | 74.0% |
| 42 | Llama-4-Maverick-17B-128E-Instruct-FP8 · proprietary | 73.0% |
| 43 | gemini-1.5-pro-002 · proprietary | 70.4% |
| 44 | gpt-4.1-nano-2025-04-14 · proprietary | 70.0% |
| 45 | Qwen3 235B A22B · 235.1B | 68.9% |
| 46 | claude-3-7-sonnet-20250219 · proprietary | 68.2% |
| 47 | qwen-max-2025-01-25 · proprietary | 67.2% |
| 48 | qwen-plus-2025-01-25 · proprietary | 65.3% |
| 49 | Phi 4 · 14.7B | 64.9% |
| 50 | DeepSeek-V3 · proprietary | 64.8% |
| 51 | grok-2-1212 · proprietary | 63.5% |
| 52 | Qwen2.5 72B Instruct · 72.7B | 63.2% |
| 53 | Llama 4 Scout 17B 16E Instruct · 108.6B | 62.3% |
| 54 | gemini-1.5-flash-002 · proprietary | 61.9% |
| 55 | claude-3-5-sonnet-20241022 · proprietary | 57.0% |
| 56 | qwen-turbo-2024-11-01 · proprietary | 56.2% |
| 57 | Qwen2.5 32B Instruct · 32B | 56.1% |
| 58 | gpt-4o-2024-08-06 · proprietary | 53.3% |
| 59 | gpt-4o-mini-2024-07-18 · proprietary | 52.6% |
| 60 | claude-3-5-sonnet-20240620 · proprietary | 51.7% |
| 61 | gpt-4o-2024-05-13 · proprietary | 51.0% |
| 62 | mistral-large-2411 · proprietary | 50.3% |
| 63 | gpt-4o-2024-11-20 · proprietary | 49.8% |
| 64 | Llama-3.1-405B-Instruct · proprietary | 49.8% |
| 65 | mistral-small-2503 · proprietary | 46.8% |
| 66 | gpt-4-turbo-2024-04-09 · proprietary | 46.7% |
| 67 | claude-3-5-haiku-20241022 · proprietary | 46.4% |
| 68 | mistral-large-2407 · proprietary | 44.8% |
| 69 | mistral-small-2501 · proprietary | 44.8% |
| 70 | Llama-3.1-Tulu-3-70B-DPO · proprietary | 42.7% |
| 71 | Llama 3.3 70B Instruct · 70.6B | 41.6% |
| 72 | gemini-1.5-pro-001 · proprietary | 40.8% |
| 73 | gpt-4-1106-preview · proprietary | 40.0% |
| 74 | Llama-3.2-90B-Vision-Instruct · proprietary | 39.4% |
| 75 | qwen2-72b-instruct · proprietary | 39.1% |
| 76 | claude-3-opus-20240229 · proprietary | 37.5% |
| 77 | Llama 3.1 70B Instruct · 70.6B | 36.7% |
| 78 | gpt-4-0125-preview · proprietary | 35.4% |
| 79 | Gemma 2 27B IT · 27.2B | 27.9% |
| 80 | WizardLM-2-8x22B · proprietary | 25.7% |
| 81 | Yi 1.5 34B Chat · 34.4B | 25.5% |
| 82 | gemini-1.5-flash-001 · proprietary | 25.1% |
| 83 | mistral-large-2402 · proprietary | 24.5% |
| 84 | open-mixtral-8x22b · proprietary | 24.2% |
| 85 | gpt-4-0613 · proprietary | 23.0% |
| 86 | Llama 3.1 8B Instruct · 8.0B | 22.9% |
| 87 | Hermes-2-Theta-Llama-3-70B · proprietary | 22.7% |
| 88 | Meta Llama 3 70B Instruct · 70.6B | 22.6% |
| 89 | Gemma 2 9B IT · 9.2B | 21.0% |
| 90 | claude-3-sonnet-20240229 · proprietary | 18.2% |
| 91 | Phi-3-medium-128k-instruct · proprietary | 17.6% |
| 92 | gpt-3.5-turbo-1106 · proprietary | 15.9% |
| 93 | ministral-8b-2410 · proprietary | 14.9% |
| 94 | claude-3-haiku-20240307 · proprietary | 14.9% |
| 95 | ministral-3b-2410 · proprietary | 14.4% |
| 96 | claude-2.0 · proprietary | 11.7% |
| 97 | dbrx-instruct · proprietary | 11.7% |
| 98 | gpt-3.5-turbo-0125 · proprietary | 11.6% |
| 99 | gemini-1.0-pro-001 · proprietary | 11.2% |
| 100 | open-mistral-nemo-2407 · proprietary | 10.8% |
| 101 | open-mixtral-8x7b · proprietary | 10.0% |
| 102 | Mixtral 8x7B Instruct v0.1 · 46.7B | 9.3% |
| 103 | Deepseek Llm 67B Chat · 67B | 6.4% |
| 104 | Meta Llama 3 8B Instruct · 8.0B | 6.1% |
| 105 | Yi-34B-Chat · proprietary | 5.1% |
| 106 | open-mistral-7b · proprietary | 3.7% |
| 107 | Mistral 7B Instruct v0.3 · 7.2B | 3.6% |
| 108 | Llama 2 70B Chat HF · 69.0B | 3.3% |
Score vs model size
Which models give the best score for their size, and so are the ones worth running locally? The models below are on the efficiency frontier: each has the best score among open models of its size or smaller.
- Mistral 7B Instruct v0.3, 7.2B, score 3.6%
- Llama 3.1 8B Instruct, 8.0B, score 22.9%
- Phi 4, 14.7B, score 64.9%
- DeepSeek R1 Distill Qwen 14B, 14.8B, score 87.1%
- DeepSeek R1 Distill Llama 70B, 70B, score 89.9%
- DeepSeek R1 0528, 684.5B, score 96.6%
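The frontier listed above can be reproduced from the open-model rows of the table. The following is a minimal sketch, assuming "on the efficiency frontier" means a model scores strictly higher than every open model of equal or smaller size; the `Model` type and the `efficiency_frontier` function are hypothetical names introduced for illustration, not part of any published tooling for this leaderboard.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions, as listed in the table
    score: float     # MATH Level 5 score in percent


# The 23 open models, copied from the leaderboard table.
OPEN_MODELS = [
    Model("DeepSeek R1 0528", 684.5, 96.6),
    Model("DeepSeek R1", 684.5, 93.0),
    Model("DeepSeek R1 Distill Llama 70B", 70.0, 89.9),
    Model("DeepSeek R1 Distill Qwen 14B", 14.8, 87.1),
    Model("DeepSeek v3 0324", 684.5, 75.5),
    Model("Gemma 3 27B IT", 27.4, 74.0),
    Model("Qwen3 235B A22B", 235.1, 68.9),
    Model("Phi 4", 14.7, 64.9),
    Model("Qwen2.5 72B Instruct", 72.7, 63.2),
    Model("Llama 4 Scout 17B 16E Instruct", 108.6, 62.3),
    Model("Qwen2.5 32B Instruct", 32.0, 56.1),
    Model("Llama 3.3 70B Instruct", 70.6, 41.6),
    Model("Llama 3.1 70B Instruct", 70.6, 36.7),
    Model("Gemma 2 27B IT", 27.2, 27.9),
    Model("Yi 1.5 34B Chat", 34.4, 25.5),
    Model("Llama 3.1 8B Instruct", 8.0, 22.9),
    Model("Meta Llama 3 70B Instruct", 70.6, 22.6),
    Model("Gemma 2 9B IT", 9.2, 21.0),
    Model("Mixtral 8x7B Instruct v0.1", 46.7, 9.3),
    Model("Deepseek Llm 67B Chat", 67.0, 6.4),
    Model("Meta Llama 3 8B Instruct", 8.0, 6.1),
    Model("Mistral 7B Instruct v0.3", 7.2, 3.6),
    Model("Llama 2 70B Chat HF", 69.0, 3.3),
]


def efficiency_frontier(models):
    """Return the models that outscore every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


for m in efficiency_frontier(OPEN_MODELS):
    print(f"{m.name}, {m.params_b}B, score {m.score}%")
```

Under that assumed definition, the sweep keeps the same six models the list names. Sorting ties by descending score matters for the three 684.5B DeepSeek entries and the two 8.0B Llama entries, where only the highest scorer at each size can be on the frontier.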
MATH Level 5: frequently asked questions
- What is the best open LLM on MATH Level 5?
- DeepSeek R1 0528 is the top open model on MATH Level 5, scoring 96.6%. Among all models tested — including proprietary ones — it ranks #9.
- What's the best MATH Level 5 model you can run on a 24 GB GPU?
- DeepSeek R1 Distill Qwen 14B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 87.1% on MATH Level 5.
- What's the best MATH Level 5 model you can run on a 12 GB GPU?
- DeepSeek R1 Distill Qwen 14B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 87.1% on MATH Level 5.
- Can open models match proprietary models on MATH Level 5?
- Not quite: the strongest proprietary model (gpt-5-2025-08-07_high) scores 98.1%, 1.5 points ahead of the best open model (DeepSeek R1 0528) at 96.6%. The open model, however, is one you can run yourself.
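The GPU answers above rest on a simple size estimate: at 4-bit quantization each parameter takes half a byte, so a 14.8B model needs roughly 7.4 GB for weights, consistent with the "about 8 GB" figure once some runtime overhead is added. The sketch below illustrates that arithmetic. The 1 GB overhead allowance, the function names, and the short candidate list are assumptions for illustration; real memory use also depends on context length, KV cache, and the quantization format.

```python
def weight_memory_gb(params_billions, bits_per_weight=4.0):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    # 1e9 parameters * (bits / 8) bytes each, divided by 1e9 bytes per GB.
    return params_billions * bits_per_weight / 8.0


def best_fit(models, vram_gb, bits_per_weight=4.0, overhead_gb=1.0):
    """Pick the highest-scoring model whose estimated footprint fits in vram_gb.

    overhead_gb is an assumed flat allowance for activations and cache,
    not a measured value.
    """
    fitting = [
        m for m in models
        if weight_memory_gb(m[1], bits_per_weight) + overhead_gb <= vram_gb
    ]
    return max(fitting, key=lambda m: m[2], default=None)


# (name, parameters in billions, MATH Level 5 score), copied from the table.
CANDIDATES = [
    ("DeepSeek R1 Distill Llama 70B", 70.0, 89.9),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 87.1),
    ("Gemma 3 27B IT", 27.4, 74.0),
    ("Phi 4", 14.7, 64.9),
    ("Llama 3.1 8B Instruct", 8.0, 22.9),
]

for vram in (12, 24):
    pick = best_fit(CANDIDATES, vram)
    print(f"{vram} GB: {pick[0]} (~{weight_memory_gb(pick[1]):.1f} GB of weights)")
```

With these assumptions the 70B distill needs about 35 GB of weights and fits neither card, which is why the 14.8B distill is the answer for both 12 GB and 24 GB.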
Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.