AIME 2024/2025 Leaderboard
AIME (American Invitational Mathematics Examination) is a prestigious high-school competition of hard, integer-answer problems. It's a widely cited yardstick for multi-step mathematical reasoning; the scores here are on the 2024 and 2025 papers.
Source: epoch · 22 open models ranked + 119 proprietary · Data through May 2026
Open models ranked on AIME 2024/2025
The # column shows rank among open models / rank overall (including proprietary models).
| # | Model | Score |
|---|---|---|
| 1 / 14 | GLM 5.1 · 753.9B | 92.2% |
| 2 / 28 | Qwen3 235B A22B Thinking 2507 · 235.1B | 86.7% |
| 3 / 38 | GLM 4.7 · 358.3B | 83.3% |
| 4 / 43 | GLM 5 · 753.9B | 80.0% |
| 5 / 60 | DeepSeek R1 0528 · 684.5B | 66.4% |
| 6 / 72 | DeepSeek R1 · 684.5B | 53.3% |
| 7 / 73 | DeepSeek R1 Distill Llama 70B · 70B | 51.4% |
| 8 / 82 | DeepSeek v3 0324 · 684.5B | 37.8% |
| 9 / 89 | Magistral Small 2506 · 23.6B | 30.0% |
| 10 / 95 | Gemma 3 27B IT · 27.4B | 19.7% |
| 11 / 100 | Phi 4 · 14.7B | 13.8% |
| 12 / 105 | Qwen2.5 72B Instruct · 72.7B | 8.1% |
| 13 / 106 | Llama 4 Scout 17B 16E Instruct · 108.6B | 7.8% |
| 14 / 108 | Qwen2.5 32B Instruct · 32B | 7.4% |
| 15 / 119 | Llama 3.3 70B Instruct · 70.6B | 5.1% |
| 16 / 124 | Meta Llama 3 70B Instruct · 70.6B | 4.3% |
| 17 / 126 | Llama 3.1 70B Instruct · 70.6B | 3.6% |
| 18 / 131 | Llama 3.1 8B Instruct · 8.0B | 2.5% |
| 19 / 135 | Gemma 2 27B IT · 27.2B | 1.4% |
| 20 / 138 | Meta Llama 3 8B Instruct · 8.0B | 0.8% |
| 21 / 139 | Gemma 2 9B IT · 9.2B | 0.6% |
| 22 / 141 | Llama 2 70B Chat HF · 69.0B | 0.0% |
Score vs model size
These models give the most quality for their size, which makes them the ones worth running locally.
- Llama 3.1 8B Instruct, 8B, score 2.5% — on the efficiency frontier (best score at its size or smaller).
- Phi 4, 15B, score 13.8% — on the efficiency frontier (best score at its size or smaller).
- Magistral Small 2506, 24B, score 30.0% — on the efficiency frontier (best score at its size or smaller).
- DeepSeek R1 Distill Llama 70B, 70B, score 51.4% — on the efficiency frontier (best score at its size or smaller).
- Qwen3 235B A22B Thinking 2507, 235B, score 86.7% — on the efficiency frontier (best score at its size or smaller).
- GLM 5.1, 754B, score 92.2% — on the efficiency frontier (best score at its size or smaller).
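The frontier rule quoted above ("best score at its size or smaller") can be checked with a short sketch. The sizes and scores are copied from the leaderboard table; the function name `efficiency_frontier` and the tie handling (a model must strictly beat every smaller or equal-sized model) are assumptions made for illustration, not the site's documented method.

```python
# (name, size in billions of parameters, score in %), copied from the table.
MODELS = [
    ("GLM 5.1", 753.9, 92.2),
    ("Qwen3 235B A22B Thinking 2507", 235.1, 86.7),
    ("GLM 4.7", 358.3, 83.3),
    ("GLM 5", 753.9, 80.0),
    ("DeepSeek R1 0528", 684.5, 66.4),
    ("DeepSeek R1", 684.5, 53.3),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 51.4),
    ("DeepSeek v3 0324", 684.5, 37.8),
    ("Magistral Small 2506", 23.6, 30.0),
    ("Gemma 3 27B IT", 27.4, 19.7),
    ("Phi 4", 14.7, 13.8),
    ("Qwen2.5 72B Instruct", 72.7, 8.1),
    ("Llama 4 Scout 17B 16E Instruct", 108.6, 7.8),
    ("Qwen2.5 32B Instruct", 32.0, 7.4),
    ("Llama 3.3 70B Instruct", 70.6, 5.1),
    ("Meta Llama 3 70B Instruct", 70.6, 4.3),
    ("Llama 3.1 70B Instruct", 70.6, 3.6),
    ("Llama 3.1 8B Instruct", 8.0, 2.5),
    ("Gemma 2 27B IT", 27.2, 1.4),
    ("Meta Llama 3 8B Instruct", 8.0, 0.8),
    ("Gemma 2 9B IT", 9.2, 0.6),
    ("Llama 2 70B Chat HF", 69.0, 0.0),
]


def efficiency_frontier(models):
    """Return the models that outscore every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:  # assumption: ties do not join the frontier
            frontier.append((name, size, score))
            best_so_far = score
    return frontier


for name, size, score in efficiency_frontier(MODELS):
    print(f"{name}: {size}B, {score}%")
```

Run on the table's data, this sketch selects the same six models listed above, from Llama 3.1 8B Instruct up to GLM 5.1.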
AIME 2024/2025: frequently asked questions
- What is the best open LLM on AIME 2024/2025?
- GLM 5.1 is the top open model on AIME 2024/2025, scoring 92.2%. Among all models tested — including proprietary ones — it ranks #14.
- What's the best AIME 2024/2025 model you can run on a 24 GB GPU?
- Magistral Small 2506 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 13 GB), scoring 30.0% on AIME 2024/2025.
- What's the best AIME 2024/2025 model you can run on a 12 GB GPU?
- Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 13.8% on AIME 2024/2025.
- Can open models match proprietary models on AIME 2024/2025?
- Not quite: the strongest proprietary model (gpt-5.5-pro-pre-release_xhigh) scores 100.0%, 7.8 points ahead of the best open model (GLM 5.1) at 92.2%. The open model, however, is one you can run yourself.
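The 24 GB and 12 GB answers above rest on a memory estimate at 4-bit quantization. A rough sketch of that arithmetic follows. The page does not say how its "about 13 GB" and "about 8 GB" figures were computed, so the 10% overhead factor here is an assumption, chosen only because it roughly reproduces those two numbers. The helper names `approx_quantized_gb` and `best_fit` are hypothetical, and the estimate ignores KV cache and context length.

```python
# (name, size in billions of parameters, score in %), a subset of the table.
SMALL_MODELS = [
    ("DeepSeek R1 Distill Llama 70B", 70.0, 51.4),
    ("Magistral Small 2506", 23.6, 30.0),
    ("Gemma 3 27B IT", 27.4, 19.7),
    ("Phi 4", 14.7, 13.8),
    ("Qwen2.5 32B Instruct", 32.0, 7.4),
    ("Llama 3.1 8B Instruct", 8.0, 2.5),
    ("Gemma 2 27B IT", 27.2, 1.4),
    ("Meta Llama 3 8B Instruct", 8.0, 0.8),
    ("Gemma 2 9B IT", 9.2, 0.6),
]


def approx_quantized_gb(params_billions, bits=4, overhead=1.10):
    """Weights-only footprint in GB: params * bits/8 bytes, times an assumed overhead."""
    return params_billions * bits / 8 * overhead


def best_fit(models, vram_gb):
    """Highest-scoring model whose estimated footprint fits in vram_gb, or None."""
    fitting = [m for m in models if approx_quantized_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


for vram in (24, 12):
    name, size, score = best_fit(SMALL_MODELS, vram)
    print(f"{vram} GB: {name} (~{approx_quantized_gb(size):.0f} GB, {score}%)")
```

Under these assumptions the sketch picks Magistral Small 2506 for 24 GB and Phi 4 for 12 GB, matching the answers above. A real deployment needs extra headroom for the KV cache and runtime buffers.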
Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the about benchmarks page for what it measures.