
AIME 2024/2025 Leaderboard

AIME (American Invitational Mathematics Examination) is a prestigious high-school competition of hard problems, each with an integer answer from 0 to 999. It is a widely cited yardstick for multi-step mathematical reasoning; the scores here come from the 2024 and 2025 papers.

Source: epoch · 22 open models ranked + 119 proprietary · Data through May 2026

Open models ranked on AIME 2024/2025

The # column shows rank among open models / rank overall (including proprietary models).

#          Model · size                                   Score
1 / 14     GLM 5.1 · 753.9B                               92.2%
2 / 28     Qwen3 235B A22B Thinking 2507 · 235.1B         86.7%
3 / 38     GLM 4.7 · 358.3B                               83.3%
4 / 43     GLM 5 · 753.9B                                 80.0%
5 / 60     DeepSeek R1 0528 · 684.5B                      66.4%
6 / 72     DeepSeek R1 · 684.5B                           53.3%
7 / 73     DeepSeek R1 Distill Llama 70B · 70B            51.4%
8 / 82     DeepSeek v3 0324 · 684.5B                      37.8%
9 / 89     Magistral Small 2506 · 23.6B                   30.0%
10 / 95    Gemma 3 27B IT · 27.4B                         19.7%
11 / 100   Phi 4 · 14.7B                                  13.8%
12 / 105   Qwen2.5 72B Instruct · 72.7B                    8.1%
13 / 106   Llama 4 Scout 17B 16E Instruct · 108.6B         7.8%
14 / 108   Qwen2.5 32B Instruct · 32B                      7.4%
15 / 119   Llama 3.3 70B Instruct · 70.6B                  5.1%
16 / 124   Meta Llama 3 70B Instruct · 70.6B               4.3%
17 / 126   Llama 3.1 70B Instruct · 70.6B                  3.6%
18 / 131   Llama 3.1 8B Instruct · 8.0B                    2.5%
19 / 135   Gemma 2 27B IT · 27.2B                          1.4%
20 / 138   Meta Llama 3 8B Instruct · 8.0B                 0.8%
21 / 139   Gemma 2 9B IT · 9.2B                            0.6%
22 / 141   Llama 2 70B Chat HF · 69.0B                     0.0%

Score vs model size

This chart shows which models deliver the highest score for their size: the ones worth running locally.

[Scatter plot: AIME 2024/2025 score (vertical axis, 0.0% to 92.2%) against model size (horizontal axis, log scale), one dot per open model; the models on the efficiency frontier are labeled.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Llama 3.1 8B Instruct, 8B, score 2.5% — on the efficiency frontier (best score at its size or smaller).
  • Phi 4, 15B, score 13.8% — on the efficiency frontier (best score at its size or smaller).
  • Magistral Small 2506, 24B, score 30.0% — on the efficiency frontier (best score at its size or smaller).
  • DeepSeek R1 Distill Llama 70B, 70B, score 51.4% — on the efficiency frontier (best score at its size or smaller).
  • Qwen3 235B A22B Thinking 2507, 235B, score 86.7% — on the efficiency frontier (best score at its size or smaller).
  • GLM 5.1, 754B, score 92.2% — on the efficiency frontier (best score at its size or smaller).
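
The frontier described above can be computed with a single pass over the models sorted by size. The following is a minimal sketch, not the site's actual method: the function name `efficiency_frontier` is hypothetical, the data is copied from the leaderboard table, and the tie-breaking rule for models of equal size (the higher score wins) is an assumption.

```python
# (name, size in billions of parameters, AIME 2024/2025 score in %),
# copied from the leaderboard table.
MODELS = [
    ("GLM 5.1", 753.9, 92.2),
    ("Qwen3 235B A22B Thinking 2507", 235.1, 86.7),
    ("GLM 4.7", 358.3, 83.3),
    ("GLM 5", 753.9, 80.0),
    ("DeepSeek R1 0528", 684.5, 66.4),
    ("DeepSeek R1", 684.5, 53.3),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 51.4),
    ("DeepSeek v3 0324", 684.5, 37.8),
    ("Magistral Small 2506", 23.6, 30.0),
    ("Gemma 3 27B IT", 27.4, 19.7),
    ("Phi 4", 14.7, 13.8),
    ("Qwen2.5 72B Instruct", 72.7, 8.1),
    ("Llama 4 Scout 17B 16E Instruct", 108.6, 7.8),
    ("Qwen2.5 32B Instruct", 32.0, 7.4),
    ("Llama 3.3 70B Instruct", 70.6, 5.1),
    ("Meta Llama 3 70B Instruct", 70.6, 4.3),
    ("Llama 3.1 70B Instruct", 70.6, 3.6),
    ("Llama 3.1 8B Instruct", 8.0, 2.5),
    ("Gemma 2 27B IT", 27.2, 1.4),
    ("Meta Llama 3 8B Instruct", 8.0, 0.8),
    ("Gemma 2 9B IT", 9.2, 0.6),
    ("Llama 2 70B Chat HF", 69.0, 0.0),
]


def efficiency_frontier(models):
    """Return the models that score higher than every model of equal or smaller size."""
    # Sort by size ascending; at equal size, put the higher score first
    # (assumed tie-break) so only the better model can join the frontier.
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier = []
    best = float("-inf")
    for name, size, score in ordered:
        if score > best:  # strictly better than anything at this size or smaller
            frontier.append((name, size, score))
            best = score
    return frontier


for name, size, score in efficiency_frontier(MODELS):
    print(f"{name}: {size}B, {score}%")
```

On the table's data this returns the same six models listed above, from Llama 3.1 8B Instruct up to GLM 5.1.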

AIME 2024/2025: frequently asked questions

What is the best open LLM on AIME 2024/2025?
GLM 5.1 is the top open model on AIME 2024/2025, scoring 92.2%. Among all models tested — including proprietary ones — it ranks #14.
What's the best AIME 2024/2025 model you can run on a 24 GB GPU?
Magistral Small 2506 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 13 GB), scoring 30.0% on AIME 2024/2025.
What's the best AIME 2024/2025 model you can run on a 12 GB GPU?
Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 13.8% on AIME 2024/2025.
Can open models match proprietary models on AIME 2024/2025?
Not quite: the strongest proprietary model (gpt-5.5-pro-pre-release_xhigh) scores 100.0%, ahead of the best open model (GLM 5.1) at 92.2%. The open model, however, is one you can run yourself.
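
The GPU answers above rest on a size estimate at 4-bit quantization. A rough sketch of that arithmetic follows, under stated assumptions: the source does not say how its "about 13 GB" and "about 8 GB" figures are computed, so the 10% overhead factor and the helper names `estimated_gb` and `best_fit` are hypothetical, chosen only because they roughly reproduce those figures. The model list is a subset of the leaderboard table. Real deployments also need memory for context, so treat the result as a lower bound.

```python
# (name, size in billions of parameters, AIME 2024/2025 score in %),
# a subset of the leaderboard table.
MODELS = [
    ("DeepSeek R1 Distill Llama 70B", 70.0, 51.4),
    ("Magistral Small 2506", 23.6, 30.0),
    ("Gemma 3 27B IT", 27.4, 19.7),
    ("Phi 4", 14.7, 13.8),
    ("Qwen2.5 32B Instruct", 32.0, 7.4),
    ("Llama 3.1 8B Instruct", 8.0, 2.5),
    ("Gemma 2 27B IT", 27.2, 1.4),
    ("Meta Llama 3 8B Instruct", 8.0, 0.8),
    ("Gemma 2 9B IT", 9.2, 0.6),
]

OVERHEAD = 1.10  # assumption, not from the source


def estimated_gb(params_billions, bits=4, overhead=OVERHEAD):
    """Weights at `bits` per parameter, in GB, scaled by the assumed overhead."""
    return params_billions * bits / 8 * overhead


def best_fit(models, vram_gb, bits=4):
    """Highest-scoring model whose estimated footprint fits in `vram_gb`, or None."""
    fitting = [m for m in models if estimated_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


for budget in (24, 12):
    name, size, score = best_fit(MODELS, budget)
    print(f"{budget} GB: {name} (~{estimated_gb(size):.0f} GB), {score}%")
```

Under these assumptions the selection agrees with the answers above: Magistral Small 2506 for 24 GB and Phi 4 for 12 GB.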

Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the "about benchmarks" page for what it measures.