Math

AIME 2024/2025 Leaderboard

AIME (American Invitational Mathematics Examination) is a prestigious high-school competition of hard problems whose answers are integers from 0 to 999. It is a widely cited yardstick for multi-step mathematical reasoning; the scores here are on the 2024 and 2025 papers.

Source: epoch · 22 open models ranked + 119 proprietary · Data through May 2026

All models ranked on AIME 2024/2025

Proprietary / closed models are shown dimmed — you can't run them locally, but they show where the open field stands.

# | Model | Score
1 | gpt-5.5-pre-release_xhigh · proprietary | 100.0%
2 | gpt-5.5-pro-pre-release_xhigh · proprietary | 100.0%
3 | claude-opus-4-7_xhigh · proprietary | 97.8%
4 | gpt-5.2-2025-12-11_high · proprietary | 96.1%
5 | kimi-k2.6 · proprietary | 96.1%
6 | gpt-5.2-2025-12-11_xhigh · proprietary | 96.1%
7 | gemini-3.1-pro-preview · proprietary | 95.6%
8 | gemini-3.5-flash_high · proprietary | 95.6%
9 | gpt-5.4-2026-03-05_xhigh · proprietary | 95.3%
10 | claude-opus-4-6_64K · proprietary | 94.4%
11 | gpt-5.2-2025-12-11_medium · proprietary | 93.9%
12 | claude-opus-4-6_32K · proprietary | 93.1%
13 | gemini-3-flash-preview · proprietary | 92.8%
14 | GLM 5.1 · 753.9B | 92.2%
15 | fireworks/kimi-k2p5 · proprietary | 92.2%
16 | gemini-3-pro-preview · proprietary | 91.4%
17 | gpt-5-2025-08-07_high · proprietary | 91.4%
18 | qwen3.6-max-preview · proprietary | 91.1%
19 | qwen3.6-plus · proprietary | 90.6%
20 | muse-spark · proprietary | 88.9%
21 | openai/gpt-oss-120b_high · proprietary | 88.9%
22 | gpt-5.1-2025-11-13_high · proprietary | 88.6%
23 | deepseek-reasoner · proprietary | 87.8%
24 | gpt-5.4-nano-2026-03-17_high · proprietary | 87.8%
25 | gpt-5-2025-08-07_medium · proprietary | 87.2%
26 | gpt-5.4-mini-2026-03-17_high · proprietary | 87.2%
27 | gpt-5-mini-2025-08-07_high · proprietary | 86.7%
28 | Qwen3 235B A22B Thinking 2507 · 235.1B | 86.7%
29 | claude-opus-4-5-20251101_32K · proprietary | 86.1%
30 | qwen3.6-flash · proprietary | 86.1%
31 | claude-sonnet-4-6_32K · proprietary | 85.8%
32 | gpt-5.1-2025-11-13_medium · proprietary | 85.6%
33 | qwen3.5-flash · proprietary | 85.6%
34 | qwen3.5-plus · proprietary | 85.0%
35 | gemini-2.5-pro · proprietary | 84.2%
36 | grok-4-0709 · proprietary | 84.0%
37 | o3-2025-04-16_high · proprietary | 83.9%
38 | GLM 4.7 · 358.3B | 83.3%
39 | kimi-k2-thinking-turbo · proprietary | 83.1%
40 | claude-opus-4-5-20251101_16K · proprietary | 81.7%
41 | o4-mini-2025-04-16_high · proprietary | 81.7%
42 | gpt-5-nano-2025-08-07_high · proprietary | 81.1%
43 | GLM 5 · 753.9B | 80.0%
44 | gpt-5.2-2025-12-11_low · proprietary | 78.9%
45 | gpt-5-mini-2025-08-07_medium · proprietary | 78.3%
46 | claude-sonnet-4-5-20250929_32K · proprietary | 77.8%
47 | claude-sonnet-4-5-20250929_59K · proprietary | 77.8%
48 | grok-3-mini-beta_high · proprietary | 77.8%
49 | o3-mini-2025-01-31_high · proprietary | 76.9%
50 | gpt-5-nano-2025-08-07_medium · proprietary | 74.2%
51 | o1-2024-12-17_medium · proprietary | 73.3%
52 | qwen3-max-2025-09-23 · proprietary | 73.3%
53 | gemini-2.5-flash-preview-04-17 · proprietary | 73.1%
54 | claude-sonnet-4-20250514_32K · proprietary | 71.1%
55 | claude-sonnet-4-5-20250929_16K · proprietary | 71.1%
56 | gemini-2.5-flash-preview-05-20 · proprietary | 70.8%
57 | claude-opus-4-1-20250805_27K · proprietary | 68.9%
58 | claude-sonnet-4-20250514_59K · proprietary | 68.9%
59 | claude-haiku-4-5-20251001_32K · proprietary | 66.7%
60 | DeepSeek R1 0528 · 684.5B | 66.4%
61 | claude-opus-4-1-20250805_16K · proprietary | 64.4%
62 | claude-opus-4-20250514_27K · proprietary | 64.4%
63 | gpt-5.1-2025-11-13_low · proprietary | 63.9%
64 | o3-mini-2025-01-31_medium · proprietary | 63.9%
65 | grok-3-mini-beta_low · proprietary | 62.2%
66 | claude-opus-4-20250514_16K · proprietary | 60.0%
67 | claude-3-7-sonnet-20250219_64K · proprietary | 57.8%
68 | gemini-2.0-flash-thinking-exp-01-21 · proprietary | 57.8%
69 | grok-3-beta · proprietary | 55.6%
70 | claude-3-7-sonnet-20250219_32K · proprietary | 53.3%
71 | claude-sonnet-4-20250514_16K · proprietary | 53.3%
72 | DeepSeek R1 · 684.5B | 53.3%
73 | DeepSeek R1 Distill Llama 70B · 70B | 51.4%
74 | claude-opus-4-5-20251101 · proprietary | 48.1%
75 | o1-mini-2024-09-12_high · proprietary | 46.9%
76 | claude-3-7-sonnet-20250219_16K · proprietary | 46.7%
77 | gpt-4.1-mini-2025-04-14 · proprietary | 44.7%
78 | o1-mini-2024-09-12_medium · proprietary | 44.7%
79 | claude-opus-4-20250514 · proprietary | 42.2%
80 | claude-opus-4-1-20250805 · proprietary | 40.0%
81 | gpt-4.1-2025-04-14 · proprietary | 38.3%
82 | DeepSeek v3 0324 · 684.5B | 37.8%
83 | gpt-4.5-preview-2025-02-27 · proprietary | 37.8%
84 | claude-haiku-4-5-20251001 · proprietary | 35.8%
85 | claude-sonnet-4-5-20250929 · proprietary | 35.6%
86 | mistral-medium-2505 · proprietary | 32.2%
87 | gemini-2.0-flash-001 · proprietary | 31.1%
88 | o1-preview-2024-09-12 · proprietary | 31.1%
89 | Magistral Small 2506 · 23.6B | 30.0%
90 | claude-sonnet-4-20250514 · proprietary | 28.9%
91 | gpt-4.1-nano-2025-04-14 · proprietary | 28.9%
92 | gemini-1.5-pro-002 · proprietary | 23.1%
93 | claude-3-7-sonnet-20250219 · proprietary | 21.9%
94 | Llama-4-Maverick-17B-128E-Instruct-FP8 · proprietary | 20.6%
95 | Gemma 3 27B IT · 27.4B | 19.7%
96 | qwen-plus-2025-01-25 · proprietary | 17.8%
97 | gemini-1.5-flash-002 · proprietary | 16.3%
98 | qwen-max-2025-01-25 · proprietary | 16.1%
99 | DeepSeek-V3 · proprietary | 15.8%
100 | Phi 4 · 14.7B | 13.8%
101 | grok-2-1212 · proprietary | 11.5%
102 | Llama-3.1-405B-Instruct · proprietary | 9.7%
103 | claude-3-5-sonnet-20241022 · proprietary | 8.5%
104 | mistral-large-2407 · proprietary | 8.5%
105 | Qwen2.5 72B Instruct · 72.7B | 8.1%
106 | Llama 4 Scout 17B 16E Instruct · 108.6B | 7.8%
107 | mistral-large-2411 · proprietary | 7.8%
108 | Qwen2.5 32B Instruct · 32B | 7.4%
109 | gpt-4o-mini-2024-07-18 · proprietary | 6.9%
110 | gemini-1.5-pro-001 · proprietary | 6.8%
111 | gpt-4-turbo-2024-04-09 · proprietary | 6.7%
112 | claude-3-5-sonnet-20240620 · proprietary | 6.5%
113 | gpt-4o-2024-08-06 · proprietary | 6.4%
114 | gpt-4o-2024-05-13 · proprietary | 6.3%
115 | gpt-4o-2024-11-20 · proprietary | 6.3%
116 | qwen-turbo-2024-11-01 · proprietary | 6.1%
117 | mistral-small-2503 · proprietary | 5.8%
118 | mistral-small-2501 · proprietary | 5.3%
119 | Llama 3.3 70B Instruct · 70.6B | 5.1%
120 | claude-3-opus-20240229 · proprietary | 4.7%
121 | gemini-1.5-flash-8b-001 · proprietary | 4.6%
122 | Llama-3.1-Tulu-3-70B-DPO · proprietary | 4.4%
123 | claude-3-5-haiku-20241022 · proprietary | 4.3%
124 | Meta Llama 3 70B Instruct · 70.6B | 4.3%
125 | gemini-1.5-flash-001 · proprietary | 3.9%
126 | Llama 3.1 70B Instruct · 70.6B | 3.6%
127 | Llama-3.2-90B-Vision-Instruct · proprietary | 2.6%
128 | claude-2.0 · proprietary | 2.5%
129 | claude-3-sonnet-20240229 · proprietary | 2.5%
130 | Hermes-2-Theta-Llama-3-70B · proprietary | 2.5%
131 | Llama 3.1 8B Instruct · 8.0B | 2.5%
132 | claude-2.1 · proprietary | 1.9%
133 | mistral-large-2402 · proprietary | 1.9%
134 | claude-3-haiku-20240307 · proprietary | 1.8%
135 | Gemma 2 27B IT · 27.2B | 1.4%
136 | gemini-1.0-pro-001 · proprietary | 1.1%
137 | gpt-4-0613 · proprietary | 1.1%
138 | Meta Llama 3 8B Instruct · 8.0B | 0.8%
139 | Gemma 2 9B IT · 9.2B | 0.6%
140 | gpt-4-0314 · proprietary | 0.6%
141 | Llama 2 70B Chat HF · 69.0B | 0.0%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: AIME 2024/2025 score (vertical axis, 0.0% to 92.2%) against model size (horizontal axis, log scale, ticks at 10B and 100B) for the 22 open models.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Llama 3.1 8B Instruct, 8B, score 2.5% — on the efficiency frontier (best score at its size or smaller).
  • Phi 4, 15B, score 13.8% — on the efficiency frontier (best score at its size or smaller).
  • Magistral Small 2506, 24B, score 30.0% — on the efficiency frontier (best score at its size or smaller).
  • DeepSeek R1 Distill Llama 70B, 70B, score 51.4% — on the efficiency frontier (best score at its size or smaller).
  • Qwen3 235B A22B Thinking 2507, 235B, score 86.7% — on the efficiency frontier (best score at its size or smaller).
  • GLM 5.1, 754B, score 92.2% — on the efficiency frontier (best score at its size or smaller).
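
The frontier rule stated above ("the best score you can get at each size or smaller") can be sketched as a single sweep over the open models sorted by size. This is a minimal illustration, not the site's actual code: the function name `efficiency_frontier` and the tuple layout are assumptions, and the sizes and scores are copied from the leaderboard on this page.

```python
# (name, parameters in billions, AIME 2024/2025 score in %), from the leaderboard.
OPEN_MODELS = [
    ("GLM 5.1", 753.9, 92.2),
    ("Qwen3 235B A22B Thinking 2507", 235.1, 86.7),
    ("GLM 4.7", 358.3, 83.3),
    ("GLM 5", 753.9, 80.0),
    ("DeepSeek R1 0528", 684.5, 66.4),
    ("DeepSeek R1", 684.5, 53.3),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 51.4),
    ("DeepSeek v3 0324", 684.5, 37.8),
    ("Magistral Small 2506", 23.6, 30.0),
    ("Gemma 3 27B IT", 27.4, 19.7),
    ("Phi 4", 14.7, 13.8),
    ("Qwen2.5 72B Instruct", 72.7, 8.1),
    ("Llama 4 Scout 17B 16E Instruct", 108.6, 7.8),
    ("Qwen2.5 32B Instruct", 32.0, 7.4),
    ("Llama 3.3 70B Instruct", 70.6, 5.1),
    ("Meta Llama 3 70B Instruct", 70.6, 4.3),
    ("Llama 3.1 70B Instruct", 70.6, 3.6),
    ("Llama 3.1 8B Instruct", 8.0, 2.5),
    ("Gemma 2 27B IT", 27.2, 1.4),
    ("Meta Llama 3 8B Instruct", 8.0, 0.8),
    ("Gemma 2 9B IT", 9.2, 0.6),
    ("Llama 2 70B Chat HF", 69.0, 0.0),
]


def efficiency_frontier(models):
    """Return the models that strictly beat every model of equal or smaller size."""
    # Sort by size ascending; within one size, put the highest score first so
    # that only the best model at that size can enter the frontier.
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier = []
    best_score = float("-inf")
    for name, size_b, score in ordered:
        if score > best_score:  # new best score at this size or smaller
            frontier.append((name, size_b, score))
            best_score = score
    return frontier


for name, size_b, score in efficiency_frontier(OPEN_MODELS):
    print(f"{name}: {size_b}B, {score}%")
```

Run on the 22 open models, the sweep selects the same six models listed above. Ties in score are treated as not improving the frontier, which is a choice made for this sketch.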

AIME 2024/2025: frequently asked questions

What is the best open LLM on AIME 2024/2025?
GLM 5.1 is the top open model on AIME 2024/2025, scoring 92.2%. Among all models tested — including proprietary ones — it ranks #14.
What's the best AIME 2024/2025 model you can run on a 24 GB GPU?
Magistral Small 2506 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 13 GB), scoring 30.0% on AIME 2024/2025.
What's the best AIME 2024/2025 model you can run on a 12 GB GPU?
Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 13.8% on AIME 2024/2025.
Can open models match proprietary models on AIME 2024/2025?
Not quite: the strongest proprietary models (gpt-5.5-pre-release_xhigh and gpt-5.5-pro-pre-release_xhigh) both score 100.0%, ahead of the best open model (GLM 5.1) at 92.2%. The open one, however, you can run yourself.
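
The 24 GB and 12 GB answers above depend on how large a model is once quantized. The page does not state its formula, so the sketch below is an assumption: it uses about 4.5 effective bits per weight (4-bit weights plus quantization metadata), a figure chosen only because it roughly reproduces the "about 13 GB" and "about 8 GB" quoted above, and it ignores KV cache and activation memory. The names `estimate_gb` and `best_for_budget` are hypothetical.

```python
# Assumption: effective bits per weight at "4-bit" quantization. Not from the source.
BITS_PER_WEIGHT = 4.5

# (name, parameters in billions, AIME 2024/2025 score in %), open models that
# are small enough to be candidates for consumer GPUs, from the leaderboard.
SMALL_OPEN_MODELS = [
    ("Magistral Small 2506", 23.6, 30.0),
    ("Gemma 3 27B IT", 27.4, 19.7),
    ("Phi 4", 14.7, 13.8),
    ("Qwen2.5 32B Instruct", 32.0, 7.4),
    ("Llama 3.1 8B Instruct", 8.0, 2.5),
    ("Gemma 2 27B IT", 27.2, 1.4),
    ("Meta Llama 3 8B Instruct", 8.0, 0.8),
    ("Gemma 2 9B IT", 9.2, 0.6),
]


def estimate_gb(params_billions, bits_per_weight=BITS_PER_WEIGHT):
    """Rough weight-only footprint in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bits per weight / 8 bits per byte = billions of bytes
    return params_billions * bits_per_weight / 8


def best_for_budget(models, budget_gb):
    """Return the highest-scoring model whose estimated weights fit the budget."""
    fitting = [m for m in models if estimate_gb(m[1]) <= budget_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda m: m[2])


for budget in (24, 12):
    name, size_b, score = best_for_budget(SMALL_OPEN_MODELS, budget)
    print(f"{budget} GB: {name} (~{estimate_gb(size_b):.0f} GB), {score}%")
```

In practice you would leave headroom below the card's capacity for context and runtime overhead, so a model that only just fits by this estimate may still not run comfortably.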

Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the about benchmarks page for what it measures.