What is the best open LLM on AIME 2024/2025?

DeepSeek V4 Pro is the top open model on AIME 2024/2025, scoring 96.7%. Among all models tested — including proprietary ones — it ranks #11. The top model overall is GPT 5.5 Pro Pre Release (xhigh) (OpenAI) at 100.0%.

What's the best AIME 2024/2025 model you can run on a 24 GB GPU?

Magistral Small 2506 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 13 GB), scoring 30.0% on AIME 2024/2025.

What's the best AIME 2024/2025 model you can run on a 12 GB GPU?

Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 13.8% on AIME 2024/2025.
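
The "fits in N GB" figures above can be approximated with a back-of-the-envelope estimate. This is a minimal sketch, assuming 4 bits (0.5 bytes) per parameter plus a flat 10% overhead for quantization scales and runtime buffers. The overhead factor is an assumption picked because it roughly reproduces the quoted 13 GB and 8 GB; it is not a formula published by the leaderboard. The names `est_gb` and `best_fit` are hypothetical, and the rows are the smaller models copied from the table.

```python
# Rough 4-bit footprint estimate. OVERHEAD is an assumed fudge factor,
# not a figure published by the leaderboard.
BITS_PER_PARAM = 4
OVERHEAD = 1.10

def est_gb(params_b: float) -> float:
    """Approximate memory in GB for a model with `params_b` billion parameters."""
    return params_b * BITS_PER_PARAM / 8 * OVERHEAD

# (model, parameters in billions, AIME 2024/2025 score in %), from the table.
MODELS = [
    ("Magistral Small 2506", 23.6, 30.0),
    ("Gemma 3 27B IT", 27.4, 19.7),
    ("Phi 4", 14.7, 13.8),
    ("Qwen2.5 32B Instruct", 32.8, 7.4),
    ("Llama 3.1 8B Instruct", 8.0, 2.5),
    ("Gemma 2 27B IT", 27.2, 1.4),
    ("Meta Llama 3 8B Instruct", 8.0, 0.8),
    ("Gemma 2 9B IT", 9.2, 0.6),
]

def best_fit(vram_gb: float):
    """Highest-scoring listed model whose estimate fits in `vram_gb`, or None."""
    fitting = [m for m in MODELS if est_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)

for budget in (24, 12):
    name, params_b, score = best_fit(budget)
    print(f"{budget} GB: {name} (~{est_gb(params_b):.0f} GB, {score}%)")
```

The estimate covers weights only. Long contexts add KV-cache memory on top, so a model that barely fits by this measure may still run out of memory in practice.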

Can open models match proprietary models on AIME 2024/2025?

Not quite: the strongest proprietary model, GPT 5.5 Pro Pre Release (xhigh), scores 100.0%, ahead of the best open model, DeepSeek V4 Pro, at 96.7%. The open model, however, is one you can run yourself.

AIME 2024/2025 Leaderboard

Name: AIME 2024/2025 — open LLM scores
Creator: epoch

AIME (American Invitational Mathematics Examination) is a prestigious high-school competition of hard problems whose answers are integers from 000 to 999. It is a widely cited yardstick for multi-step mathematical reasoning; the scores here are on the 2024 and 2025 papers.

Source: epoch · 34 open models ranked + 121 proprietary · Data through Jul 2026

Open models ranked on AIME 2024/2025

The # column shows each model's rank among open models / rank overall (including proprietary models).

# (open / overall)	Model · Parameters	Score
1 / 11	DeepSeek V4 Pro · 861.6B	96.7%
2 / 12	Kimi K2.7 Code · 1058.6B	96.4%
3 / 14	Kimi K2.6 · 1058.6B	96.1%
4 / 26	GLM 5.1 · 753.9B	92.2%
5 / 28	Kimi K2.5 · 1058.6B	92.2%
6 / 34	GPT OSS 120B · 120.4B	88.9%
7 / 41	Qwen3 235B A22B Thinking 2507 · 235.1B	86.7%
8 / 52	GLM 4.7 · 358.3B	83.3%
9 / 53	Kimi K2 Thinking · 1058.1B	83.1%
10 / 57	GLM 5 · 753.9B	80.0%
11 / 74	DeepSeek R1 0528 · 684.5B	66.4%
12 / 86	DeepSeek R1 · 684.5B	53.3%
13 / 87	DeepSeek R1 Distill Llama 70B · 70.6B	51.4%
14 / 96	DeepSeek v3 0324 · 684.5B	37.8%
15 / 103	Magistral Small 2506 · 23.6B	30.0%
16 / 108	Llama 4 Maverick 17B 128E Instruct · 401.6B	20.6%
17 / 109	Gemma 3 27B IT · 27.4B	19.7%
18 / 113	DeepSeek v3 · 684.5B	15.8%
19 / 114	Phi 4 · 14.7B	13.8%
20 / 116	Llama 3.1 405B Instruct · 405.9B	9.7%
21 / 119	Qwen2.5 72B Instruct · 72.7B	8.1%
22 / 120	Llama 4 Scout 17B 16E Instruct · 108.6B	7.8%
23 / 122	Qwen2.5 32B Instruct · 32.8B	7.4%
24 / 133	Llama 3.3 70B Instruct · 70.6B	5.1%
25 / 136	Llama 3.1 Tulu 3 70B DPO · 70.6B	4.4%
26 / 138	Meta Llama 3 70B Instruct · 70.6B	4.3%
27 / 140	Llama 3.1 70B Instruct · 70.6B	3.6%
28 / 141	Llama 3.2 90B Vision Instruct · 88.6B	2.6%
29 / 144	Hermes 2 Theta Llama 3 70B · 70.6B	2.5%
30 / 145	Llama 3.1 8B Instruct · 8.0B	2.5%
31 / 149	Gemma 2 27B IT · 27.2B	1.4%
32 / 152	Meta Llama 3 8B Instruct · 8.0B	0.8%
33 / 153	Gemma 2 9B IT · 9.2B	0.6%
34 / 155	Llama 2 70B Chat HF · 69.0B	0.0%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
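
The frontier described above can be recomputed from the table: sort the models by size and keep each one that beats every smaller model. This is a minimal sketch over a subset of rows copied from the leaderboard. The function name `efficiency_frontier` is hypothetical, and since the chart's tie handling is not stated, the strict `>` comparison is an assumption.

```python
def efficiency_frontier(models):
    """Return the models that outscore every model of equal or smaller size.

    `models` is an iterable of (name, params_b, score) tuples.
    """
    frontier = []
    best = float("-inf")
    # Sort by size ascending; among equal sizes, visit the higher score first.
    for name, params_b, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best:
            frontier.append((name, params_b, score))
            best = score
    return frontier

# (model, parameters in billions, AIME 2024/2025 score in %), from the table.
ROWS = [
    ("DeepSeek V4 Pro", 861.6, 96.7),
    ("Kimi K2.7 Code", 1058.6, 96.4),
    ("GLM 5.1", 753.9, 92.2),
    ("GPT OSS 120B", 120.4, 88.9),
    ("Qwen3 235B A22B Thinking 2507", 235.1, 86.7),
    ("DeepSeek R1 Distill Llama 70B", 70.6, 51.4),
    ("Magistral Small 2506", 23.6, 30.0),
    ("Gemma 3 27B IT", 27.4, 19.7),
    ("Phi 4", 14.7, 13.8),
    ("Llama 3.1 8B Instruct", 8.0, 2.5),
    ("Gemma 2 9B IT", 9.2, 0.6),
]

for name, params_b, score in efficiency_frontier(ROWS):
    print(f"{params_b:>7.1f}B  {score:>5.1f}%  {name}")
```

On these rows the frontier runs Llama 3.1 8B Instruct, Phi 4, Magistral Small 2506, DeepSeek R1 Distill Llama 70B, GPT OSS 120B, GLM 5.1, DeepSeek V4 Pro. Running the same function over all 34 rows of the table gives the same seven models.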

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.