Reasoning

ARC-AGI Leaderboard

ARC-AGI tests fluid, abstract reasoning on small visual grid puzzles where each task follows a novel rule the model must infer from a few examples. It deliberately resists memorization and is one of the most-watched measures of general reasoning progress.

Source: epoch · 7 open models ranked + 130 proprietary · Data through May 2026

All models ranked on ARC-AGI

Proprietary / closed models are shown dimmed — you can't run them locally, but they show where the open field stands.

# | Model | Score
1 | gemini-3.1-pro-preview · proprietary | 98.0%
2 | gpt-5.5-pro_high · proprietary | 96.5%
3 | gpt-5.5_xhigh · proprietary | 95.0%
4 | gpt-5.5-pro_xhigh · proprietary | 95.0%
5 | gpt-5.4-pro-2026-03-05_xhigh · proprietary | 94.5%
6 | gpt-5.5_high · proprietary | 94.5%
7 | claude-opus-4-6_120K · proprietary | 94.0%
8 | gpt-5.4-2026-03-05_xhigh · proprietary | 93.7%
9 | claude-opus-4-7_high · proprietary | 93.5%
10 | gpt-5.4-2026-03-05_high · proprietary | 92.7%
11 | gemini-3.5-flash_high · proprietary | 92.5%
12 | gpt-5.5_medium · proprietary | 92.2%
13 | claude-opus-4-7_max · proprietary | 92.0%
14 | claude-opus-4-7_low · proprietary | 91.0%
15 | gpt-5.2-pro-2025-12-11_xhigh · proprietary | 90.5%
16 | grok-4-20 · proprietary | 89.5%
17 | gemini-3-deep-think-preview · proprietary | 87.5%
18 | claude-sonnet-4-6_high · proprietary | 86.5%
19 | gpt-5.2-2025-12-11_xhigh · proprietary | 86.2%
20 | gpt-5.4-2026-03-05_medium · proprietary | 86.2%
21 | claude-sonnet-4-6_max · proprietary | 86.0%
22 | gpt-5.2-pro-2025-12-11_high · proprietary | 85.7%
23 | gpt-5.2-pro-2025-12-11_medium · proprietary | 81.2%
24 | claude-opus-4-5-20251101_64K · proprietary | 80.0%
25 | gpt-5.2-2025-12-11_high · proprietary | 78.7%
26 | gpt-5.5_low · proprietary | 76.2%
27 | claude-opus-4-5-20251101_32K · proprietary | 75.8%
28 | gemini-3-pro-preview · proprietary | 75.0%
29 | gpt-5.1-2025-11-13_high · proprietary | 72.8%
30 | gpt-5.2-2025-12-11_medium · proprietary | 72.7%
31 | claude-opus-4-5-20251101_16K · proprietary | 72.0%
32 | gpt-5-pro-2025-10-06_high · proprietary | 70.2%
33 | gpt-5-pro-2025-10-06_unknown · proprietary | 70.2%
34 | gpt-5.4-2026-03-05_low · proprietary | 68.2%
35 | grok-4-0709 · proprietary | 66.7%
36 | gpt-5-2025-08-07_high · proprietary | 65.7%
37 | kimi-k2.5 · proprietary | 65.3%
38 | claude-sonnet-4-5-20250929_32K · proprietary | 63.7%
39 | gpt-5.4-mini-2026-03-17_xhigh · proprietary | 63.7%
40 | MiniMax M2.5 · 228.7B | 63.7%
41 | o3-2025-04-16_high · proprietary | 60.8%
42 | o3-pro-2025-06-10_high · proprietary | 59.3%
43 | o4-mini-2025-04-16_high · proprietary | 58.7%
44 | claude-opus-4-5-20251101_8K · proprietary | 58.7%
45 | gpt-5.4-mini-2026-03-17_high · proprietary | 58.0%
46 | gpt-5.1-2025-11-13_medium · proprietary | 57.7%
47 | deepseek/deepseek-v3.2 · proprietary | 57.0%
48 | o3-pro-2025-06-10_medium · proprietary | 57.0%
49 | gpt-5-2025-08-07_medium · proprietary | 56.2%
50 | gpt-5.2-2025-12-11_low · proprietary | 55.7%
51 | gpt-5-mini-2025-08-07_high · proprietary | 54.3%
52 | o3-2025-04-16_medium · proprietary | 53.8%
53 | gpt-5.4-nano-2026-03-17_xhigh · proprietary | 51.5%
54 | gemini-3.5-flash_minimal · proprietary | 48.8%
55 | grok-4-fast · proprietary | 48.5%
56 | claude-sonnet-4-5-20250929_16K · proprietary | 48.3%
57 | claude-haiku-4-5-20251001_32K · proprietary | 47.7%
58 | claude-sonnet-4-5-20250929_8K · proprietary | 46.5%
59 | GLM 5 · 753.9B | 44.7%
60 | o3-pro-2025-06-10_low · proprietary | 44.3%
61 | gpt-5-2025-08-07_low · proprietary | 44.0%
62 | o4-mini-2025-04-16_medium · proprietary | 41.8%
63 | o3-2025-04-16_low · proprietary | 41.5%
64 | gemini-2.5-pro_16K · proprietary | 41.0%
65 | gpt-5.4-mini-2026-03-17_medium · proprietary | 40.8%
66 | claude-opus-4-5-20251101 · proprietary | 40.0%
67 | claude-sonnet-4-20250514_16K · proprietary | 40.0%
68 | tiny-recursion-model · proprietary | 40.0%
69 | gpt-5.4-nano-2026-03-17_high · proprietary | 38.2%
70 | claude-haiku-4-5-20251001_16K · proprietary | 37.3%
71 | gpt-5-mini-2025-08-07_medium · proprietary | 37.3%
72 | gemini-2.5-pro_32K · proprietary | 37.0%
73 | claude-opus-4-20250514_16K · proprietary | 35.7%
74 | o3-mini-2025-01-31_high · proprietary | 34.5%
75 | gemini-2.5-flash-preview-05-20 · proprietary | 33.3%
76 | gemini-2.5-flash-preview-05-20_16K · proprietary | 33.3%
77 | gpt-5.1-2025-11-13_low · proprietary | 33.2%
78 | gemini-2.5-pro-preview-03-25 · proprietary | 33.0%
79 | gpt-5.4-nano-2026-03-17_medium · proprietary | 33.0%
80 | gemini-2.5-flash-preview-05-20_23K · proprietary | 32.3%
81 | gemini-2.5-flash-preview-04-17 (24K thinking) · proprietary | 32.3%
82 | gemini-2.5-pro-preview-06-05_1K · proprietary | 31.3%
83 | claude-sonnet-4-5-20250929_1K · proprietary | 31.0%
84 | claude-opus-4-20250514_8K · proprietary | 30.7%
85 | o1-2024-12-17_medium · proprietary | 30.7%
86 | gemini-2.5-pro_8K · proprietary | 29.5%
87 | claude-sonnet-4-20250514_8K · proprietary | 29.0%
88 | claude-3-7-sonnet-20250219_16K · proprietary | 28.6%
89 | claude-sonnet-4-20250514_1K · proprietary | 28.0%
90 | codex-mini-2025-05-16 · proprietary | 27.3%
91 | o1-2024-12-17_low · proprietary | 27.2%
92 | claude-opus-4-20250514_1K · proprietary | 27.0%
93 | gpt-5-mini-2025-08-07_low · proprietary | 26.3%
94 | gemini-2.5-flash-preview-05-20_8K · proprietary | 25.8%
95 | claude-haiku-4-5-20251001_8K · proprietary | 25.5%
96 | claude-sonnet-4-5-20250929 · proprietary | 25.5%
97 | claude-sonnet-4-20250514 · proprietary | 23.8%
98 | o1-pro-2025-03-19_low · proprietary | 23.3%
99 | claude-opus-4-20250514 · proprietary | 22.5%
100 | o3-mini-2025-01-31_medium · proprietary | 22.3%
101 | gemini-3-flash-preview · proprietary | 21.5%
102 | o4-mini-2025-04-16_low · proprietary | 21.3%
103 | claude-3-7-sonnet-20250219_8K · proprietary | 21.2%
104 | DeepSeek R1 0528 · 684.5B | 21.2%
105 | gpt-5-nano-2025-08-07_medium · proprietary | 20.7%
106 | gpt-5.4-nano-2026-03-17_low · proprietary | 18.3%
107 | o1-preview-2024-09-12 · proprietary | 18.0%
108 | claude-haiku-4-5-20251001_1K · proprietary | 16.8%
109 | gpt-5-nano-2025-08-07_high · proprietary | 16.7%
110 | grok-3-mini_low · proprietary | 16.5%
111 | grok-3-mini-beta_low · proprietary | 16.5%
112 | gemini-2.5-flash-preview-05-20_1K · proprietary | 16.0%
113 | DeepSeek R1 · 684.5B | 15.8%
114 | o3-mini-2025-01-31_low · proprietary | 14.5%
115 | claude-haiku-4-5-20251001 · proprietary | 14.3%
116 | o1-mini-2024-09-12_medium · proprietary | 14.0%
117 | o1-mini-2024-09-12_unknown · proprietary | 14.0%
118 | claude-3-7-sonnet-20250219 · proprietary | 13.6%
119 | gpt-5.4-mini-2026-03-17_low · proprietary | 13.0%
120 | gpt-5.2-2025-12-11_unknown · proprietary | 12.3%
121 | claude-3-7-sonnet-20250219_1K · proprietary | 11.6%
122 | Qwen3 235B A22B Instruct 2507 · 235.1B | 11.0%
123 | gpt-4.5-preview-2025-02-27 · proprietary | 10.3%
124 | gpt-5-2025-08-07_minimal · proprietary | 6.0%
125 | magistral-medium-2506 · proprietary | 5.9%
126 | gpt-5.1-2025-11-13_none · proprietary | 5.8%
127 | gpt-4.1-2025-04-14 · proprietary | 5.5%
128 | grok-3 · proprietary | 5.5%
129 | gpt-5-mini-2025-08-07_minimal · proprietary | 5.3%
130 | Magistral Small 2506 · 23.6B | 5.0%
131 | gpt-4o-2024-11-20 · proprietary | 4.5%
132 | Llama-4-Maverick-17B-128E-Instruct · proprietary | 4.4%
133 | gpt-5-nano-2025-08-07_low · proprietary | 4.0%
134 | gpt-4.1-mini-2025-04-14 · proprietary | 3.5%
135 | gpt-5-nano-2025-08-07_minimal · proprietary | 1.5%
136 | Llama 4 Scout 17B 16E Instruct · 108.6B | 0.5%
137 | gpt-4.1-nano-2025-04-14 · proprietary | 0.0%
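
The table above gives consecutive row numbers to models with equal scores, so a row number can differ from a tie-aware rank. As a rough sketch, the snippet below computes a competition rank (1 plus the number of strictly higher scores) from the first 41 scores in the table. The function name `competition_rank` and the list `TOP_SCORES` are illustrative names introduced here, not part of the source, and this tie rule is an assumption about how ties might be handled.

```python
# Scores of the first 41 rows of the leaderboard above, in table order.
TOP_SCORES = [
    98.0, 96.5, 95.0, 95.0, 94.5, 94.5, 94.0, 93.7, 93.5, 92.7,
    92.5, 92.2, 92.0, 91.0, 90.5, 89.5, 87.5, 86.5, 86.2, 86.2,
    86.0, 85.7, 81.2, 80.0, 78.7, 76.2, 75.8, 75.0, 72.8, 72.7,
    72.0, 70.2, 70.2, 68.2, 66.7, 65.7, 65.3, 63.7, 63.7, 63.7,
    60.8,
]


def competition_rank(scores, target):
    """Return 1 plus the number of scores strictly greater than target.

    Models with the same score therefore share the same rank.
    """
    return 1 + sum(1 for s in scores if s > target)


# MiniMax M2.5 sits on row 40 with 63.7%, level with rows 38 and 39.
print(competition_rank(TOP_SCORES, 63.7))
```

Under this rule the three models at 63.7% share one rank, which is why a model's row number and its tie-aware rank can disagree.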

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: ARC-AGI score (vertical axis, 0.5% to 63.7%) against model size (horizontal axis, log scale, 24B to 754B) for the seven open models; Magistral Small 2506 and MiniMax M2.5 are labeled.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Magistral Small 2506, 24B, score 5.0% — on the efficiency frontier (best score at its size or smaller).
  • MiniMax M2.5, 229B, score 63.7% — on the efficiency frontier (best score at its size or smaller).
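
The efficiency frontier described above can be reproduced from the seven open-model rows of the leaderboard. The following is a minimal sketch, assuming the frontier keeps a model only if it scores strictly higher than every model of equal or smaller size; the names `Model`, `OPEN_MODELS`, and `efficiency_frontier` are illustrative and do not come from the source.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    size_b: float  # parameter count in billions, as listed in the table
    score: float   # ARC-AGI score in percent


# The seven open models from the leaderboard above.
OPEN_MODELS = [
    Model("MiniMax M2.5", 228.7, 63.7),
    Model("GLM 5", 753.9, 44.7),
    Model("DeepSeek R1 0528", 684.5, 21.2),
    Model("DeepSeek R1", 684.5, 15.8),
    Model("Qwen3 235B A22B Instruct 2507", 235.1, 11.0),
    Model("Magistral Small 2506", 23.6, 5.0),
    Model("Llama 4 Scout 17B 16E Instruct", 108.6, 0.5),
]


def efficiency_frontier(models):
    """Return models that beat every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; at equal size, try the higher score first.
    for m in sorted(models, key=lambda m: (m.size_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


for m in efficiency_frontier(OPEN_MODELS):
    print(f"{m.name}: {m.size_b}B, {m.score}%")
```

With the table's data this yields the same two frontier models listed above: every open model larger than MiniMax M2.5 scores lower than it, so none of them extends the frontier.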

ARC-AGI: frequently asked questions

What is the best open LLM on ARC-AGI?
MiniMax M2.5 is the top open model on ARC-AGI, scoring 63.7%. Among all models tested, including proprietary ones, it is tied for #38: two proprietary models share the same 63.7% score, so it appears on row 40 of the table.
What's the best ARC-AGI model you can run on a 24 GB GPU?
Magistral Small 2506 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 13 GB), scoring 5.0% on ARC-AGI.
Can open models match proprietary models on ARC-AGI?
Not yet. The strongest proprietary model (gemini-3.1-pro-preview) scores 98.0%, which is 34.3 points ahead of the best open model (MiniMax M2.5) at 63.7%. The open model, however, is one you can run yourself.
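
For the 24 GB question above, the "about 13 GB" figure is consistent with simple arithmetic: 23.6B parameters at 4 bits each is 11.8 GB of weights. The sketch below adds an assumed 10% overhead for quantization metadata and runtime buffers. That factor, along with the helper names `quantized_size_gb` and `fits_in_vram`, is our assumption rather than something the source states, and real memory use depends on the quantization format and context length.

```python
def quantized_size_gb(params_billion, bits_per_weight=4.0, overhead=1.10):
    """Estimate memory in GB (10^9 bytes) for a quantized model.

    overhead is an assumed multiplier for quantization scales and
    runtime buffers; it is not a measured value.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9


def fits_in_vram(params_billion, vram_gb=24.0):
    """Return True if the estimated quantized size fits in vram_gb."""
    return quantized_size_gb(params_billion) <= vram_gb


# Open models and their sizes in billions of parameters, from the table.
for name, size_b in [
    ("Magistral Small 2506", 23.6),
    ("Llama 4 Scout 17B 16E Instruct", 108.6),
    ("MiniMax M2.5", 228.7),
]:
    estimate = quantized_size_gb(size_b)
    print(f"{name}: ~{estimate:.1f} GB, fits in 24 GB: {fits_in_vram(size_b)}")
```

Under these assumptions only Magistral Small 2506 fits on a 24 GB card; the next-smallest open model on the leaderboard, at 108.6B parameters, is already well over the limit.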

Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.