ARC-AGI Leaderboard

ARC-AGI tests fluid, abstract reasoning on small visual grid puzzles where each task follows a novel rule the model must infer from a few examples. It deliberately resists memorization and is one of the most-watched measures of general reasoning progress.

Source: epoch · 7 open models ranked + 130 proprietary · Data through May 2026

Open models ranked on ARC-AGI

# shows rank among open models / rank overall (including proprietary).

#          Model                                        Score
1 / 40     MiniMax M2.5 · 228.7B                        63.7%
2 / 59     GLM 5 · 753.9B                               44.7%
3 / 104    DeepSeek R1 0528 · 684.5B                    21.2%
4 / 113    DeepSeek R1 · 684.5B                         15.8%
5 / 122    Qwen3 235B A22B Instruct 2507 · 235.1B       11.0%
6 / 130    Magistral Small 2506 · 23.6B                  5.0%
7 / 136    Llama 4 Scout 17B 16E Instruct · 108.6B       0.5%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: ARC-AGI score (0.5% to 63.7%) against model size (24B to 754B, log scale), with one labeled point per open model.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Magistral Small 2506, 24B, score 5.0% — on the efficiency frontier (best score at its size or smaller).
  • MiniMax M2.5, 229B, score 63.7% — on the efficiency frontier (best score at its size or smaller).
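
The frontier rule described above (the best score available at each size or smaller) can be sketched in a few lines. This is an illustrative reconstruction, not the code the page uses: the function name `efficiency_frontier` and the tuple layout are assumptions, and the data is copied from the open-model table.

```python
# Open models from the table: (name, parameters in billions, ARC-AGI score in %).
MODELS = [
    ("MiniMax M2.5", 228.7, 63.7),
    ("GLM 5", 753.9, 44.7),
    ("DeepSeek R1 0528", 684.5, 21.2),
    ("DeepSeek R1", 684.5, 15.8),
    ("Qwen3 235B A22B Instruct 2507", 235.1, 11.0),
    ("Magistral Small 2506", 23.6, 5.0),
    ("Llama 4 Scout 17B 16E Instruct", 108.6, 0.5),
]


def efficiency_frontier(models):
    """Return the models that score higher than every other model of the same size or smaller.

    Hypothetical helper: the name and signature are illustrative only.
    """
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; among equal sizes, visit the higher score first.
    for name, size_b, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append((name, size_b, score))
            best_so_far = score
    return frontier


if __name__ == "__main__":
    for name, size_b, score in efficiency_frontier(MODELS):
        print(f"{name}: {size_b}B, {score}%")
```

On this data the sketch keeps Magistral Small 2506 and MiniMax M2.5, matching the two frontier points listed above. Every larger model scores below MiniMax M2.5, so none of them extends the frontier.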

ARC-AGI: frequently asked questions

What is the best open LLM on ARC-AGI?
MiniMax M2.5 is the top open model on ARC-AGI, scoring 63.7%. Among all models tested, including proprietary ones, it ranks #38.

What's the best ARC-AGI model you can run on a 24 GB GPU?
Magistral Small 2506 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 13 GB), scoring 5.0% on ARC-AGI.

Can open models match proprietary models on ARC-AGI?
Not yet. The strongest proprietary model (gemini-3.1-pro-preview) scores 98.0%, well ahead of the best open model (MiniMax M2.5) at 63.7%, though the open model is one you can run yourself.
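
The 24 GB answer above rests on simple arithmetic: parameter count times bytes per weight. A rough sketch of that estimate follows. The helper names and the 10% overhead multiplier are assumptions chosen for illustration, not figures from the source, and the estimate ignores context-length-dependent memory such as the KV cache.

```python
def quantized_size_gb(params_billions, bits_per_weight=4, overhead=1.10):
    """Estimate the memory footprint in GB of a model's quantized weights.

    `overhead` is an assumed multiplier for quantization metadata and
    runtime buffers; it is not taken from the source.
    """
    bytes_per_weight = bits_per_weight / 8
    # Billions of parameters times bytes per parameter gives gigabytes.
    return params_billions * bytes_per_weight * overhead


def fits(params_billions, vram_gb=24, **kwargs):
    """Return True if the estimated footprint fits in the given VRAM."""
    return quantized_size_gb(params_billions, **kwargs) <= vram_gb


if __name__ == "__main__":
    # Sizes taken from the open-model table.
    for name, size_b in [("Magistral Small 2506", 23.6),
                         ("Llama 4 Scout 17B 16E Instruct", 108.6)]:
        est = quantized_size_gb(size_b)
        print(f"{name}: ~{est:.1f} GB, fits in 24 GB: {fits(size_b)}")
```

Under these assumptions the 23.6B model comes out near the 13 GB quoted above, while the next-smallest open model in the table, at 108.6B, is well over 24 GB even at 4 bits.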

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the about benchmarks page for what it measures.