ARC-AGI Leaderboard
ARC-AGI tests fluid, abstract reasoning on small visual grid puzzles where each task follows a novel rule the model must infer from a few examples. It deliberately resists memorization and is one of the most-watched measures of general reasoning progress.
Source: epoch · 7 open models ranked + 130 proprietary · Data through May 2026
Open models ranked on ARC-AGI
The # column shows rank among open models / rank overall (including proprietary models).
| # | Model | Score |
|---|---|---|
| 1 / 40 | MiniMax M2.5 · 228.7B | 63.7% |
| 2 / 59 | GLM 5 · 753.9B | 44.7% |
| 3 / 104 | DeepSeek R1 0528 · 684.5B | 21.2% |
| 4 / 113 | DeepSeek R1 · 684.5B | 15.8% |
| 5 / 122 | Qwen3 235B A22B Instruct 2507 · 235.1B | 11.0% |
| 6 / 130 | Magistral Small 2506 · 23.6B | 5.0% |
| 7 / 136 | Llama 4 Scout 17B 16E Instruct · 108.6B | 0.5% |
Score vs model size
These models give the highest score for their size, which makes them the ones worth running locally.
- Magistral Small 2506, 24B, score 5.0% — on the efficiency frontier (best score at its size or smaller).
- MiniMax M2.5, 229B, score 63.7% — on the efficiency frontier (best score at its size or smaller).
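The frontier rule stated above (best score at its size or smaller) can be checked against the table with a short script. This is only a sketch: the `MODELS` list is copied by hand from the table, and the function name `efficiency_frontier` is hypothetical, not something llmrun or epoch publishes.

```python
# (model, parameters in billions, ARC-AGI score in %), copied from the table.
MODELS = [
    ("MiniMax M2.5", 228.7, 63.7),
    ("GLM 5", 753.9, 44.7),
    ("DeepSeek R1 0528", 684.5, 21.2),
    ("DeepSeek R1", 684.5, 15.8),
    ("Qwen3 235B A22B Instruct 2507", 235.1, 11.0),
    ("Magistral Small 2506", 23.6, 5.0),
    ("Llama 4 Scout 17B 16E Instruct", 108.6, 0.5),
]


def efficiency_frontier(models):
    """Return names of models not outscored by any model of equal or smaller size.

    Hypothetical helper: it applies the page's stated rule literally and
    returns the survivors ordered from smallest to largest.
    """
    frontier = []
    for name, size, score in sorted(models, key=lambda m: m[1]):
        # A model is beaten if some other model is no larger and scores higher.
        beaten = any(
            other_size <= size and other_score > score
            for other_name, other_size, other_score in models
            if other_name != name
        )
        if not beaten:
            frontier.append(name)
    return frontier


if __name__ == "__main__":
    for name in efficiency_frontier(MODELS):
        print(name)
```

On the seven open models listed, this rule keeps only Magistral Small 2506 and MiniMax M2.5, matching the two entries above. Every other model is outscored by one of those two at an equal or smaller parameter count.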
ARC-AGI: frequently asked questions
- What is the best open LLM on ARC-AGI?
- MiniMax M2.5 is the top open model on ARC-AGI, scoring 63.7%. Among all models tested — including proprietary ones — it ranks #38.
- What's the best ARC-AGI model you can run on a 24 GB GPU?
- Magistral Small 2506 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 13 GB), scoring 5.0% on ARC-AGI.
- Can open models match proprietary models on ARC-AGI?
- Not yet. The strongest proprietary model (gemini-3.1-pro-preview) scores 98.0%, well ahead of the best open model (MiniMax M2.5) at 63.7%, but the open model is one you can run yourself.
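The 24 GB answer above rests on simple arithmetic: parameter count times bits per weight, divided by eight, gives the size of the weights in bytes. The sketch below is a rough estimate under stated assumptions, not the method llmrun uses. The 10% `overhead_factor` is an assumed allowance for runtime buffers, chosen only for illustration, and the function names are hypothetical.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion, bits_per_weight, vram_gb, overhead_factor=1.1):
    """Rough check of whether a quantized model fits in a given amount of VRAM.

    overhead_factor is an assumption (about 10% extra for runtime buffers);
    real usage depends on context length, the runtime, and the quantization format.
    """
    return weight_memory_gb(params_billion, bits_per_weight) * overhead_factor <= vram_gb


if __name__ == "__main__":
    # Magistral Small 2506: 23.6B parameters at 4-bit quantization.
    print(round(weight_memory_gb(23.6, 4), 1), "GB of weights")
    print("Fits in 24 GB:", fits_in_vram(23.6, 4, 24))
    # MiniMax M2.5: 228.7B parameters at 4-bit quantization.
    print("MiniMax M2.5 fits in 24 GB:", fits_in_vram(228.7, 4, 24))
```

Under these assumptions the weights of Magistral Small 2506 come to roughly 11.8 GB, in line with the "about 13 GB" figure once some overhead is added. The larger open models in the table need far more than 24 GB even at 4 bits.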
Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the about benchmarks page for what it measures.