What is the best open LLM on ARC-AGI?

Kimi K2.5 is the top open model on ARC-AGI, scoring 65.3%. Among all models tested — including proprietary ones — it ranks #55. The top model overall is Gemini 3.1 Pro Preview (Google DeepMind) at 98.0%.

What's the best ARC-AGI model you can run on a 24 GB GPU?

Magistral Small 2506 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 13 GB), scoring 5.0% on ARC-AGI.
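
The "about 13 GB" figure can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is only an illustration: the function names and the 10% overhead factor are assumptions chosen to land near the quoted number, not the method the site actually uses, and it ignores KV cache and activation memory.

```python
def estimate_weights_gb(params_billion: float,
                        bits_per_weight: float = 4.0,
                        overhead: float = 1.10) -> float:
    """Rough size of quantized weights in GB (1 GB = 1e9 bytes).

    `overhead` is an ASSUMED fudge factor for quantization scales and
    layers kept at higher precision; it does not come from the source.
    """
    # billions of parameters * bytes per parameter = gigabytes
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * overhead


def fits(params_billion: float, vram_gb: float = 24.0) -> bool:
    """True if the estimated weights alone fit in the given VRAM."""
    return estimate_weights_gb(params_billion) <= vram_gb


# Sizes taken from the leaderboard table (in billions).
print(round(estimate_weights_gb(23.6), 1), fits(23.6))    # Magistral Small 2506
print(round(estimate_weights_gb(108.6), 1), fits(108.6))  # Llama 4 Scout
```

Under these assumptions, Magistral Small 2506 is the only model in the table whose 4-bit weights come in under 24 GB; the next smallest entry, Llama 4 Scout at 108.6B, is well over.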

Can open models match proprietary models on ARC-AGI?

Not yet: the strongest proprietary model (Gemini 3.1 Pro Preview) scores 98.0%, 32.7 points ahead of the best open model (Kimi K2.5) at 65.3%. The open model's advantage is that you can run it yourself.


ARC-AGI Leaderboard

Name: ARC-AGI — open LLM scores
Creator: epoch

ARC-AGI tests fluid, abstract reasoning on small visual grid puzzles where each task follows a novel rule the model must infer from a few examples. It deliberately resists memorization and is one of the most-watched measures of general reasoning progress.

Source: epoch · 10 open models ranked + 148 proprietary · Data through Jul 2026


Open models ranked on ARC-AGI

# shows rank among open models / rank overall (including proprietary).

# (open / overall)	Model · parameters	Score
1 / 55	Kimi K2.5 · 1058.6B	65.3%
2 / 58	MiniMax M2.5 · 228.7B	63.7%
3 / 66	DeepSeek V3.2 · 685.4B	57.0%
4 / 79	GLM 5 · 753.9B	44.7%
5 / 125	DeepSeek R1 0528 · 684.5B	21.2%
6 / 134	DeepSeek R1 · 684.5B	15.8%
7 / 143	Qwen3 235B A22B Instruct 2507 · 235.1B	11.0%
8 / 151	Magistral Small 2506 · 23.6B	5.0%
9 / 153	Llama 4 Maverick 17B 128E Instruct · 401.6B	4.4%
10 / 157	Llama 4 Scout 17B 16E Instruct · 108.6B	0.5%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
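
As a minimal sketch of how such a frontier could be computed from the table above: sort the models by size and keep each one that beats the best score seen among all smaller models. The `Model` type and `efficiency_frontier` function are hypothetical names introduced for illustration; the chart's actual implementation is not described in the source.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    size_b: float  # size in billions, as listed in the table
    score: float   # ARC-AGI score in percent


# Data copied from the open-model leaderboard table.
MODELS = [
    Model("Kimi K2.5", 1058.6, 65.3),
    Model("MiniMax M2.5", 228.7, 63.7),
    Model("DeepSeek V3.2", 685.4, 57.0),
    Model("GLM 5", 753.9, 44.7),
    Model("DeepSeek R1 0528", 684.5, 21.2),
    Model("DeepSeek R1", 684.5, 15.8),
    Model("Qwen3 235B A22B Instruct 2507", 235.1, 11.0),
    Model("Magistral Small 2506", 23.6, 5.0),
    Model("Llama 4 Maverick 17B 128E Instruct", 401.6, 4.4),
    Model("Llama 4 Scout 17B 16E Instruct", 108.6, 0.5),
]


def efficiency_frontier(models: list[Model]) -> list[Model]:
    """Return models that beat every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Smallest first; among equal sizes, the higher score comes first.
    for m in sorted(models, key=lambda m: (m.size_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


for m in efficiency_frontier(MODELS):
    print(f"{m.name}: {m.size_b}B, {m.score}%")
```

With the table's data, this leaves three models on the frontier: Magistral Small 2506, MiniMax M2.5, and Kimi K2.5. Every other entry is matched or beaten by a smaller model.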


Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the "about benchmarks" page for what it measures.