ARC-AGI Leaderboard
ARC-AGI tests fluid, abstract reasoning on small visual grid puzzles where each task follows a novel rule the model must infer from a few examples. It deliberately resists memorization and is one of the most-watched measures of general reasoning progress.
Source: epoch · 7 open models ranked + 130 proprietary · Data through May 2026
All models ranked on ARC-AGI
Proprietary / closed models are labeled "proprietary"; open models list their parameter count instead. You can't run the proprietary ones locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | gemini-3.1-pro-preview · proprietary | 98.0% |
| 2 | gpt-5.5-pro_high · proprietary | 96.5% |
| 3 | gpt-5.5_xhigh · proprietary | 95.0% |
| 4 | gpt-5.5-pro_xhigh · proprietary | 95.0% |
| 5 | gpt-5.4-pro-2026-03-05_xhigh · proprietary | 94.5% |
| 6 | gpt-5.5_high · proprietary | 94.5% |
| 7 | claude-opus-4-6_120K · proprietary | 94.0% |
| 8 | gpt-5.4-2026-03-05_xhigh · proprietary | 93.7% |
| 9 | claude-opus-4-7_high · proprietary | 93.5% |
| 10 | gpt-5.4-2026-03-05_high · proprietary | 92.7% |
| 11 | gemini-3.5-flash_high · proprietary | 92.5% |
| 12 | gpt-5.5_medium · proprietary | 92.2% |
| 13 | claude-opus-4-7_max · proprietary | 92.0% |
| 14 | claude-opus-4-7_low · proprietary | 91.0% |
| 15 | gpt-5.2-pro-2025-12-11_xhigh · proprietary | 90.5% |
| 16 | grok-4-20 · proprietary | 89.5% |
| 17 | gemini-3-deep-think-preview · proprietary | 87.5% |
| 18 | claude-sonnet-4-6_high · proprietary | 86.5% |
| 19 | gpt-5.2-2025-12-11_xhigh · proprietary | 86.2% |
| 20 | gpt-5.4-2026-03-05_medium · proprietary | 86.2% |
| 21 | claude-sonnet-4-6_max · proprietary | 86.0% |
| 22 | gpt-5.2-pro-2025-12-11_high · proprietary | 85.7% |
| 23 | gpt-5.2-pro-2025-12-11_medium · proprietary | 81.2% |
| 24 | claude-opus-4-5-20251101_64K · proprietary | 80.0% |
| 25 | gpt-5.2-2025-12-11_high · proprietary | 78.7% |
| 26 | gpt-5.5_low · proprietary | 76.2% |
| 27 | claude-opus-4-5-20251101_32K · proprietary | 75.8% |
| 28 | gemini-3-pro-preview · proprietary | 75.0% |
| 29 | gpt-5.1-2025-11-13_high · proprietary | 72.8% |
| 30 | gpt-5.2-2025-12-11_medium · proprietary | 72.7% |
| 31 | claude-opus-4-5-20251101_16K · proprietary | 72.0% |
| 32 | gpt-5-pro-2025-10-06_high · proprietary | 70.2% |
| 33 | gpt-5-pro-2025-10-06_unknown · proprietary | 70.2% |
| 34 | gpt-5.4-2026-03-05_low · proprietary | 68.2% |
| 35 | grok-4-0709 · proprietary | 66.7% |
| 36 | gpt-5-2025-08-07_high · proprietary | 65.7% |
| 37 | kimi-k2.5 · proprietary | 65.3% |
| 38 | claude-sonnet-4-5-20250929_32K · proprietary | 63.7% |
| 39 | gpt-5.4-mini-2026-03-17_xhigh · proprietary | 63.7% |
| 40 | MiniMax M2.5 · 228.7B | 63.7% |
| 41 | o3-2025-04-16_high · proprietary | 60.8% |
| 42 | o3-pro-2025-06-10_high · proprietary | 59.3% |
| 43 | o4-mini-2025-04-16_high · proprietary | 58.7% |
| 44 | claude-opus-4-5-20251101_8K · proprietary | 58.7% |
| 45 | gpt-5.4-mini-2026-03-17_high · proprietary | 58.0% |
| 46 | gpt-5.1-2025-11-13_medium · proprietary | 57.7% |
| 47 | deepseek/deepseek-v3.2 · proprietary | 57.0% |
| 48 | o3-pro-2025-06-10_medium · proprietary | 57.0% |
| 49 | gpt-5-2025-08-07_medium · proprietary | 56.2% |
| 50 | gpt-5.2-2025-12-11_low · proprietary | 55.7% |
| 51 | gpt-5-mini-2025-08-07_high · proprietary | 54.3% |
| 52 | o3-2025-04-16_medium · proprietary | 53.8% |
| 53 | gpt-5.4-nano-2026-03-17_xhigh · proprietary | 51.5% |
| 54 | gemini-3.5-flash_minimal · proprietary | 48.8% |
| 55 | grok-4-fast · proprietary | 48.5% |
| 56 | claude-sonnet-4-5-20250929_16K · proprietary | 48.3% |
| 57 | claude-haiku-4-5-20251001_32K · proprietary | 47.7% |
| 58 | claude-sonnet-4-5-20250929_8K · proprietary | 46.5% |
| 59 | GLM 5 · 753.9B | 44.7% |
| 60 | o3-pro-2025-06-10_low · proprietary | 44.3% |
| 61 | gpt-5-2025-08-07_low · proprietary | 44.0% |
| 62 | o4-mini-2025-04-16_medium · proprietary | 41.8% |
| 63 | o3-2025-04-16_low · proprietary | 41.5% |
| 64 | gemini-2.5-pro_16K · proprietary | 41.0% |
| 65 | gpt-5.4-mini-2026-03-17_medium · proprietary | 40.8% |
| 66 | claude-opus-4-5-20251101 · proprietary | 40.0% |
| 67 | claude-sonnet-4-20250514_16K · proprietary | 40.0% |
| 68 | tiny-recursion-model · proprietary | 40.0% |
| 69 | gpt-5.4-nano-2026-03-17_high · proprietary | 38.2% |
| 70 | claude-haiku-4-5-20251001_16K · proprietary | 37.3% |
| 71 | gpt-5-mini-2025-08-07_medium · proprietary | 37.3% |
| 72 | gemini-2.5-pro_32K · proprietary | 37.0% |
| 73 | claude-opus-4-20250514_16K · proprietary | 35.7% |
| 74 | o3-mini-2025-01-31_high · proprietary | 34.5% |
| 75 | gemini-2.5-flash-preview-05-20 · proprietary | 33.3% |
| 76 | gemini-2.5-flash-preview-05-20_16K · proprietary | 33.3% |
| 77 | gpt-5.1-2025-11-13_low · proprietary | 33.2% |
| 78 | gemini-2.5-pro-preview-03-25 · proprietary | 33.0% |
| 79 | gpt-5.4-nano-2026-03-17_medium · proprietary | 33.0% |
| 80 | gemini-2.5-flash-preview-05-20_23K · proprietary | 32.3% |
| 81 | gemini-2.5-flash-preview-04-17 (24K thinking) · proprietary | 32.3% |
| 82 | gemini-2.5-pro-preview-06-05_1K · proprietary | 31.3% |
| 83 | claude-sonnet-4-5-20250929_1K · proprietary | 31.0% |
| 84 | claude-opus-4-20250514_8K · proprietary | 30.7% |
| 85 | o1-2024-12-17_medium · proprietary | 30.7% |
| 86 | gemini-2.5-pro_8K · proprietary | 29.5% |
| 87 | claude-sonnet-4-20250514_8K · proprietary | 29.0% |
| 88 | claude-3-7-sonnet-20250219_16K · proprietary | 28.6% |
| 89 | claude-sonnet-4-20250514_1K · proprietary | 28.0% |
| 90 | codex-mini-2025-05-16 · proprietary | 27.3% |
| 91 | o1-2024-12-17_low · proprietary | 27.2% |
| 92 | claude-opus-4-20250514_1K · proprietary | 27.0% |
| 93 | gpt-5-mini-2025-08-07_low · proprietary | 26.3% |
| 94 | gemini-2.5-flash-preview-05-20_8K · proprietary | 25.8% |
| 95 | claude-haiku-4-5-20251001_8K · proprietary | 25.5% |
| 96 | claude-sonnet-4-5-20250929 · proprietary | 25.5% |
| 97 | claude-sonnet-4-20250514 · proprietary | 23.8% |
| 98 | o1-pro-2025-03-19_low · proprietary | 23.3% |
| 99 | claude-opus-4-20250514 · proprietary | 22.5% |
| 100 | o3-mini-2025-01-31_medium · proprietary | 22.3% |
| 101 | gemini-3-flash-preview · proprietary | 21.5% |
| 102 | o4-mini-2025-04-16_low · proprietary | 21.3% |
| 103 | claude-3-7-sonnet-20250219_8K · proprietary | 21.2% |
| 104 | DeepSeek R1 0528 · 684.5B | 21.2% |
| 105 | gpt-5-nano-2025-08-07_medium · proprietary | 20.7% |
| 106 | gpt-5.4-nano-2026-03-17_low · proprietary | 18.3% |
| 107 | o1-preview-2024-09-12 · proprietary | 18.0% |
| 108 | claude-haiku-4-5-20251001_1K · proprietary | 16.8% |
| 109 | gpt-5-nano-2025-08-07_high · proprietary | 16.7% |
| 110 | grok-3-mini_low · proprietary | 16.5% |
| 111 | grok-3-mini-beta_low · proprietary | 16.5% |
| 112 | gemini-2.5-flash-preview-05-20_1K · proprietary | 16.0% |
| 113 | DeepSeek R1 · 684.5B | 15.8% |
| 114 | o3-mini-2025-01-31_low · proprietary | 14.5% |
| 115 | claude-haiku-4-5-20251001 · proprietary | 14.3% |
| 116 | o1-mini-2024-09-12_medium · proprietary | 14.0% |
| 117 | o1-mini-2024-09-12_unknown · proprietary | 14.0% |
| 118 | claude-3-7-sonnet-20250219 · proprietary | 13.6% |
| 119 | gpt-5.4-mini-2026-03-17_low · proprietary | 13.0% |
| 120 | gpt-5.2-2025-12-11_unknown · proprietary | 12.3% |
| 121 | claude-3-7-sonnet-20250219_1K · proprietary | 11.6% |
| 122 | Qwen3 235B A22B Instruct 2507 · 235.1B | 11.0% |
| 123 | gpt-4.5-preview-2025-02-27 · proprietary | 10.3% |
| 124 | gpt-5-2025-08-07_minimal · proprietary | 6.0% |
| 125 | magistral-medium-2506 · proprietary | 5.9% |
| 126 | gpt-5.1-2025-11-13_none · proprietary | 5.8% |
| 127 | gpt-4.1-2025-04-14 · proprietary | 5.5% |
| 128 | grok-3 · proprietary | 5.5% |
| 129 | gpt-5-mini-2025-08-07_minimal · proprietary | 5.3% |
| 130 | Magistral Small 2506 · 23.6B | 5.0% |
| 131 | gpt-4o-2024-11-20 · proprietary | 4.5% |
| 132 | Llama-4-Maverick-17B-128E-Instruct · proprietary | 4.4% |
| 133 | gpt-5-nano-2025-08-07_low · proprietary | 4.0% |
| 134 | gpt-4.1-mini-2025-04-14 · proprietary | 3.5% |
| 135 | gpt-5-nano-2025-08-07_minimal · proprietary | 1.5% |
| 136 | Llama 4 Scout 17B 16E Instruct · 108.6B | 0.5% |
| 137 | gpt-4.1-nano-2025-04-14 · proprietary | 0.0% |
Score vs model size
Which models give the most quality for their size — the ones worth running locally.
- Magistral Small 2506, 24B, score 5.0% — on the efficiency frontier (best score at its size or smaller).
- MiniMax M2.5, 229B, score 63.7% — on the efficiency frontier (best score at its size or smaller).
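The frontier rule quoted above ("best score at its size or smaller") can be sketched in a few lines. This is a minimal illustration, not the site's actual implementation: the `Model` type and the `efficiency_frontier` function are hypothetical names, the data is copied from the seven open-model rows of the table, and score ties are resolved in favor of the smaller model, which is an assumption.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions, as listed in the table
    score: float     # ARC-AGI score in percent


# The seven open models from the leaderboard table.
OPEN_MODELS = [
    Model("MiniMax M2.5", 228.7, 63.7),
    Model("GLM 5", 753.9, 44.7),
    Model("DeepSeek R1 0528", 684.5, 21.2),
    Model("DeepSeek R1", 684.5, 15.8),
    Model("Qwen3 235B A22B Instruct 2507", 235.1, 11.0),
    Model("Magistral Small 2506", 23.6, 5.0),
    Model("Llama 4 Scout 17B 16E Instruct", 108.6, 0.5),
]


def efficiency_frontier(models):
    """Return the models that no same-size-or-smaller model matches or beats."""
    frontier = []
    best_so_far = float("-inf")
    # Sweep from smallest to largest; at equal size, visit the higher score first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        # Assumption: a larger model that only ties a smaller one is excluded.
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


for m in efficiency_frontier(OPEN_MODELS):
    print(f"{m.name}: {m.params_b}B, {m.score}%")
```

On this data the sweep keeps only Magistral Small 2506 and MiniMax M2.5, matching the two entries listed above. Every larger open model scores below MiniMax M2.5, so none of them joins the frontier.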
ARC-AGI: frequently asked questions
- What is the best open LLM on ARC-AGI?
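- MiniMax M2.5 is the top open model on ARC-AGI, scoring 63.7%. Among all models tested, including proprietary ones, it ranks #38 when tied scores share a rank: the table lists it at #40, level at 63.7% with the two entries directly above it.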
- What's the best ARC-AGI model you can run on a 24 GB GPU?
- Magistral Small 2506 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 13 GB), scoring 5.0% on ARC-AGI.
- Can open models match proprietary models on ARC-AGI?
- Not yet: the strongest proprietary model (gemini-3.1-pro-preview) scores 98.0%, 34.3 points ahead of the best open model (MiniMax M2.5) at 63.7%. The open model, however, is one you can run yourself.
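The 24 GB answer above rests on a weight-memory estimate. A rough sketch of that arithmetic follows, under a loudly stated assumption: the page does not say how "about 13 GB" was derived, so the 4.5 effective bits per weight used here (4-bit values plus some overhead for quantization scales) is a guess that happens to land near that figure. The function names are hypothetical, and the estimate ignores the KV cache and activations.

```python
# Open models from the leaderboard table: (name, parameters in billions, score %).
OPEN_MODELS = [
    ("MiniMax M2.5", 228.7, 63.7),
    ("GLM 5", 753.9, 44.7),
    ("DeepSeek R1 0528", 684.5, 21.2),
    ("DeepSeek R1", 684.5, 15.8),
    ("Qwen3 235B A22B Instruct 2507", 235.1, 11.0),
    ("Magistral Small 2506", 23.6, 5.0),
    ("Llama 4 Scout 17B 16E Instruct", 108.6, 0.5),
]


def quantized_weight_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.5 is an assumed effective rate for 4-bit quantization,
    not a figure taken from the source.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def best_model_for_vram(models, vram_gb: float = 24.0):
    """Return the highest-scoring model whose weights fit in vram_gb, or None."""
    fitting = [m for m in models if quantized_weight_gb(m[1]) < vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


name, params_b, score = best_model_for_vram(OPEN_MODELS)
print(f"{name}: ~{quantized_weight_gb(params_b):.1f} GB of weights, {score}% on ARC-AGI")
```

Under this assumption Magistral Small 2506 needs roughly 13.3 GB for its weights, while the next-smallest open model, Llama 4 Scout at 108.6B parameters, needs about 61 GB, so Magistral Small 2506 is the only open entry in the table that fits on a 24 GB card.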
Scores are aggregated from Epoch. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.