What is the best open LLM on SimpleBench?

DeepSeek V4 Pro is the top open model on SimpleBench, scoring 61.2%. Among all models tested — including proprietary ones — it ranks #22. The top model overall is Claude Fable 5 Max (Anthropic) at 81.9%.

Can open models match proprietary models on SimpleBench?

Not quite: the strongest proprietary model (Claude Fable 5 Max) scores 81.9%, about 20.7 points ahead of the best open model (DeepSeek V4 Pro) at 61.2%. The open model, however, is one you can run yourself.

SimpleBench Leaderboard

Name: SimpleBench — open LLM scores
Creator: epoch

SimpleBench is a set of everyday, common-sense and trick questions that humans answer easily but language models often get wrong. It probes basic reasoning and robustness rather than specialist knowledge.

Source: epoch · 19 open models ranked + 71 proprietary · Data through Jul 2026

All models ranked on SimpleBench

Proprietary / closed models are shown dimmed and labeled "proprietary" in place of a parameter count; you can't run them locally, but they show where the open field stands.

#	Model	Score
1	Claude Fable 5 Max · proprietary	81.9%
2	Gemini 3.1 Pro Preview · proprietary	79.6%
3	GPT 5.5 Pro (high) · proprietary	76.9%
4	GPT 5.5 Pro (unspecified) · proprietary	76.9%
5	Gemini 3.5 Flash (unspecified) · proprietary	76.7%
6	Gemini 3 Pro Preview · proprietary	76.4%
7	GPT 5.4 Pro (Mar 05, 2026, unspecified) · proprietary	74.1%
8	GPT 5.6 Sol Pro (unknown) · proprietary	71.7%
9	Qwen3.7 Max · proprietary	70.4%
10	Grok 4.5 (unspecified) · proprietary	70.0%
11	GPT 5.5 (unspecified) · proprietary	69.0%
12	Claude Opus 4.6 · proprietary	67.6%
13	Claude Opus 4.8 (unspecified) · proprietary	64.8%
14	GPT 5.6 Sol (unspecified) · proprietary	64.8%
15	Qwen3.6 Max Preview · proprietary	63.0%
16	Claude Opus 4.7 (unspecified) · proprietary	62.9%
17	Gemini 2.5 Pro Preview (Jun 05) · proprietary	62.4%
18	Claude Opus 4.5 (Nov 01, 2025, unspecified) · proprietary	62.0%
19	Claude Opus 4.5 (Nov 01, 2025) · proprietary	62.0%
20	GPT 5 Pro (Oct 06, 2025, high) · proprietary	61.6%
21	GPT 5 Pro (Oct 06, 2025, unspecified) · proprietary	61.6%
22	DeepSeek V4 Pro · 861.6B	61.2%
23	Gemini 3 Flash Preview · proprietary	61.1%
24	Claude Sonnet 5 (unspecified) · proprietary	60.6%
25	Grok 4 (Jul 09) · proprietary	60.5%
26	Claude Opus 4.1 (Aug 05, 2025, unspecified) · proprietary	60.0%
27	Claude Opus 4.1 (Aug 05, 2025) · proprietary	60.0%
28	Claude Opus 4 (May 14, 2025, 12K) · proprietary	58.8%
29	Claude Opus 4 (May 14, 2025, unspecified) · proprietary	58.8%
30	Kimi K2.7 Code · 1058.6B	57.9%
31	GPT 5.2 Pro (Dec 11, 2025) · proprietary	57.4%
32	GPT 5 (Aug 07, 2025, high) · proprietary	56.7%
33	Grok 4.1 Fast Non Reasoning · proprietary	56.0%
34	Grok 4.1 Fast Reasoning · proprietary	56.0%
35	GLM 5.1 · 753.9B	55.1%
36	Claude Sonnet 4.5 (Sep 29, 2025, 12K) · proprietary	54.3%
37	Claude Sonnet 4.5 (Sep 29, 2025, unspecified) · proprietary	54.3%
38	GLM 5 · 753.9B	53.2%
39	GPT 5.1 (Nov 13, 2025, high) · proprietary	53.2%
40	O3 (Apr 16, 2025, high) · proprietary	53.1%
41	DeepSeek V3.2 Speciale · 685.4B	52.6%
42	Gemini 2.5 Pro Exp (Mar 25) · proprietary	51.6%
43	Gemini 2.5 Pro Preview (Mar 25) · proprietary	51.6%
44	GPT 5.6 Terra (unspecified) · proprietary	48.9%
45	GLM 4.7 · 358.3B	47.7%
46	GPT 5.6 Luna (unspecified) · proprietary	46.8%
47	Kimi K2.5 · 1058.6B	46.8%
48	Claude 3.7 Sonnet (Feb 19, 2025, 12K) · proprietary	46.4%
49	Claude 3.7 Sonnet (Feb 19, 2025, unspecified) · proprietary	46.4%
50	GPT 5.2 (Dec 11, 2025, high) · proprietary	45.8%
51	GPT 5.2 (Dec 11, 2025, unspecified) · proprietary	45.8%
52	MiniMax M3 · 427.0B	45.8%
53	Claude Sonnet 4 (May 14, 2025, 12K) · proprietary	45.5%
54	Claude Sonnet 4 (May 14, 2025, unspecified) · proprietary	45.5%
55	Claude 3.7 Sonnet (Feb 19, 2025) · proprietary	44.9%
56	O1 Preview (Sep 12, 2024) · proprietary	41.7%
57	Claude 3.5 Sonnet (Oct 22, 2024) · proprietary	41.4%
58	Gemini 2.5 Flash · proprietary	41.2%
59	DeepSeek R1 0528 · 684.5B	40.8%
60	O1 (Dec 17, 2024, high) · proprietary	40.1%
61	DeepSeek V3.1 · 684.5B	40.0%
62	O4 Mini (Apr 16, 2025, high) · proprietary	38.7%
63	O1 (Dec 17, 2024, medium) · proprietary	36.7%
64	Grok 3 · proprietary	36.1%
65	Qwen3.6 Flash · proprietary	35.2%
66	GPT 4.5 Preview (Feb 27, 2025) · proprietary	34.5%
67	Gemini Exp (Dec 06) · proprietary	31.1%
68	Qwen3 235B A22B · 235.1B	31.0%
69	DeepSeek R1 · 684.5B	30.9%
70	Gemini 2.0 Flash Thinking Exp (Jan 21) · proprietary	30.7%
71	Llama 4 Maverick 17B 128E Instruct · 401.6B	27.7%
72	Claude 3.5 Sonnet (Jun 20, 2024) · proprietary	27.5%
73	DeepSeek v3 0324 · 684.5B	27.2%
74	Gemini 1.5 Pro 002 · proprietary	27.1%
75	GPT 4.1 (Apr 14, 2025) · proprietary	27.0%
76	Kimi K2 Instruct · 1026.5B	26.3%
77	GPT 4 Turbo (Apr 09, 2024) · proprietary	25.1%
78	Claude 3 Opus (Feb 29, 2024) · proprietary	23.5%
79	Llama 3.1 405B Instruct · 405.9B	23.0%
80	O3 Mini (Jan 31, 2025, high) · proprietary	22.8%
81	Grok 2 (Dec 12) · proprietary	22.7%
82	Mistral Large 2407 · proprietary	22.5%
83	GPT OSS 120B · 120.4B	22.1%
84	Llama 3.3 70B Instruct · 70.6B	19.9%
85	DeepSeek v3 · 684.5B	18.9%
86	Gemini 2.0 Flash Exp · proprietary	18.9%
87	O1 Mini (Sep 12, 2024, medium) · proprietary	18.1%
88	GPT 4o (Aug 06, 2024) · proprietary	17.8%
89	C4ai Command R Plus (Aug 2024) · proprietary	17.4%
90	GPT 4o Mini (Jul 18, 2024) · proprietary	10.7%
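
The rows above follow a regular pattern: rank, then the model name, then either a parameter count (for open models) or the word "proprietary", then the score. As a rough sketch of how such rows could be read programmatically, assuming the tab-separated layout shown here, the snippet below parses a few sample rows and recomputes the gap between the best overall and the best open model. The names `Row` and `parse_row` are hypothetical and are not part of any llmrun or epoch tooling.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    rank: int
    name: str
    params_b: Optional[float]  # billions of parameters; None for proprietary models
    score: float               # SimpleBench score in percent


def parse_row(line: str) -> Row:
    """Parse one 'rank<TAB>name · size-or-proprietary<TAB>score%' line."""
    rank, model, score = line.split("\t")
    # The tag after the last ' · ' is either 'proprietary' or a size like '861.6B'.
    name, _, tag = model.rpartition(" · ")
    params_b = None if tag == "proprietary" else float(tag.rstrip("B"))
    return Row(int(rank), name, params_b, float(score.rstrip("%")))


# Three rows copied from the table above (assumed to be tab-separated).
SAMPLE = [
    "1\tClaude Fable 5 Max · proprietary\t81.9%",
    "22\tDeepSeek V4 Pro · 861.6B\t61.2%",
    "30\tKimi K2.7 Code · 1058.6B\t57.9%",
]

rows = [parse_row(line) for line in SAMPLE]
open_rows = [r for r in rows if r.params_b is not None]

best_overall = max(rows, key=lambda r: r.score)
best_open = max(open_rows, key=lambda r: r.score)
gap = round(best_overall.score - best_open.score, 1)

print(f"Best open model: {best_open.name} (#{best_open.rank}), "
      f"{gap} points behind {best_overall.name}")
```

Keeping the parameter count as `None` for proprietary rows makes the open/closed split a simple filter, which mirrors how the page separates the two groups.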

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
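
One way to read "the best score you can get at each size or smaller" is as a running maximum over models sorted by size. The sketch below applies that reading to the 19 open models in the table; it is an assumption about how such a frontier could be computed, not a description of how the chart on this page is actually drawn. The function name `efficiency_frontier` is hypothetical.

```python
def efficiency_frontier(models):
    """Return models that score strictly higher than every model
    that is smaller or the same size (running max by size)."""
    frontier = []
    best = float("-inf")
    # Sort by size ascending; at equal size the higher score comes first,
    # so only the strongest model of a given size can join the frontier.
    for name, size_b, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best:
            frontier.append((name, size_b, score))
            best = score
    return frontier


# (name, parameters in billions, SimpleBench score in %) from the table above.
OPEN_MODELS = [
    ("DeepSeek V4 Pro", 861.6, 61.2),
    ("Kimi K2.7 Code", 1058.6, 57.9),
    ("GLM 5.1", 753.9, 55.1),
    ("GLM 5", 753.9, 53.2),
    ("DeepSeek V3.2 Speciale", 685.4, 52.6),
    ("GLM 4.7", 358.3, 47.7),
    ("Kimi K2.5", 1058.6, 46.8),
    ("MiniMax M3", 427.0, 45.8),
    ("DeepSeek R1 0528", 684.5, 40.8),
    ("DeepSeek V3.1", 684.5, 40.0),
    ("Qwen3 235B A22B", 235.1, 31.0),
    ("DeepSeek R1", 684.5, 30.9),
    ("Llama 4 Maverick 17B 128E Instruct", 401.6, 27.7),
    ("DeepSeek v3 0324", 684.5, 27.2),
    ("Kimi K2 Instruct", 1026.5, 26.3),
    ("Llama 3.1 405B Instruct", 405.9, 23.0),
    ("GPT OSS 120B", 120.4, 22.1),
    ("Llama 3.3 70B Instruct", 70.6, 19.9),
    ("DeepSeek v3", 684.5, 18.9),
]

frontier = efficiency_frontier(OPEN_MODELS)
frontier_names = [name for name, _, _ in frontier]

for name, size_b, score in frontier:
    print(f"{size_b:>7.1f}B  {score:>5.1f}%  {name}")
```

Under this reading, a model such as MiniMax M3 (427.0B, 45.8%) falls below the frontier because the smaller GLM 4.7 (358.3B) already scores 47.7%.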

Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.