Reasoning

SimpleBench Leaderboard

SimpleBench is a set of everyday, common-sense and trick questions that humans answer easily but language models often get wrong. It probes basic reasoning and robustness rather than specialist knowledge.

Source: epoch · 10 open models ranked + 66 proprietary · Data through Apr 2026

All models ranked on SimpleBench

Proprietary / closed models are labeled "proprietary" in the table. You can't run them locally, but they show where the open field stands.

# | Model | Score
1 | gemini-3.1-pro-preview · proprietary | 79.6%
2 | gpt-5.5-pro_unknown · proprietary | 76.9%
3 | gemini-3-pro-preview · proprietary | 76.4%
4 | gpt-5.4-pro-2026-03-05_unknown · proprietary | 74.1%
5 | gpt-5.5_unknown · proprietary | 69.0%
6 | claude-opus-4-6 · proprietary | 67.6%
7 | claude-opus-4-7_unknown · proprietary | 62.9%
8 | gemini-2.5-pro-preview-06-05 · proprietary | 62.4%
9 | claude-opus-4-5-20251101 · proprietary | 62.0%
10 | claude-opus-4-5-20251101_unknown · proprietary | 62.0%
11 | gpt-5-pro-2025-10-06_high · proprietary | 61.6%
12 | gpt-5-pro-2025-10-06_unknown · proprietary | 61.6%
13 | deepseek-v4-pro_unknown · proprietary | 61.2%
14 | gemini-3-flash-preview · proprietary | 61.1%
15 | grok-4-0709 · proprietary | 60.5%
16 | claude-opus-4-1-20250805 · proprietary | 60.0%
17 | claude-opus-4-1-20250805_unknown · proprietary | 60.0%
18 | claude-opus-4-20250514_12K · proprietary | 58.8%
19 | claude-opus-4-20250514_unknown · proprietary | 58.8%
20 | GLM 5.1 · 753.9B | 58.7%
21 | gpt-5.2-pro-2025-12-11 · proprietary | 57.4%
22 | gpt-5-2025-08-07_high · proprietary | 56.7%
23 | grok-4-1-fast-non-reasoning · proprietary | 56.0%
24 | grok-4-1-fast-reasoning · proprietary | 56.0%
25 | claude-sonnet-4-5-20250929_12K · proprietary | 54.3%
26 | claude-sonnet-4-5-20250929_unknown · proprietary | 54.3%
27 | GLM 5 · 753.9B | 53.2%
28 | gpt-5.1-2025-11-13_high · proprietary | 53.2%
29 | o3-2025-04-16_high · proprietary | 53.1%
30 | DeepSeek-V3.2-Speciale · proprietary | 52.6%
31 | gemini-2.5-pro-exp-03-25 · proprietary | 51.6%
32 | gemini-2.5-pro-preview-03-25 · proprietary | 51.6%
33 | GLM 4.7 · 358.3B | 47.7%
34 | zai-org/glm-4.7 · proprietary | 47.7%
35 | kimi-k2.5 · proprietary | 46.8%
36 | claude-3-7-sonnet-20250219_12K · proprietary | 46.4%
37 | claude-3-7-sonnet-20250219_unknown · proprietary | 46.4%
38 | gpt-5.2-2025-12-11_high · proprietary | 45.8%
39 | gpt-5.2-2025-12-11_unknown · proprietary | 45.8%
40 | claude-sonnet-4-20250514_12K · proprietary | 45.5%
41 | claude-sonnet-4-20250514_unknown · proprietary | 45.5%
42 | claude-3-7-sonnet-20250219 · proprietary | 44.9%
43 | o1-preview-2024-09-12 · proprietary | 41.7%
44 | claude-3-5-sonnet-20241022 · proprietary | 41.4%
45 | gemini-2.5-flash · proprietary | 41.2%
46 | DeepSeek R1 0528 · 684.5B | 40.8%
47 | o1-2024-12-17_high · proprietary | 40.1%
48 | DeepSeek-V3.1 · proprietary | 40.0%
49 | o4-mini-2025-04-16_high · proprietary | 38.7%
50 | o1-2024-12-17_medium · proprietary | 36.7%
51 | grok-3 · proprietary | 36.1%
52 | gpt-4.5-preview-2025-02-27 · proprietary | 34.5%
53 | gemini-exp-1206 · proprietary | 31.1%
54 | Qwen3 235B A22B · 235.1B | 31.0%
55 | DeepSeek R1 · 684.5B | 30.9%
56 | gemini-2.0-flash-thinking-exp-01-21 · proprietary | 30.7%
57 | Llama-4-Maverick-17B-128E-Instruct · proprietary | 27.7%
58 | claude-3-5-sonnet-20240620 · proprietary | 27.5%
59 | DeepSeek v3 0324 · 684.5B | 27.2%
60 | gemini-1.5-pro-002 · proprietary | 27.1%
61 | gpt-4.1-2025-04-14 · proprietary | 27.0%
62 | Kimi K2 Instruct · 1026.5B | 26.3%
63 | gpt-4-turbo-2024-04-09 · proprietary | 25.1%
64 | claude-3-opus-20240229 · proprietary | 23.5%
65 | Llama-3.1-405B-Instruct · proprietary | 23.0%
66 | o3-mini-2025-01-31_high · proprietary | 22.8%
67 | grok-2-1212 · proprietary | 22.7%
68 | mistral-large-2407 · proprietary | 22.5%
69 | GPT OSS 120B · 120.4B | 22.1%
70 | Llama 3.3 70B Instruct · 70.6B | 19.9%
71 | DeepSeek-V3 · proprietary | 18.9%
72 | gemini-2.0-flash-exp · proprietary | 18.9%
73 | o1-mini-2024-09-12_medium · proprietary | 18.1%
74 | gpt-4o-2024-08-06 · proprietary | 17.8%
75 | c4ai-command-r-plus-08-2024 · proprietary | 17.4%
76 | gpt-4o-mini-2024-07-18 · proprietary | 10.7%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: score (19.9% to 58.7%) against model size (log scale, 100B to 1T). Points off the frontier: GLM 5 · 754B · 53.2%; DeepSeek R1 0528 · 685B · 40.8%; DeepSeek R1 · 685B · 30.9%; DeepSeek v3 0324 · 685B · 27.2%; Kimi K2 Instruct · 1T · 26.3%. Frontier points are listed below.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Llama 3.3 70B Instruct, 71B, score 19.9% — on the efficiency frontier (best score at its size or smaller).
  • GPT OSS 120B, 120B, score 22.1% — on the efficiency frontier (best score at its size or smaller).
  • Qwen3 235B A22B, 235B, score 31.0% — on the efficiency frontier (best score at its size or smaller).
  • GLM 4.7, 358B, score 47.7% — on the efficiency frontier (best score at its size or smaller).
  • GLM 5.1, 754B, score 58.7% — on the efficiency frontier (best score at its size or smaller).
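
The frontier listed above can be reproduced with a short sketch. It assumes the frontier rule is exactly the one stated in the caption (a model is on the frontier if no model of equal or smaller size scores higher) and uses the sizes and scores from the leaderboard table. The names `Model`, `MODELS`, and `efficiency_frontier` are hypothetical, chosen for illustration; this is not llmrun's actual code.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    size_b: float  # parameter count in billions, as listed in the table
    score: float   # SimpleBench score in percent


# The 10 open models from the leaderboard (sizes and scores copied from the table).
MODELS = [
    Model("Llama 3.3 70B Instruct", 70.6, 19.9),
    Model("GPT OSS 120B", 120.4, 22.1),
    Model("Qwen3 235B A22B", 235.1, 31.0),
    Model("GLM 4.7", 358.3, 47.7),
    Model("DeepSeek R1 0528", 684.5, 40.8),
    Model("DeepSeek R1", 684.5, 30.9),
    Model("DeepSeek v3 0324", 684.5, 27.2),
    Model("GLM 5", 753.9, 53.2),
    Model("GLM 5.1", 753.9, 58.7),
    Model("Kimi K2 Instruct", 1026.5, 26.3),
]


def efficiency_frontier(models: list[Model]) -> list[Model]:
    """Return models that beat every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Sort by size ascending; at equal size, put the higher score first
    # so a same-size runner-up is never added to the frontier.
    for m in sorted(models, key=lambda m: (m.size_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


if __name__ == "__main__":
    for m in efficiency_frontier(MODELS):
        print(f"{m.name}, {m.size_b:.0f}B, score {m.score}%")
```

Under that assumed rule, the sketch selects the same five models as the list above. GLM 5 drops out because GLM 5.1 has the same size and a higher score.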

SimpleBench: frequently asked questions

What is the best open LLM on SimpleBench?
GLM 5.1 is the top open model on SimpleBench, scoring 58.7%. Among all models tested — including proprietary ones — it ranks #20.
Can open models match proprietary models on SimpleBench?
Not yet. The strongest proprietary model (gemini-3.1-pro-preview) scores 79.6%, which is 20.9 percentage points ahead of the best open model (GLM 5.1) at 58.7%. The open model, however, is one you can run yourself.
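
As a rough illustration of how the two answers above follow from the leaderboard, the sketch below takes a small excerpt of the table and finds the best open model, the best proprietary model, and the gap between them. The tuple layout and the names `ROWS`, `best`, and `gap_points` are assumptions made for this example, not part of any llmrun or epoch API.

```python
# Excerpt of leaderboard rows: (rank, model, score in percent, is_open).
# Only a few rows are included; the values are copied from the table.
ROWS = [
    (1, "gemini-3.1-pro-preview", 79.6, False),
    (2, "gpt-5.5-pro_unknown", 76.9, False),
    (20, "GLM 5.1", 58.7, True),
    (27, "GLM 5", 53.2, True),
    (33, "GLM 4.7", 47.7, True),
]


def best(rows, is_open):
    """Return the highest-scoring row among open (or proprietary) models."""
    candidates = [r for r in rows if r[3] == is_open]
    return max(candidates, key=lambda r: r[2])


best_open = best(ROWS, is_open=True)
best_closed = best(ROWS, is_open=False)

# Round to one decimal to avoid floating-point noise in the subtraction.
gap_points = round(best_closed[2] - best_open[2], 1)

if __name__ == "__main__":
    print(f"Best open: {best_open[1]} at {best_open[2]}% (rank #{best_open[0]})")
    print(f"Best proprietary: {best_closed[1]} at {best_closed[2]}%")
    print(f"Gap: {gap_points} percentage points")
```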

Scores are aggregated from epoch. llmrun does not run this benchmark itself. See the source for methodology, or the "about benchmarks" page for what it measures.