Reasoning
SimpleBench Leaderboard
SimpleBench is a set of everyday, common-sense and trick questions that humans answer easily but language models often get wrong. It probes basic reasoning and robustness rather than specialist knowledge.
Source: epoch · 10 open models ranked + 66 proprietary · Data through Apr 2026
All models ranked on SimpleBench
Proprietary (closed) models are marked "proprietary" in the table. You can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | gemini-3.1-pro-preview · proprietary | 79.6% |
| 2 | gpt-5.5-pro_unknown · proprietary | 76.9% |
| 3 | gemini-3-pro-preview · proprietary | 76.4% |
| 4 | gpt-5.4-pro-2026-03-05_unknown · proprietary | 74.1% |
| 5 | gpt-5.5_unknown · proprietary | 69.0% |
| 6 | claude-opus-4-6 · proprietary | 67.6% |
| 7 | claude-opus-4-7_unknown · proprietary | 62.9% |
| 8 | gemini-2.5-pro-preview-06-05 · proprietary | 62.4% |
| 9 | claude-opus-4-5-20251101 · proprietary | 62.0% |
| 10 | claude-opus-4-5-20251101_unknown · proprietary | 62.0% |
| 11 | gpt-5-pro-2025-10-06_high · proprietary | 61.6% |
| 12 | gpt-5-pro-2025-10-06_unknown · proprietary | 61.6% |
| 13 | deepseek-v4-pro_unknown · proprietary | 61.2% |
| 14 | gemini-3-flash-preview · proprietary | 61.1% |
| 15 | grok-4-0709 · proprietary | 60.5% |
| 16 | claude-opus-4-1-20250805 · proprietary | 60.0% |
| 17 | claude-opus-4-1-20250805_unknown · proprietary | 60.0% |
| 18 | claude-opus-4-20250514_12K · proprietary | 58.8% |
| 19 | claude-opus-4-20250514_unknown · proprietary | 58.8% |
| 20 | GLM 5.1 · 753.9B | 58.7% |
| 21 | gpt-5.2-pro-2025-12-11 · proprietary | 57.4% |
| 22 | gpt-5-2025-08-07_high · proprietary | 56.7% |
| 23 | grok-4-1-fast-non-reasoning · proprietary | 56.0% |
| 24 | grok-4-1-fast-reasoning · proprietary | 56.0% |
| 25 | claude-sonnet-4-5-20250929_12K · proprietary | 54.3% |
| 26 | claude-sonnet-4-5-20250929_unknown · proprietary | 54.3% |
| 27 | GLM 5 · 753.9B | 53.2% |
| 28 | gpt-5.1-2025-11-13_high · proprietary | 53.2% |
| 29 | o3-2025-04-16_high · proprietary | 53.1% |
| 30 | DeepSeek-V3.2-Speciale · proprietary | 52.6% |
| 31 | gemini-2.5-pro-exp-03-25 · proprietary | 51.6% |
| 32 | gemini-2.5-pro-preview-03-25 · proprietary | 51.6% |
| 33 | GLM 4.7 · 358.3B | 47.7% |
| 34 | zai-org/glm-4.7 · proprietary | 47.7% |
| 35 | kimi-k2.5 · proprietary | 46.8% |
| 36 | claude-3-7-sonnet-20250219_12K · proprietary | 46.4% |
| 37 | claude-3-7-sonnet-20250219_unknown · proprietary | 46.4% |
| 38 | gpt-5.2-2025-12-11_high · proprietary | 45.8% |
| 39 | gpt-5.2-2025-12-11_unknown · proprietary | 45.8% |
| 40 | claude-sonnet-4-20250514_12K · proprietary | 45.5% |
| 41 | claude-sonnet-4-20250514_unknown · proprietary | 45.5% |
| 42 | claude-3-7-sonnet-20250219 · proprietary | 44.9% |
| 43 | o1-preview-2024-09-12 · proprietary | 41.7% |
| 44 | claude-3-5-sonnet-20241022 · proprietary | 41.4% |
| 45 | gemini-2.5-flash · proprietary | 41.2% |
| 46 | DeepSeek R1 0528 · 684.5B | 40.8% |
| 47 | o1-2024-12-17_high · proprietary | 40.1% |
| 48 | DeepSeek-V3.1 · proprietary | 40.0% |
| 49 | o4-mini-2025-04-16_high · proprietary | 38.7% |
| 50 | o1-2024-12-17_medium · proprietary | 36.7% |
| 51 | grok-3 · proprietary | 36.1% |
| 52 | gpt-4.5-preview-2025-02-27 · proprietary | 34.5% |
| 53 | gemini-exp-1206 · proprietary | 31.1% |
| 54 | Qwen3 235B A22B · 235.1B | 31.0% |
| 55 | DeepSeek R1 · 684.5B | 30.9% |
| 56 | gemini-2.0-flash-thinking-exp-01-21 · proprietary | 30.7% |
| 57 | Llama-4-Maverick-17B-128E-Instruct · proprietary | 27.7% |
| 58 | claude-3-5-sonnet-20240620 · proprietary | 27.5% |
| 59 | DeepSeek v3 0324 · 684.5B | 27.2% |
| 60 | gemini-1.5-pro-002 · proprietary | 27.1% |
| 61 | gpt-4.1-2025-04-14 · proprietary | 27.0% |
| 62 | Kimi K2 Instruct · 1026.5B | 26.3% |
| 63 | gpt-4-turbo-2024-04-09 · proprietary | 25.1% |
| 64 | claude-3-opus-20240229 · proprietary | 23.5% |
| 65 | Llama-3.1-405B-Instruct · proprietary | 23.0% |
| 66 | o3-mini-2025-01-31_high · proprietary | 22.8% |
| 67 | grok-2-1212 · proprietary | 22.7% |
| 68 | mistral-large-2407 · proprietary | 22.5% |
| 69 | GPT OSS 120B · 120.4B | 22.1% |
| 70 | Llama 3.3 70B Instruct · 70.6B | 19.9% |
| 71 | DeepSeek-V3 · proprietary | 18.9% |
| 72 | gemini-2.0-flash-exp · proprietary | 18.9% |
| 73 | o1-mini-2024-09-12_medium · proprietary | 18.1% |
| 74 | gpt-4o-2024-08-06 · proprietary | 17.8% |
| 75 | c4ai-command-r-plus-08-2024 · proprietary | 17.4% |
| 76 | gpt-4o-mini-2024-07-18 · proprietary | 10.7% |
Score vs model size
Which models give the most quality for their size — the ones worth running locally.
- Llama 3.3 70B Instruct, 71B, score 19.9% — on the efficiency frontier (best score at its size or smaller).
- GPT OSS 120B, 120B, score 22.1% — on the efficiency frontier (best score at its size or smaller).
- Qwen3 235B A22B, 235B, score 31.0% — on the efficiency frontier (best score at its size or smaller).
- GLM 4.7, 358B, score 47.7% — on the efficiency frontier (best score at its size or smaller).
- GLM 5.1, 754B, score 58.7% — on the efficiency frontier (best score at its size or smaller).
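The frontier rule quoted above ("best score at its size or smaller") can be sketched in a few lines of Python. This is a minimal illustration, not llmrun's actual code: the `Entry` type and `efficiency_frontier` function are hypothetical names, the sizes and scores are copied from the open-model rows of the table, and dropping a larger model that only ties the best score so far is an assumption about how ties are treated.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    size_b: float  # parameter count in billions, as listed in the table
    score: float   # SimpleBench score in percent


def efficiency_frontier(entries):
    """Return the entries that outscore every model of equal or smaller size."""
    frontier = []
    best = float("-inf")
    # Walk models from smallest to largest; within one size, best score first.
    for e in sorted(entries, key=lambda e: (e.size_b, -e.score)):
        # Assumption: a tie with a smaller model does not earn a frontier spot.
        if e.score > best:
            frontier.append(e)
            best = e.score
    return frontier


# The 10 open models from the leaderboard table above.
OPEN_MODELS = [
    Entry("GLM 5.1", 753.9, 58.7),
    Entry("GLM 5", 753.9, 53.2),
    Entry("GLM 4.7", 358.3, 47.7),
    Entry("DeepSeek R1 0528", 684.5, 40.8),
    Entry("Qwen3 235B A22B", 235.1, 31.0),
    Entry("DeepSeek R1", 684.5, 30.9),
    Entry("DeepSeek v3 0324", 684.5, 27.2),
    Entry("Kimi K2 Instruct", 1026.5, 26.3),
    Entry("GPT OSS 120B", 120.4, 22.1),
    Entry("Llama 3.3 70B Instruct", 70.6, 19.9),
]

frontier_names = [e.name for e in efficiency_frontier(OPEN_MODELS)]
print(frontier_names)
```

Under these assumptions the sketch selects the same five models as the list above. The DeepSeek models (684.5B) drop out because the smaller GLM 4.7 already scores higher, and Kimi K2 Instruct drops out for the same reason against every smaller frontier model.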
SimpleBench: frequently asked questions
- What is the best open LLM on SimpleBench?
- GLM 5.1 is the top open model on SimpleBench, scoring 58.7%. Among all models tested — including proprietary ones — it ranks #20.
- Can open models match proprietary models on SimpleBench?
- Not yet. The strongest proprietary model (gemini-3.1-pro-preview) scores 79.6%, which is 20.9 percentage points ahead of the best open model (GLM 5.1) at 58.7%. The open model, however, is one you can run yourself.
Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.