Reasoning
SimpleBench Leaderboard
SimpleBench is a set of everyday, common-sense and trick questions that humans answer easily but language models often get wrong. It probes basic reasoning and robustness rather than specialist knowledge.
Source: epoch · 10 open models ranked + 66 proprietary · Data through Apr 2026
Open models ranked on SimpleBench
The # column shows rank among open models / rank overall (including proprietary).
| # | Model | Score |
|---|---|---|
| 1 / 20 | GLM 5.1 · 753.9B | 58.7% |
| 2 / 27 | GLM 5 · 753.9B | 53.2% |
| 3 / 33 | GLM 4.7 · 358.3B | 47.7% |
| 4 / 46 | DeepSeek R1 0528 · 684.5B | 40.8% |
| 5 / 54 | Qwen3 235B A22B · 235.1B | 31.0% |
| 6 / 55 | DeepSeek R1 · 684.5B | 30.9% |
| 7 / 59 | DeepSeek v3 0324 · 684.5B | 27.2% |
| 8 / 62 | Kimi K2 Instruct · 1026.5B | 26.3% |
| 9 / 69 | GPT OSS 120B · 120.4B | 22.1% |
| 10 / 70 | Llama 3.3 70B Instruct · 70.6B | 19.9% |
Score vs model size
These models give the most quality for their size, which makes them the ones worth running locally.
- Llama 3.3 70B Instruct, 71B, score 19.9% — on the efficiency frontier (best score at its size or smaller).
- GPT OSS 120B, 120B, score 22.1% — on the efficiency frontier (best score at its size or smaller).
- Qwen3 235B A22B, 235B, score 31.0% — on the efficiency frontier (best score at its size or smaller).
- GLM 4.7, 358B, score 47.7% — on the efficiency frontier (best score at its size or smaller).
- GLM 5.1, 754B, score 58.7% — on the efficiency frontier (best score at its size or smaller).
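The frontier rule stated in the list ("best score at its size or smaller") can be sketched in a few lines of Python. This is only an illustration under assumptions: the data is copied from the table above, the function name `efficiency_frontier` is hypothetical, and the page's actual method may differ, for example in how it breaks ties between models of equal size.

```python
# (name, parameters in billions, SimpleBench score in %), copied from the table above.
MODELS = [
    ("GLM 5.1", 753.9, 58.7),
    ("GLM 5", 753.9, 53.2),
    ("GLM 4.7", 358.3, 47.7),
    ("DeepSeek R1 0528", 684.5, 40.8),
    ("Qwen3 235B A22B", 235.1, 31.0),
    ("DeepSeek R1", 684.5, 30.9),
    ("DeepSeek v3 0324", 684.5, 27.2),
    ("Kimi K2 Instruct", 1026.5, 26.3),
    ("GPT OSS 120B", 120.4, 22.1),
    ("Llama 3.3 70B Instruct", 70.6, 19.9),
]


def efficiency_frontier(models):
    """Return names of models that beat every model of equal or smaller size.

    Hypothetical helper: assumes a model is on the frontier when its score is
    strictly higher than that of any model with the same or fewer parameters.
    """
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first
    # so that only the best model of that size can qualify.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append(name)
            best_so_far = score
    return frontier


if __name__ == "__main__":
    for name in efficiency_frontier(MODELS):
        print(name)
```

Run on the ten open models, this sketch selects the same five models as the list above. DeepSeek R1 0528 is excluded because the smaller GLM 4.7 scores higher, and GLM 5 is excluded because GLM 5.1 scores higher at the same size.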
SimpleBench: frequently asked questions
- What is the best open LLM on SimpleBench?
- GLM 5.1 is the top open model on SimpleBench, scoring 58.7%. Among all models tested — including proprietary ones — it ranks #20.
- Can open models match proprietary models on SimpleBench?
- Not yet. The strongest proprietary model (gemini-3.1-pro-preview) scores 79.6%, which is 20.9 percentage points ahead of the best open model (GLM 5.1) at 58.7%. The open model, however, is one you can run yourself.
Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.