Reasoning

SimpleBench Leaderboard

SimpleBench is a set of everyday, common-sense and trick questions that humans answer easily but language models often get wrong. It probes basic reasoning and robustness rather than specialist knowledge.

Source: epoch · 10 open models ranked · +66 proprietary · Data through Apr 2026

Open models ranked on SimpleBench

The # column shows rank among open models / rank overall (including proprietary models).

#         Model                               Score
1 / 20    GLM 5.1 · 753.9B                    58.7%
2 / 27    GLM 5 · 753.9B                      53.2%
3 / 33    GLM 4.7 · 358.3B                    47.7%
4 / 46    DeepSeek R1 0528 · 684.5B           40.8%
5 / 54    Qwen3 235B A22B · 235.1B            31.0%
6 / 55    DeepSeek R1 · 684.5B                30.9%
7 / 59    DeepSeek v3 0324 · 684.5B           27.2%
8 / 62    Kimi K2 Instruct · 1026.5B          26.3%
9 / 69    GPT OSS 120B · 120.4B               22.1%
10 / 70   Llama 3.3 70B Instruct · 70.6B      19.9%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: SimpleBench score (19.9% to 58.7%) against model size (log scale, 100B to 1T), with one labeled point per model.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Llama 3.3 70B Instruct, 71B, score 19.9% — on the efficiency frontier (best score at its size or smaller).
  • GPT OSS 120B, 120B, score 22.1% — on the efficiency frontier (best score at its size or smaller).
  • Qwen3 235B A22B, 235B, score 31.0% — on the efficiency frontier (best score at its size or smaller).
  • GLM 4.7, 358B, score 47.7% — on the efficiency frontier (best score at its size or smaller).
  • GLM 5.1, 754B, score 58.7% — on the efficiency frontier (best score at its size or smaller).
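
The page does not say how the frontier is computed, so the following is only a minimal sketch. It assumes the definition given in the caption: a model is on the frontier if no model of equal or smaller size scores at least as well. The names `Model`, `MODELS`, and `efficiency_frontier` are hypothetical, and the sizes are the parameter counts from the ranking table rather than the rounded chart labels.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    size_b: float  # parameter count in billions, from the ranking table
    score: float   # SimpleBench score in percent


# Data copied from the leaderboard table above.
MODELS = [
    Model("GLM 5.1", 753.9, 58.7),
    Model("GLM 5", 753.9, 53.2),
    Model("GLM 4.7", 358.3, 47.7),
    Model("DeepSeek R1 0528", 684.5, 40.8),
    Model("Qwen3 235B A22B", 235.1, 31.0),
    Model("DeepSeek R1", 684.5, 30.9),
    Model("DeepSeek v3 0324", 684.5, 27.2),
    Model("Kimi K2 Instruct", 1026.5, 26.3),
    Model("GPT OSS 120B", 120.4, 22.1),
    Model("Llama 3.3 70B Instruct", 70.6, 19.9),
]


def efficiency_frontier(models: list[Model]) -> list[Model]:
    """Return the models that beat every model of equal or smaller size."""
    frontier: list[Model] = []
    best_score = float("-inf")
    # Walk from smallest to largest; among equal sizes, visit the
    # highest-scoring model first so only it can join the frontier.
    for m in sorted(models, key=lambda m: (m.size_b, -m.score)):
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


if __name__ == "__main__":
    for m in efficiency_frontier(MODELS):
        print(f"{m.name}: {m.size_b}B, {m.score}%")
```

On this data the sketch selects the same five models listed above. Sorting by size and tracking a running best score keeps the pass linear after the sort, and the strict comparison excludes models that merely tie a smaller one.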

SimpleBench: frequently asked questions

What is the best open LLM on SimpleBench?
GLM 5.1 is the top open model on SimpleBench, scoring 58.7%. Among all models tested, including proprietary ones, it ranks #20.

Can open models match proprietary models on SimpleBench?
Not yet. The strongest proprietary model (gemini-3.1-pro-preview) scores 79.6%, which is 20.9 percentage points ahead of the best open model (GLM 5.1) at 58.7%. The open model, however, is one you can run yourself.

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.