SimpleQA Leaderboard
SimpleQA measures factual accuracy on short, fact-seeking questions with a single correct answer — directly probing how often a model is right versus confidently wrong (hallucination) on simple facts.
Source: epoch · 4 open models ranked + 47 proprietary · Data through May 2026
All models ranked on SimpleQA
Proprietary (closed) models are labeled "proprietary" in the table. You can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | gemini-3.1-pro-preview · proprietary | 77.3% |
| 2 | gemini-3-pro-preview · proprietary | 72.9% |
| 3 | gemini-3.5-flash_high · proprietary | 68.4% |
| 4 | qwen3-max-2025-09-23 · proprietary | 67.5% |
| 5 | gemini-3-flash-preview · proprietary | 67.4% |
| 6 | muse-spark · proprietary | 66.3% |
| 7 | gpt-5.5-pro-pre-release_xhigh · proprietary | 64.5% |
| 8 | gpt-5.5-pre-release_xhigh · proprietary | 63.1% |
| 9 | qwen3.6-max-preview · proprietary | 56.9% |
| 10 | gemini-2.5-pro · proprietary | 56.0% |
| 11 | o3-2025-04-16_high · proprietary | 53.0% |
| 12 | claude-opus-4-7_xhigh · proprietary | 50.6% |
| 13 | gpt-5-2025-08-07_high · proprietary | 50.6% |
| 14 | Qwen3 235B A22B Thinking 2507 · 235.1B | 50.1% |
| 15 | qwen3.6-plus · proprietary | 49.1% |
| 16 | gpt-5.1-2025-11-13_high · proprietary | 48.9% |
| 17 | grok-4-0709 · proprietary | 47.9% |
| 18 | gpt-5.4-pro-2026-03-05_xhigh · proprietary | 47.8% |
| 19 | claude-opus-4-6_32K · proprietary | 46.5% |
| 20 | gpt-5.4-2026-03-05_xhigh · proprietary | 44.8% |
| 21 | claude-opus-4-6 · proprietary | 43.1% |
| 22 | claude-opus-4-5-20251101_32K · proprietary | 41.8% |
| 23 | claude-opus-4-6_max · proprietary | 41.0% |
| 24 | gpt-5.2-2025-12-11_xhigh · proprietary | 38.9% |
| 25 | kimi-k2.6 · proprietary | 38.7% |
| 26 | gpt-5.2-2025-12-11_high · proprietary | 38.2% |
| 27 | GLM 5.1 · 753.9B | 37.3% |
| 28 | gpt-5.2-2025-12-11_medium · proprietary | 35.4% |
| 29 | claude-opus-4-1-20250805_27K · proprietary | 34.8% |
| 30 | gpt-5.2-2025-12-11_low · proprietary | 34.7% |
| 31 | fireworks/kimi-k2p5 · proprietary | 33.9% |
| 32 | kimi-k2-thinking-turbo · proprietary | 31.6% |
| 33 | GLM 4.7 · 358.3B | 31.5% |
| 34 | claude-sonnet-4-6_32K · proprietary | 29.0% |
| 35 | gpt-5.4-mini-2026-03-17_high · proprietary | 28.6% |
| 36 | deepseek-reasoner · proprietary | 27.5% |
| 37 | DeepSeek R1 0528 · 684.5B | 27.4% |
| 38 | qwen3.5-plus · proprietary | 26.0% |
| 39 | o4-mini-2025-04-16_high · proprietary | 23.9% |
| 40 | claude-sonnet-4-5-20250929_59K · proprietary | 23.6% |
| 41 | qwen3.6-flash · proprietary | 21.2% |
| 42 | grok-3-mini-beta_high · proprietary | 21.1% |
| 43 | gpt-5-mini-2025-08-07_high · proprietary | 21.0% |
| 44 | qwen3.5-flash · proprietary | 19.8% |
| 45 | openai/gpt-oss-120b_high · proprietary | 13.9% |
| 46 | claude-sonnet-4-5-20250929 · proprietary | 13.0% |
| 47 | gpt-5-nano-2025-08-07_high · proprietary | 12.2% |
| 48 | gpt-5.4-nano-2026-03-17_high · proprietary | 12.0% |
| 49 | gemma-4-31b-it · proprietary | 9.6% |
| 50 | claude-3-5-haiku-20241022 · proprietary | 6.7% |
| 51 | claude-haiku-4-5-20251001_32K · proprietary | 5.9% |
SimpleQA: frequently asked questions
- What is the best open LLM on SimpleQA?
- Qwen3 235B A22B Thinking 2507 is the top open model on SimpleQA, scoring 50.1%. Among all models tested — including proprietary ones — it ranks #14.
- Can open models match proprietary models on SimpleQA?
- Not yet. The strongest proprietary model (gemini-3.1-pro-preview) scores 77.3%, which is 27.2 percentage points ahead of the best open model (Qwen3 235B A22B Thinking 2507) at 50.1%. The open model, however, is one you can run yourself.
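
The answers above can be checked against the table with a few lines of Python. This is only a minimal sketch: the rows are a hand-copied subset of the leaderboard, and the helper names (`is_open`, `summarize`) and the rule "a row is open unless its label ends in `proprietary`" are assumptions made for illustration, not part of the source's methodology.

```python
# A subset of rows copied from the leaderboard table: (rank, label, score in %).
ROWS = [
    (1, "gemini-3.1-pro-preview · proprietary", 77.3),
    (14, "Qwen3 235B A22B Thinking 2507 · 235.1B", 50.1),
    (27, "GLM 5.1 · 753.9B", 37.3),
    (33, "GLM 4.7 · 358.3B", 31.5),
    (37, "DeepSeek R1 0528 · 684.5B", 27.4),
]


def is_open(label: str) -> bool:
    """Assumed convention: a row is open unless its label ends in 'proprietary'."""
    return not label.strip().endswith("proprietary")


def summarize(rows):
    """Return the best open model, its overall rank, and its gap to the top score."""
    top = max(rows, key=lambda r: r[2])
    best_open = max((r for r in rows if is_open(r[1])), key=lambda r: r[2])
    return {
        "best_open": best_open[1].split(" · ")[0],  # drop the size suffix
        "rank": best_open[0],
        "gap_points": round(top[2] - best_open[2], 1),  # percentage points
    }


summary = summarize(ROWS)
print(summary)
```

The gap is reported in percentage points (a simple difference of scores), not as a relative percentage.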
Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.