What is the best open LLM on SimpleQA?

DeepSeek V4 Pro is the top open model on SimpleQA, scoring 57.0%. Among all models tested — including proprietary ones — it ranks #12. The top model overall is Gemini 3.1 Pro Preview (Google DeepMind) at 77.3%.

What's the best SimpleQA model you can run on a 24 GB GPU?

Gemma 4 31B IT is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 18 GB), scoring 9.6% on SimpleQA.
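
The "about 18 GB" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not llmrun's actual sizing method: it takes the 32.7B parameter count from the leaderboard table, stores each weight in 4 bits, and adds a hypothetical 10% allowance for quantization metadata and runtime buffers. The function names and the overhead factor are illustrative.

```python
def quantized_size_gb(params_billions, bits_per_weight=4, overhead=1.10):
    """Estimate the memory footprint of a quantized model in GB.

    overhead is an assumed multiplier (10% here) for quantization
    scales and runtime buffers; it is not a measured value.
    """
    # 1 billion parameters at 8 bits per weight is 1 GB of weights
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead


def fits_in_vram(params_billions, vram_gb=24, **kwargs):
    """Return True if the estimated footprint fits in vram_gb."""
    return quantized_size_gb(params_billions, **kwargs) <= vram_gb


if __name__ == "__main__":
    # Parameter counts taken from the leaderboard table
    for name, size in [("Gemma 4 31B IT", 32.7), ("GPT OSS 120B", 120.4)]:
        print(name, round(quantized_size_gb(size), 1), fits_in_vram(size))
```

Under these assumptions, 32.7B parameters come to roughly 16.4 GB of weights and about 18 GB in total, while the next open model up the list, GPT OSS 120B (120.4B), would need about 66 GB and does not fit. KV-cache memory for long contexts is not included in the estimate.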

Can open models match proprietary models on SimpleQA?

Not quite. The strongest proprietary model (Gemini 3.1 Pro Preview) scores 77.3%, which is 20.3 percentage points ahead of the best open model (DeepSeek V4 Pro) at 57.0%. The open model, however, is one you can run yourself.

SimpleQA Leaderboard

Name: SimpleQA — open LLM scores
Creator: epoch

SimpleQA measures factual accuracy on short, fact-seeking questions with a single correct answer — directly probing how often a model is right versus confidently wrong (hallucination) on simple facts.
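
As a rough illustration of the "right versus confidently wrong" distinction, the sketch below assumes the three-way grading used in the original SimpleQA benchmark (correct, incorrect, not attempted). The grade strings, the function name, and the toy data are made up for illustration, and this page does not state which of these metrics the leaderboard score corresponds to.

```python
from collections import Counter


def summarize_grades(grades):
    """Summarize per-question grades into rates.

    grades: list of strings, each one of "correct", "incorrect",
    or "not_attempted" (hypothetical labels for this sketch).
    """
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # share of all questions answered correctly
        "correct": counts["correct"] / total,
        # share answered wrongly, i.e. confidently wrong
        "incorrect": counts["incorrect"] / total,
        # share where the model declined to answer
        "not_attempted": counts["not_attempted"] / total,
        # accuracy restricted to questions the model tried
        "correct_given_attempted": (
            counts["correct"] / attempted if attempted else 0.0
        ),
    }


if __name__ == "__main__":
    toy = ["correct"] * 5 + ["incorrect"] * 3 + ["not_attempted"] * 2
    print(summarize_grades(toy))
```

Separating "incorrect" from "not attempted" is what lets a benchmark of this kind distinguish a model that abstains when unsure from one that answers wrongly.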

Source: epoch · 11 open models ranked + 54 proprietary · Data through Jul 2026

All models ranked on SimpleQA

Proprietary / closed models are labeled "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.

#	Model · parameters (or proprietary)	Score
1	Gemini 3.1 Pro Preview · proprietary	77.3%
2	Gemini 3 Pro Preview · proprietary	72.9%
3	GPT 5.6 Sol Max · proprietary	71.6%
4	Gemini 3.5 Flash (high) · proprietary	68.4%
5	Claude Fable 5 (xhigh) · proprietary	68.3%
6	Qwen3 Max (Sep 23, 2025) · proprietary	67.5%
7	Gemini 3 Flash Preview · proprietary	67.4%
8	Muse Spark · proprietary	66.3%
9	GPT 5.5 Pro Pre Release (xhigh) · proprietary	64.5%
10	GPT 5.5 Pre Release (xhigh) · proprietary	63.1%
11	Qwen3.7 Max · proprietary	58.5%
12	DeepSeek V4 Pro · 861.6B	57.0%
13	Qwen3.6 Max Preview · proprietary	56.9%
14	Gemini 2.5 Pro · proprietary	56.0%
15	Grok 4.5 (high) · proprietary	53.5%
16	O3 (Apr 16, 2025, high) · proprietary	53.0%
17	Claude Opus 4.7 (xhigh) · proprietary	50.6%
18	GPT 5 (Aug 07, 2025, high) · proprietary	50.6%
19	Qwen3 235B A22B Thinking 2507 · 235.1B	50.1%
20	Qwen3.6 Plus · proprietary	49.1%
21	GPT 5.1 (Nov 13, 2025, high) · proprietary	48.9%
22	Grok 4 (Jul 09) · proprietary	47.9%
23	GPT 5.4 Pro (Mar 05, 2026, xhigh) · proprietary	47.8%
24	Claude Opus 4.6 (32K) · proprietary	46.5%
25	GPT 5.4 (Mar 05, 2026, xhigh) · proprietary	44.8%
26	Claude Opus 4.6 · proprietary	43.1%
27	GPT 5.6 Terra Max · proprietary	43.1%
28	Kimi K3 Max · proprietary	42.7%
29	Claude Opus 4.5 (Nov 01, 2025, 32K) · proprietary	41.8%
30	GPT 5.6 Luna Max · proprietary	41.7%
31	Claude Opus 4.6 Max · proprietary	41.0%
32	Claude Opus 4.8 Max · proprietary	39.5%
33	Kimi K2.7 Code · 1058.6B	39.2%
34	GPT 5.2 (Dec 11, 2025, xhigh) · proprietary	38.9%
35	Kimi K2.6 · 1058.6B	38.7%
36	GPT 5.2 (Dec 11, 2025, high) · proprietary	38.2%
37	GLM 5.2 Max · proprietary	38.1%
38	Grok 4.3 (high) · proprietary	38.0%
39	GLM 5.1 · 753.9B	37.3%
40	GPT 5.2 (Dec 11, 2025, medium) · proprietary	35.4%
41	Grok 4.20 0309 Reasoning · proprietary	35.1%
42	Claude Opus 4.1 (Aug 05, 2025, 27K) · proprietary	34.8%
43	GPT 5.2 (Dec 11, 2025, low) · proprietary	34.7%
44	Kimi K2.5 · 1058.6B	33.9%
45	Kimi K2 Thinking · 1058.1B	31.6%
46	GLM 4.7 · 358.3B	31.5%
47	Claude Sonnet 4.6 (32K) · proprietary	29.0%
48	GPT 5.4 Mini (Mar 17, 2026, high) · proprietary	28.6%
49	DeepSeek Reasoner · proprietary	27.5%
50	DeepSeek R1 0528 · 684.5B	27.4%
51	Qwen3.5 Plus · proprietary	26.0%
52	Claude Sonnet 5 (xhigh) · proprietary	25.0%
53	O4 Mini (Apr 16, 2025, high) · proprietary	23.9%
54	Claude Sonnet 4.5 (Sep 29, 2025, 59K) · proprietary	23.6%
55	Qwen3.6 Flash · proprietary	21.2%
56	Grok 3 Mini Beta (high) · proprietary	21.1%
57	GPT 5 Mini (Aug 07, 2025, high) · proprietary	21.0%
58	Qwen3.5 Flash · proprietary	19.8%
59	GPT OSS 120B · 120.4B	13.9%
60	Claude Sonnet 4.5 (Sep 29, 2025) · proprietary	13.0%
61	GPT 5 Nano (Aug 07, 2025, high) · proprietary	12.2%
62	GPT 5.4 Nano (Mar 17, 2026, high) · proprietary	12.0%
63	Gemma 4 31B IT · 32.7B	9.6%
64	Claude 3.5 Haiku (Oct 22, 2024) · proprietary	6.7%
65	Claude Haiku 4.5 (Oct 01, 2025, 32K) · proprietary	5.9%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
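
The frontier described above can be recomputed from the table. This is a minimal sketch, assuming the number after each open model's name is its parameter count in billions; the function name and data layout are illustrative, not llmrun's code.

```python
# (name, parameters in billions, SimpleQA score in %) for the open
# models in the leaderboard table
OPEN_MODELS = [
    ("DeepSeek V4 Pro", 861.6, 57.0),
    ("Qwen3 235B A22B Thinking 2507", 235.1, 50.1),
    ("Kimi K2.7 Code", 1058.6, 39.2),
    ("Kimi K2.6", 1058.6, 38.7),
    ("GLM 5.1", 753.9, 37.3),
    ("Kimi K2.5", 1058.6, 33.9),
    ("Kimi K2 Thinking", 1058.1, 31.6),
    ("GLM 4.7", 358.3, 31.5),
    ("DeepSeek R1 0528", 684.5, 27.4),
    ("GPT OSS 120B", 120.4, 13.9),
    ("Gemma 4 31B IT", 32.7, 9.6),
]


def efficiency_frontier(models):
    """Return the models that outscore every model of equal or smaller size."""
    frontier = []
    best_score = float("-inf")
    # Walk from smallest to largest; on equal size, visit the higher score first
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, size, score))
            best_score = score
    return frontier


if __name__ == "__main__":
    for name, size, score in efficiency_frontier(OPEN_MODELS):
        print(f"{name}: {size}B, {score}%")
```

With the 11 open models listed, the frontier consists of Gemma 4 31B IT, GPT OSS 120B, Qwen3 235B A22B Thinking 2507, and DeepSeek V4 Pro. Every other open model is beaten by a smaller one.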

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "About benchmarks" page for what it measures.