What is the best open LLM on GPQA Diamond?

Kimi K2.6 is the top open model on GPQA Diamond, scoring 90.8%. Among all models tested — including proprietary ones — it ranks #17. The top model overall is GPT 5.4 Pro (Mar 05, 2026, xhigh) (OpenAI) at 94.6%.

What's the best GPQA Diamond model you can run on a 24 GB GPU?

Phi 4 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 56.1% on GPQA Diamond.

What's the best GPQA Diamond model you can run on a 12 GB GPU?

Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 56.1% on GPQA Diamond.
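
The "about 8 GB" figure in the two answers above is consistent with a back-of-the-envelope estimate: parameter count times bytes per weight, plus some headroom. The sketch below illustrates that arithmetic and the "best model that fits" selection. The 10% overhead factor and the helper names `estimated_gb` and `best_fit` are assumptions made for illustration, not the leaderboard's published method, and the model list is only a small sample of rows from the table further down.

```python
# Rough memory estimate for quantized weights. The 10% overhead factor is an
# assumption for illustration, not the leaderboard's published formula.
BITS_PER_WEIGHT = 4
OVERHEAD = 1.10  # assumed allowance for quantization scales and runtime buffers


def estimated_gb(params_billions: float, bits: int = BITS_PER_WEIGHT) -> float:
    """Approximate memory in GB: parameters x bytes per weight x overhead."""
    return params_billions * (bits / 8) * OVERHEAD


def best_fit(models, vram_gb):
    """Return the highest-scoring (name, params_b, score) tuple that fits, or None."""
    fitting = [m for m in models if estimated_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


# A few rows copied from the leaderboard: (name, parameters in billions, score %)
MODELS = [
    ("GPT OSS 120B", 120.4, 75.8),
    ("Phi 4", 14.7, 56.1),
    ("Gemma 3 27B IT", 27.4, 48.9),
    ("Magistral Small 2506", 23.6, 48.4),
    ("Eurus 2 7B PRIME", 7.6, 33.9),
]

for budget in (24, 12):
    name, params_b, score = best_fit(MODELS, budget)
    print(f"{budget} GB: {name} (~{estimated_gb(params_b):.1f} GB, {score}%)")
```

Under these assumptions Phi 4 is selected for both budgets, because every higher-scoring open model in the table is far too large for either card. Real memory use also depends on context length and the inference runtime, which this estimate ignores.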

Can open models match proprietary models on GPQA Diamond?

Not quite. The strongest proprietary model, GPT 5.4 Pro (Mar 05, 2026, xhigh), scores 94.6%, which is 3.8 points ahead of the best open model, Kimi K2.6, at 90.8%. The open model's advantage is that you can run it yourself.


GPQA Diamond Leaderboard

Name: GPQA Diamond — open LLM scores
Creator: epoch

GPQA Diamond is a set of extremely hard, graduate-level science questions (physics, chemistry, biology) written by domain experts and filtered so that skilled non-experts with web access still fail. It measures genuine reasoning rather than memorization.

Source: epoch · 46 open models ranked + 136 proprietary · Data through Jul 2026


Open models ranked on GPQA Diamond

The # column shows rank among open models / rank overall (including proprietary models). Each model name is followed by its parameter count in billions.

#	Model	Score
1 / 17	Kimi K2.6 · 1058.6B	90.8%
2 / 23	DeepSeek V4 Pro · 861.6B	89.6%
3 / 24	Kimi K2.7 Code · 1058.6B	89.5%
4 / 31	GLM 5 · 753.9B	87.8%
5 / 33	Kimi K2.5 · 1058.6B	87.6%
6 / 40	GLM 5.1 · 753.9B	85.5%
7 / 46	Kimi K2 Thinking · 1058.1B	84.2%
8 / 52	GLM 4.7 · 358.3B	83.3%
9 / 59	Qwen3 235B A22B Thinking 2507 · 235.1B	80.0%
10 / 72	DeepSeek R1 0528 · 684.5B	76.3%
11 / 76	GPT OSS 120B · 120.4B	75.8%
12 / 86	Qwen3 235B A22B · 235.1B	70.7%
13 / 88	DeepSeek R1 · 684.5B	69.2%
14 / 91	DeepSeek v3 0324 · 684.5B	67.6%
15 / 94	Llama 4 Maverick 17B 128E Instruct · 401.6B	67.0%
16 / 109	DeepSeek v3 · 684.5B	56.5%
17 / 111	Phi 4 · 14.7B	56.1%
18 / 112	DeepSeek R1 Distill Llama 70B · 70.6B	55.7%
19 / 116	Llama 4 Scout 17B 16E Instruct · 108.6B	51.8%
20 / 118	Llama 3.1 405B Instruct · 405.9B	50.9%
21 / 121	Qwen2.5 72B Instruct · 72.7B	49.1%
22 / 125	Gemma 3 27B IT · 27.4B	48.9%
23 / 126	Magistral Small 2506 · 23.6B	48.4%
24 / 130	Llama 3.3 70B Instruct · 70.6B	47.4%
25 / 134	Llama 3.1 Tulu 3 70B DPO · 70.6B	46.3%
26 / 135	Qwen2.5 32B Instruct · 32.8B	46.1%
27 / 138	DeepSeek R1 Distill Qwen 14B · 14.8B	44.7%
28 / 139	Llama 3.1 70B Instruct · 70.6B	44.2%
29 / 140	WizardLM 2 8x22B · 140.6B	43.4%
30 / 144	Llama 3.2 90B Vision Instruct · 88.6B	41.0%
31 / 145	Qwen2 72B Instruct · 72.7B	40.8%
32 / 147	Meta Llama 3 70B Instruct · 70.6B	40.6%
33 / 152	Hermes 2 Theta Llama 3 70B · 70.6B	37.5%
34 / 153	Gemma 2 27B IT · 27.2B	36.5%
35 / 159	Eurus 2 7B PRIME · 7.6B	33.9%
36 / 163	Yi 1.5 34B Chat · 34.4B	32.0%
37 / 164	Qwen1.5 32B Chat · 32.5B	30.7%
38 / 166	Mixtral 8x7B Instruct v0.1 · 46.7B	30.6%
39 / 169	Qwen1.5 72B Chat · 72.3B	28.8%
40 / 172	Gemma 2 9B IT · 9.2B	27.5%
41 / 175	Llama 2 70B Chat HF · 69.0B	26.3%
42 / 176	Meta Llama 3 8B Instruct · 8.0B	26.1%
43 / 177	Llama 3.1 8B Instruct · 8.0B	25.9%
44 / 179	DeepSeek LLM 67B Chat · 67B	24.6%
45 / 180	Mistral 7B Instruct v0.3 · 7.2B	15.2%
46 / 181	Yi 34B Chat · 34.4B	14.7%
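
For readers who want to work with these rows programmatically, here is a minimal parsing sketch. It assumes the rows are tab-separated exactly as shown above, with the fields `open rank / overall rank`, `name · parameters`, and `score%`. The `Row` type and `parse_row` helper are hypothetical names introduced for this example and are not part of any published tool.

```python
import re
from typing import NamedTuple


class Row(NamedTuple):
    open_rank: int      # rank among open models
    overall_rank: int   # rank among all models, including proprietary
    name: str
    params_b: float     # parameter count in billions
    score: float        # GPQA Diamond score in percent


# Assumed layout: "<open> / <overall>\t<name> · <params>B\t<score>%"
ROW_RE = re.compile(r"^(\d+) / (\d+)\t(.+?) · ([\d.]+)B\t([\d.]+)%$")


def parse_row(line: str) -> Row:
    """Parse one leaderboard row, raising ValueError on an unexpected layout."""
    m = ROW_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    open_rank, overall_rank, name, params, score = m.groups()
    return Row(int(open_rank), int(overall_rank), name, float(params), float(score))


sample = [
    "1 / 17\tKimi K2.6 · 1058.6B\t90.8%",
    "17 / 111\tPhi 4 · 14.7B\t56.1%",
]
for line in sample:
    print(parse_row(line))
```

The non-greedy name group lets model names that themselves contain sizes, such as "Qwen3 235B A22B Thinking 2507", parse correctly, since only the final `· <number>B` before the tab is treated as the parameter count.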

Score vs model size

Which models give the most quality for their size? Those are the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
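
The efficiency frontier described above can be sketched as a running maximum over models sorted by size: a model is on the frontier only if it beats every model that is the same size or smaller. This is one plausible reading of the caption, not necessarily how the chart itself is computed. The function name `efficiency_frontier` is hypothetical, and the input is a small sample of rows from the table rather than the full leaderboard.

```python
def efficiency_frontier(models):
    """Return frontier models, smallest first.

    models: iterable of (name, params_b, score) tuples.
    A model is kept only if its score exceeds the best score seen
    among all models of equal or smaller size.
    """
    frontier = []
    best = float("-inf")
    # Sort by size ascending; for equal sizes, consider the higher score first.
    for name, params_b, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best:
            frontier.append((name, params_b, score))
            best = score
    return frontier


# Sample rows from the leaderboard: (name, parameters in billions, score %)
SAMPLE = [
    ("Kimi K2.6", 1058.6, 90.8),
    ("GLM 4.7", 358.3, 83.3),
    ("Qwen3 235B A22B Thinking 2507", 235.1, 80.0),
    ("GPT OSS 120B", 120.4, 75.8),
    ("Gemma 3 27B IT", 27.4, 48.9),
    ("Phi 4", 14.7, 56.1),
    ("Llama 3.1 8B Instruct", 8.0, 25.9),
    ("Eurus 2 7B PRIME", 7.6, 33.9),
    ("Mistral 7B Instruct v0.3", 7.2, 15.2),
]

for name, params_b, score in efficiency_frontier(SAMPLE):
    print(f"{params_b:>7.1f}B  {score:>5.1f}%  {name}")
```

In this sample, Gemma 3 27B IT and Llama 3.1 8B Instruct fall below the frontier because a smaller model (Phi 4 and Eurus 2 7B PRIME respectively) already scores higher.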


Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.