What is the best open LLM on GPQA Diamond?

Kimi K2.6 is the top open model on GPQA Diamond, scoring 90.8%. Among all models tested — including proprietary ones — it ranks #17. The top model overall is GPT 5.4 Pro (Mar 05, 2026, xhigh) (OpenAI) at 94.6%.

What's the best GPQA Diamond model you can run on a 24 GB GPU?

Phi 4 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 56.1% on GPQA Diamond.

What's the best GPQA Diamond model you can run on a 12 GB GPU?

Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 56.1% on GPQA Diamond.

Can open models match proprietary models on GPQA Diamond?

Not quite. The strongest proprietary model, GPT 5.4 Pro (Mar 05, 2026, xhigh), scores 94.6%, which is 3.8 percentage points ahead of the best open model, Kimi K2.6, at 90.8%. The open model, however, is one you can run yourself.
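
The 24 GB and 12 GB answers above rest on a memory estimate for 4-bit weights. The page does not state its exact formula, so the following is only a rough sketch: it assumes weights take `bits / 8` bytes per parameter and adds a hypothetical 10% overhead factor, which happens to land Phi 4 (14.7B) near the quoted "about 8 GB". The function names and the overhead value are illustrative assumptions, and the estimate ignores KV cache and context length. The model list is a small subset of the open models in the table.

```python
# Rough VRAM estimate for quantized weights.
# ASSUMPTION: the 1.1 overhead factor is illustrative, not taken from the source page.
def estimate_vram_gb(params_b: float, bits: int = 4, overhead: float = 1.1) -> float:
    """Approximate footprint in GB for a model with `params_b` billion parameters."""
    weight_gb = params_b * bits / 8  # 1e9 params * (bits / 8) bytes each = GB
    return weight_gb * overhead


# (name, parameters in billions, GPQA Diamond score in %), copied from the table.
OPEN_MODELS = [
    ("Eurus 2 7B PRIME", 7.6, 33.9),
    ("Phi 4", 14.7, 56.1),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 44.7),
    ("Gemma 3 27B IT", 27.4, 48.9),
    ("Qwen2.5 32B Instruct", 32.8, 46.1),
    ("DeepSeek R1 Distill Llama 70B", 70.6, 55.7),
    ("GPT OSS 120B", 120.4, 75.8),
]


def best_fit(models, vram_gb: float):
    """Return the highest-scoring model whose estimated footprint fits, or None."""
    fitting = [m for m in models if estimate_vram_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)


if __name__ == "__main__":
    for budget in (12, 24):
        print(budget, "GB ->", best_fit(OPEN_MODELS, budget))
```

Under these assumptions both budgets select Phi 4: the larger models that fit in 24 GB (Gemma 3 27B IT, Qwen2.5 32B Instruct) score lower, and the next higher-scoring open model is too large for either card.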

GPQA Diamond Leaderboard

Name: GPQA Diamond — open LLM scores
Creator: epoch

GPQA Diamond is a set of extremely hard, graduate-level multiple-choice science questions (physics, chemistry, biology) written by domain experts and filtered so that skilled non-experts with web access still mostly answer incorrectly. It is designed to test reasoning rather than memorization.

Source: epoch · 46 open models ranked + 136 proprietary · Data through Jul 2026

All models ranked on GPQA Diamond

Proprietary (closed) models are labeled "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.

#	Model	Score
1	GPT 5.4 Pro (Mar 05, 2026, xhigh) · proprietary	94.6%
2	Gemini 3.1 Pro Preview · proprietary	94.1%
3	GPT 5.5 Pre Release (xhigh) · proprietary	94.0%
4	GPT 5.5 Pro Pre Release (xhigh) · proprietary	93.9%
5	GPT 5.6 Sol Max · proprietary	93.5%
6	Grok 4.5 (high) · proprietary	93.4%
7	GPT 5.6 Terra Max · proprietary	93.3%
8	GPT 5.4 (Mar 05, 2026, xhigh) · proprietary	93.3%
9	Kimi K3 Max · proprietary	93.1%
10	Gemini 3.5 Flash (high) · proprietary	92.8%
11	Gemini 3 Pro Preview · proprietary	92.6%
12	GLM 5.2 Max · proprietary	91.9%
13	GPT 5.6 Luna Max · proprietary	91.6%
14	Qwen3.7 Max · proprietary	91.6%
15	GPT 5.2 (Dec 11, 2025, xhigh) · proprietary	91.4%
16	Claude Opus 4.8 Max · proprietary	91.0%
17	Kimi K2.6 · 1058.6B	90.8%
18	GPT 5.5 (low) · proprietary	90.7%
19	Claude Opus 4.6 (32K) · proprietary	90.5%
20	Claude Sonnet 5 (xhigh) · proprietary	90.5%
21	Claude Opus 4.7 (xhigh) · proprietary	90.1%
22	Muse Spark · proprietary	89.8%
23	DeepSeek V4 Pro · 861.6B	89.6%
24	Kimi K2.7 Code · 1058.6B	89.5%
25	Grok 4.20 0309 Reasoning · proprietary	89.3%
26	Qwen3.6 Max Preview · proprietary	89.1%
27	Grok 4.3 (high) · proprietary	88.8%
28	Claude Opus 4.6 (64K) · proprietary	88.8%
29	GPT 5.2 (Dec 11, 2025, high) · proprietary	88.2%
30	GPT 5.2 (Dec 11, 2025, medium) · proprietary	87.9%
31	GLM 5 · 753.9B	87.8%
32	GPT 5.1 (Nov 13, 2025, high) · proprietary	87.6%
33	Kimi K2.5 · 1058.6B	87.6%
34	Claude Sonnet 4.6 (32K) · proprietary	87.4%
35	Qwen3.6 Plus · proprietary	87.4%
36	Grok 4 (Jul 09) · proprietary	87.0%
37	GPT 5 (Aug 07, 2025, high) · proprietary	86.2%
38	Claude Opus 4.5 (Nov 01, 2025, 32K) · proprietary	86.1%
39	Claude Opus 4.5 (Nov 01, 2025, 16K) · proprietary	85.5%
40	GLM 5.1 · 753.9B	85.5%
41	GPT 5 (Aug 07, 2025, medium) · proprietary	85.4%
42	Gemini 2.5 Pro · proprietary	85.3%
43	GPT 5.1 (Nov 13, 2025, medium) · proprietary	85.0%
44	Gemini 2.5 Pro Preview (Jun 05) · proprietary	84.9%
45	Qwen3.6 Flash · proprietary	84.4%
46	Kimi K2 Thinking · 1058.1B	84.2%
47	Qwen3.5 Plus · proprietary	84.2%
48	Gemini 2.5 Pro Exp (Mar 25) · proprietary	83.8%
49	Qwen3.5 Flash · proprietary	83.8%
50	GPT 5.4 Mini (Mar 17, 2026, high) · proprietary	83.6%
51	DeepSeek Reasoner · proprietary	83.4%
52	GLM 4.7 · 358.3B	83.3%
53	Gemini 3 Flash Preview · proprietary	83.2%
54	GPT 5.2 (Dec 11, 2025, low) · proprietary	82.7%
55	Claude Sonnet 4.5 (Sep 29, 2025, 59K) · proprietary	82.3%
56	O3 (Apr 16, 2025, high) · proprietary	81.8%
57	Claude Sonnet 4.5 (Sep 29, 2025, 32K) · proprietary	81.7%
58	Claude Opus 4.5 (Nov 01, 2025) · proprietary	80.7%
59	Qwen3 235B A22B Thinking 2507 · 235.1B	80.0%
60	O4 Mini (Apr 16, 2025, high) · proprietary	79.6%
61	Claude Sonnet 4.5 (Sep 29, 2025, 16K) · proprietary	78.8%
62	Claude 3.7 Sonnet (Feb 19, 2025, 64K) · proprietary	78.5%
63	GPT 5.4 Nano (Mar 17, 2026, high) · proprietary	78.5%
64	Claude Sonnet 4 (May 14, 2025, 32K) · proprietary	78.3%
65	Claude Sonnet 4 (May 14, 2025, 59K) · proprietary	77.8%
66	Claude Opus 4.1 (Aug 05, 2025, 16K) · proprietary	77.3%
67	O3 Mini (Jan 31, 2025, high) · proprietary	77.0%
68	Claude 3.7 Sonnet (Feb 19, 2025, 16K) · proprietary	76.8%
69	Claude 3.7 Sonnet (Feb 19, 2025, 32K) · proprietary	76.8%
70	Claude Opus 4.1 (Aug 05, 2025, 27K) · proprietary	76.8%
71	O1 (Dec 17, 2024, high) · proprietary	76.8%
72	DeepSeek R1 0528 · 684.5B	76.3%
73	Claude Opus 4 (May 14, 2025, 16K) · proprietary	76.3%
74	Grok 3 Mini Beta (low) · proprietary	76.3%
75	Claude Sonnet 4 (May 14, 2025, 16K) · proprietary	75.8%
76	GPT OSS 120B · 120.4B	75.8%
77	O1 (Dec 17, 2024, medium) · proprietary	75.8%
78	GPT 5 Mini (Aug 07, 2025, high) · proprietary	75.0%
79	Grok 3 Mini Beta (high) · proprietary	74.6%
80	O3 Mini (Jan 31, 2025, medium) · proprietary	74.3%
81	Claude Sonnet 4.5 (Sep 29, 2025) · proprietary	73.7%
82	Claude Opus 4.1 (Aug 05, 2025) · proprietary	73.2%
83	Qwen3 Max (Sep 23, 2025) · proprietary	72.6%
84	GPT 5 Mini (Aug 07, 2025, medium) · proprietary	71.7%
85	Claude Haiku 4.5 (Oct 01, 2025, 32K) · proprietary	71.2%
86	Qwen3 235B A22B · 235.1B	70.7%
87	GPT 5 Nano (Aug 07, 2025, high) · proprietary	69.4%
88	DeepSeek R1 · 684.5B	69.2%
89	Claude Opus 4 (May 14, 2025) · proprietary	69.2%
90	GPT 4.5 Preview (Feb 27, 2025) · proprietary	68.7%
91	DeepSeek v3 0324 · 684.5B	67.6%
92	Grok 3 Beta · proprietary	67.6%
93	GPT 5 Nano (Aug 07, 2025, medium) · proprietary	67.4%
94	Llama 4 Maverick 17B 128E Instruct · 401.6B	67.0%
95	GPT 4.1 (Apr 14, 2025) · proprietary	66.9%
96	Claude Sonnet 4 (May 14, 2025) · proprietary	66.7%
97	Gemini 2.5 Pro Preview (May 06) · proprietary	66.7%
98	Claude 3.7 Sonnet (Feb 19, 2025) · proprietary	66.0%
99	GPT 4.1 Mini (Apr 14, 2025) · proprietary	65.8%
100	Gemini 2.0 Pro Exp (Feb 05) · proprietary	65.7%
101	QwQ Plus · proprietary	65.4%
102	Gemini 2.0 Flash 001 · proprietary	64.1%
103	O1 Mini (Sep 12, 2024, high) · proprietary	62.4%
104	Claude Haiku 4.5 (Oct 01, 2025) · proprietary	60.5%
105	Mistral Medium 2505 · proprietary	59.5%
106	O1 Mini (Sep 12, 2024, medium) · proprietary	59.5%
107	Gemini 1.5 Pro 002 · proprietary	57.2%
108	Gemini 2.0 Flash Thinking Exp (Jan 21) · proprietary	57.1%
109	DeepSeek v3 · 684.5B	56.5%
110	Qwen Max (Jan 25, 2025) · proprietary	56.1%
111	Phi 4 · 14.7B	56.1%
112	DeepSeek R1 Distill Llama 70B · 70.6B	55.7%
113	Claude 3.5 Sonnet (Oct 22, 2024) · proprietary	55.3%
114	Claude 3.5 Sonnet (Jun 20, 2024) · proprietary	54.0%
115	Grok 2 (Dec 12) · proprietary	53.8%
116	Llama 4 Scout 17B 16E Instruct · 108.6B	51.8%
117	Mistral Large 2411 · proprietary	51.3%
118	Llama 3.1 405B Instruct · 405.9B	50.9%
119	O1 Preview (Sep 12, 2024) · proprietary	50.3%
120	GPT 4o (Aug 06, 2024) · proprietary	49.2%
121	Qwen2.5 72B Instruct · 72.7B	49.1%
122	Mistral Large 2407 · proprietary	49.0%
123	GPT 4.1 Nano (Apr 14, 2025) · proprietary	48.9%
124	GPT 4o (May 13, 2024) · proprietary	48.9%
125	Gemma 3 27B IT · 27.4B	48.9%
126	Magistral Small 2506 · 23.6B	48.4%
127	Qwen Plus (Jan 25, 2025) · proprietary	48.1%
128	GPT 4o (Nov 20, 2024) · proprietary	47.9%
129	Mistral Small 2503 · proprietary	47.5%
130	Llama 3.3 70B Instruct · 70.6B	47.4%
131	Gemini 1.5 Flash 002 · proprietary	47.3%
132	Claude 3 Opus (Feb 29, 2024) · proprietary	47.2%
133	GPT 4 Turbo (Apr 09, 2024) · proprietary	46.6%
134	Llama 3.1 Tulu 3 70B DPO · 70.6B	46.3%
135	Qwen2.5 32B Instruct · 32.8B	46.1%
136	Gemini 1.5 Pro 001 · proprietary	45.9%
137	Mistral Small 2501 · proprietary	45.3%
138	DeepSeek R1 Distill Qwen 14B · 14.8B	44.7%
139	Llama 3.1 70B Instruct · 70.6B	44.2%
140	WizardLM 2 8x22B · 140.6B	43.4%
141	GPT 4 1106 Preview · proprietary	42.4%
142	GPT 4 0125 Preview · proprietary	42.3%
143	Qwen Turbo (Nov 01, 2024) · proprietary	41.8%
144	Llama 3.2 90B Vision Instruct · 88.6B	41.0%
145	Qwen2 72B Instruct · 72.7B	40.8%
146	Claude 3 Sonnet (Feb 29, 2024) · proprietary	40.6%
147	Meta Llama 3 70B Instruct · 70.6B	40.6%
148	Gemini 1.5 Flash 001 · proprietary	40.4%
149	Mistral Large 2402 · proprietary	38.8%
150	Claude 3.5 Haiku (Oct 22, 2024) · proprietary	38.1%
151	GPT 4o Mini (Jul 18, 2024) · proprietary	37.7%
152	Hermes 2 Theta Llama 3 70B · 70.6B	37.5%
153	Gemma 2 27B IT · 27.2B	36.5%
154	Claude 3 Haiku (Mar 07, 2024) · proprietary	36.3%
155	GPT 4 (Mar 14) · proprietary	35.7%
156	Claude 2.0 · proprietary	34.7%
157	Open Mixtral 8x22b · proprietary	34.1%
158	Gemini 1.0 Pro 001 · proprietary	34.0%
159	Eurus 2 7B PRIME · 7.6B	33.9%
160	Claude 2.1 · proprietary	33.0%
161	Gemini 1.5 Flash 8B 001 · proprietary	33.0%
162	Dbrx Instruct · proprietary	32.9%
163	Yi 1.5 34B Chat · 34.4B	32.0%
164	Qwen1.5 32B Chat · 32.5B	30.7%
165	GPT 4 (Jun 13) · proprietary	30.6%
166	Mixtral 8x7B Instruct v0.1 · 46.7B	30.6%
167	Open Mistral Nemo 2407 · proprietary	29.9%
168	Open Mixtral 8x7b · proprietary	29.8%
169	Qwen1.5 72B Chat · 72.3B	28.8%
170	GPT 3.5 Turbo (Nov 06) · proprietary	28.0%
171	Phi 3 Medium 128K Instruct · proprietary	27.6%
172	Gemma 2 9B IT · 9.2B	27.5%
173	GPT 3.5 Turbo (Jan 25) · proprietary	27.2%
174	Ministral 8B 2410 · proprietary	27.2%
175	Llama 2 70B Chat HF · 69.0B	26.3%
176	Meta Llama 3 8B Instruct · 8.0B	26.1%
177	Llama 3.1 8B Instruct · 8.0B	25.9%
178	Ministral 3B 2410 · proprietary	25.3%
179	Deepseek Llm 67B Chat · 67B	24.6%
180	Mistral 7B Instruct v0.3 · 7.2B	15.2%
181	Yi 34B Chat · 34.4B	14.7%
182	Open Mistral 7B · proprietary	13.2%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
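
The efficiency frontier described above can be computed directly from the table. The sketch below is one plausible reading of "the best score you can get at each size or smaller": sort models by size and keep each one that beats every smaller model. The `Model` type and `efficiency_frontier` name are illustrative, and the data is a subset of the open models listed in the table, not the full set.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions
    score: float     # GPQA Diamond score in percent


# Subset of the open models from the ranking table.
MODELS = [
    Model("Mistral 7B Instruct v0.3", 7.2, 15.2),
    Model("Eurus 2 7B PRIME", 7.6, 33.9),
    Model("Meta Llama 3 8B Instruct", 8.0, 26.1),
    Model("Phi 4", 14.7, 56.1),
    Model("Gemma 3 27B IT", 27.4, 48.9),
    Model("DeepSeek R1 Distill Llama 70B", 70.6, 55.7),
    Model("GPT OSS 120B", 120.4, 75.8),
    Model("Qwen3 235B A22B Thinking 2507", 235.1, 80.0),
    Model("GLM 4.7", 358.3, 83.3),
    Model("DeepSeek R1 0528", 684.5, 76.3),
    Model("GLM 5", 753.9, 87.8),
    Model("DeepSeek V4 Pro", 861.6, 89.6),
    Model("Kimi K2.6", 1058.6, 90.8),
]


def efficiency_frontier(models):
    """Return models that score higher than every model of equal or smaller size."""
    frontier = []
    best = float("-inf")
    # Sort by size ascending; among equal sizes, consider the higher score first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best:
            frontier.append(m)
            best = m.score
    return frontier


if __name__ == "__main__":
    for m in efficiency_frontier(MODELS):
        print(f"{m.params_b:>7.1f}B  {m.score:5.1f}%  {m.name}")
```

Models such as Gemma 3 27B IT and DeepSeek R1 0528 drop out because a smaller model in the list already scores higher.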

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.