Reasoning

GPQA Diamond Leaderboard

GPQA Diamond is a set of graduate-level multiple-choice science questions (physics, chemistry, biology) written by domain experts and filtered so that skilled non-experts, even with unrestricted web access, still mostly answer incorrectly. It is designed to test reasoning rather than recall of facts that can be looked up.

Source: epoch · 28 open models ranked + 141 proprietary · Data through May 2026

All models ranked on GPQA Diamond

Proprietary / closed models are marked "proprietary"; open models show their parameter count instead. You can't run the proprietary models locally, but they show where the open field stands.

# | Model | Score
1 | gpt-5.4-pro-2026-03-05_xhigh · proprietary | 94.6%
2 | gemini-3.1-pro-preview · proprietary | 94.1%
3 | gpt-5.5-pre-release_xhigh · proprietary | 94.0%
4 | gpt-5.5-pro-pre-release_xhigh · proprietary | 93.9%
5 | gpt-5.4-2026-03-05_xhigh · proprietary | 93.3%
6 | gemini-3.5-flash_high · proprietary | 92.8%
7 | gemini-3-pro-preview · proprietary | 92.6%
8 | gpt-5.2-2025-12-11_xhigh · proprietary | 91.4%
9 | kimi-k2.6 · proprietary | 90.8%
10 | gpt-5.5_low · proprietary | 90.7%
11 | claude-opus-4-6_32K · proprietary | 90.5%
12 | claude-opus-4-7_xhigh · proprietary | 90.1%
13 | muse-spark · proprietary | 89.8%
14 | qwen3.6-max-preview · proprietary | 89.1%
15 | claude-opus-4-6_64K · proprietary | 88.8%
16 | gpt-5.2-2025-12-11_high · proprietary | 88.2%
17 | gpt-5.2-2025-12-11_medium · proprietary | 87.9%
18 | GLM 5 · 753.9B | 87.8%
19 | gpt-5.1-2025-11-13_high · proprietary | 87.6%
20 | fireworks/kimi-k2p5 · proprietary | 87.6%
21 | claude-sonnet-4-6_32K · proprietary | 87.4%
22 | qwen3.6-plus · proprietary | 87.4%
23 | grok-4-0709 · proprietary | 87.0%
24 | gpt-5-2025-08-07_high · proprietary | 86.2%
25 | claude-opus-4-5-20251101_32K · proprietary | 86.1%
26 | claude-opus-4-5-20251101_16K · proprietary | 85.5%
27 | GLM 5.1 · 753.9B | 85.5%
28 | gpt-5-2025-08-07_medium · proprietary | 85.4%
29 | gemini-2.5-pro · proprietary | 85.3%
30 | gpt-5.1-2025-11-13_medium · proprietary | 85.0%
31 | gemini-2.5-pro-preview-06-05 · proprietary | 84.9%
32 | qwen3.6-flash · proprietary | 84.4%
33 | kimi-k2-thinking-turbo · proprietary | 84.2%
34 | qwen3.5-plus · proprietary | 84.2%
35 | gemini-2.5-pro-exp-03-25 · proprietary | 83.8%
36 | qwen3.5-flash · proprietary | 83.8%
37 | gpt-5.4-mini-2026-03-17_high · proprietary | 83.6%
38 | deepseek-reasoner · proprietary | 83.4%
39 | GLM 4.7 · 358.3B | 83.3%
40 | gemini-3-flash-preview · proprietary | 83.2%
41 | gpt-5.2-2025-12-11_low · proprietary | 82.7%
42 | claude-sonnet-4-5-20250929_59K · proprietary | 82.3%
43 | o3-2025-04-16_high · proprietary | 81.8%
44 | claude-sonnet-4-5-20250929_32K · proprietary | 81.7%
45 | claude-opus-4-5-20251101 · proprietary | 80.7%
46 | Qwen3 235B A22B Thinking 2507 · 235.1B | 80.0%
47 | o4-mini-2025-04-16_high · proprietary | 79.6%
48 | claude-sonnet-4-5-20250929_16K · proprietary | 78.8%
49 | claude-3-7-sonnet-20250219_64K · proprietary | 78.5%
50 | gpt-5.4-nano-2026-03-17_high · proprietary | 78.5%
51 | claude-sonnet-4-20250514_32K · proprietary | 78.3%
52 | claude-sonnet-4-20250514_59K · proprietary | 77.8%
53 | claude-opus-4-1-20250805_16K · proprietary | 77.3%
54 | o3-mini-2025-01-31_high · proprietary | 77.0%
55 | claude-3-7-sonnet-20250219_16K · proprietary | 76.8%
56 | claude-3-7-sonnet-20250219_32K · proprietary | 76.8%
57 | claude-opus-4-1-20250805_27K · proprietary | 76.8%
58 | o1-2024-12-17_high · proprietary | 76.8%
59 | DeepSeek R1 0528 · 684.5B | 76.3%
60 | claude-opus-4-20250514_16K · proprietary | 76.3%
61 | grok-3-mini-beta_low · proprietary | 76.3%
62 | claude-sonnet-4-20250514_16K · proprietary | 75.8%
63 | o1-2024-12-17_medium · proprietary | 75.8%
64 | openai/gpt-oss-120b_high · proprietary | 75.8%
65 | gpt-5-mini-2025-08-07_high · proprietary | 75.0%
66 | grok-3-mini-beta_high · proprietary | 74.6%
67 | o3-mini-2025-01-31_medium · proprietary | 74.3%
68 | claude-sonnet-4-5-20250929 · proprietary | 73.7%
69 | claude-opus-4-1-20250805 · proprietary | 73.2%
70 | qwen3-max-2025-09-23 · proprietary | 72.6%
71 | gpt-5-mini-2025-08-07_medium · proprietary | 71.7%
72 | claude-haiku-4-5-20251001_32K · proprietary | 71.2%
73 | Qwen3 235B A22B · 235.1B | 70.7%
74 | gpt-5-nano-2025-08-07_high · proprietary | 69.4%
75 | DeepSeek R1 · 684.5B | 69.2%
76 | claude-opus-4-20250514 · proprietary | 69.2%
77 | gpt-4.5-preview-2025-02-27 · proprietary | 68.7%
78 | DeepSeek v3 0324 · 684.5B | 67.6%
79 | grok-3-beta · proprietary | 67.6%
80 | gpt-5-nano-2025-08-07_medium · proprietary | 67.4%
81 | Llama-4-Maverick-17B-128E-Instruct-FP8 · proprietary | 67.0%
82 | gpt-4.1-2025-04-14 · proprietary | 66.9%
83 | claude-sonnet-4-20250514 · proprietary | 66.7%
84 | gemini-2.5-pro-preview-05-06 · proprietary | 66.7%
85 | claude-3-7-sonnet-20250219 · proprietary | 66.0%
86 | gpt-4.1-mini-2025-04-14 · proprietary | 65.8%
87 | gemini-2.0-pro-exp-02-05 · proprietary | 65.7%
88 | qwq-plus · proprietary | 65.4%
89 | gemini-2.0-flash-001 · proprietary | 64.1%
90 | o1-mini-2024-09-12_high · proprietary | 62.4%
91 | claude-haiku-4-5-20251001 · proprietary | 60.5%
92 | mistral-medium-2505 · proprietary | 59.5%
93 | o1-mini-2024-09-12_medium · proprietary | 59.5%
94 | gemini-1.5-pro-002 · proprietary | 57.2%
95 | gemini-2.0-flash-thinking-exp-01-21 · proprietary | 57.1%
96 | DeepSeek-V3 · proprietary | 56.5%
97 | qwen-max-2025-01-25 · proprietary | 56.1%
98 | Phi 4 · 14.7B | 56.1%
99 | DeepSeek R1 Distill Llama 70B · 70B | 55.7%
100 | claude-3-5-sonnet-20241022 · proprietary | 55.3%
101 | claude-3-5-sonnet-20240620 · proprietary | 54.0%
102 | grok-2-1212 · proprietary | 53.8%
103 | Llama 4 Scout 17B 16E Instruct · 108.6B | 51.8%
104 | mistral-large-2411 · proprietary | 51.3%
105 | Llama-3.1-405B-Instruct · proprietary | 50.9%
106 | o1-preview-2024-09-12 · proprietary | 50.3%
107 | gpt-4o-2024-08-06 · proprietary | 49.2%
108 | Qwen2.5 72B Instruct · 72.7B | 49.1%
109 | mistral-large-2407 · proprietary | 49.0%
110 | gpt-4.1-nano-2025-04-14 · proprietary | 48.9%
111 | gpt-4o-2024-05-13 · proprietary | 48.9%
112 | Gemma 3 27B IT · 27.4B | 48.9%
113 | Magistral Small 2506 · 23.6B | 48.4%
114 | qwen-plus-2025-01-25 · proprietary | 48.1%
115 | gpt-4o-2024-11-20 · proprietary | 47.9%
116 | mistral-small-2503 · proprietary | 47.5%
117 | Llama 3.3 70B Instruct · 70.6B | 47.4%
118 | gemini-1.5-flash-002 · proprietary | 47.3%
119 | claude-3-opus-20240229 · proprietary | 47.2%
120 | gpt-4-turbo-2024-04-09 · proprietary | 46.6%
121 | Llama-3.1-Tulu-3-70B-DPO · proprietary | 46.3%
122 | Qwen2.5 32B Instruct · 32B | 46.1%
123 | gemini-1.5-pro-001 · proprietary | 45.9%
124 | mistral-small-2501 · proprietary | 45.3%
125 | DeepSeek R1 Distill Qwen 14B · 14.8B | 44.7%
126 | Llama 3.1 70B Instruct · 70.6B | 44.2%
127 | WizardLM-2-8x22B · proprietary | 43.4%
128 | gpt-4-1106-preview · proprietary | 42.4%
129 | gpt-4-0125-preview · proprietary | 42.3%
130 | qwen-turbo-2024-11-01 · proprietary | 41.8%
131 | Llama-3.2-90B-Vision-Instruct · proprietary | 41.0%
132 | qwen2-72b-instruct · proprietary | 40.8%
133 | claude-3-sonnet-20240229 · proprietary | 40.6%
134 | Meta Llama 3 70B Instruct · 70.6B | 40.6%
135 | gemini-1.5-flash-001 · proprietary | 40.4%
136 | mistral-large-2402 · proprietary | 38.8%
137 | claude-3-5-haiku-20241022 · proprietary | 38.1%
138 | gpt-4o-mini-2024-07-18 · proprietary | 37.7%
139 | Hermes-2-Theta-Llama-3-70B · proprietary | 37.5%
140 | Gemma 2 27B IT · 27.2B | 36.5%
141 | claude-3-haiku-20240307 · proprietary | 36.3%
142 | gpt-4-0314 · proprietary | 35.7%
143 | claude-2.0 · proprietary | 34.7%
144 | open-mixtral-8x22b · proprietary | 34.1%
145 | gemini-1.0-pro-001 · proprietary | 34.0%
146 | Eurus-2-7B-PRIME · proprietary | 33.9%
147 | claude-2.1 · proprietary | 33.0%
148 | gemini-1.5-flash-8b-001 · proprietary | 33.0%
149 | dbrx-instruct · proprietary | 32.9%
150 | Yi 1.5 34B Chat · 34.4B | 32.0%
151 | qwen1.5-32b-chat · proprietary | 30.7%
152 | gpt-4-0613 · proprietary | 30.6%
153 | Mixtral 8x7B Instruct v0.1 · 46.7B | 30.6%
154 | open-mistral-nemo-2407 · proprietary | 29.9%
155 | open-mixtral-8x7b · proprietary | 29.8%
156 | qwen1.5-72b-chat · proprietary | 28.8%
157 | gpt-3.5-turbo-1106 · proprietary | 28.0%
158 | Phi-3-medium-128k-instruct · proprietary | 27.6%
159 | Gemma 2 9B IT · 9.2B | 27.5%
160 | gpt-3.5-turbo-0125 · proprietary | 27.2%
161 | ministral-8b-2410 · proprietary | 27.2%
162 | Llama 2 70B Chat HF · 69.0B | 26.3%
163 | Meta Llama 3 8B Instruct · 8.0B | 26.1%
164 | Llama 3.1 8B Instruct · 8.0B | 25.9%
165 | ministral-3b-2410 · proprietary | 25.3%
166 | Deepseek Llm 67B Chat · 67B | 24.6%
167 | Mistral 7B Instruct v0.3 · 7.2B | 15.2%
168 | Yi-34B-Chat · proprietary | 14.7%
169 | open-mistral-7b · proprietary | 13.2%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Chart: GPQA Diamond score (y-axis, 15.2% to 87.8%) vs. model size (x-axis, log scale, ticks at 10B and 100B) for the 28 open models listed above.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Mistral 7B Instruct v0.3, 7B, score 15.2% — on the efficiency frontier (best score at its size or smaller).
  • Meta Llama 3 8B Instruct, 8B, score 26.1% — on the efficiency frontier (best score at its size or smaller).
  • Gemma 2 9B IT, 9B, score 27.5% — on the efficiency frontier (best score at its size or smaller).
  • Phi 4, 15B, score 56.1% — on the efficiency frontier (best score at its size or smaller).
  • Qwen3 235B A22B Thinking 2507, 235B, score 80.0% — on the efficiency frontier (best score at its size or smaller).
  • GLM 4.7, 358B, score 83.3% — on the efficiency frontier (best score at its size or smaller).
  • GLM 5, 754B, score 87.8% — on the efficiency frontier (best score at its size or smaller).
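
The frontier rule described above (a model is on the frontier if no model of equal or smaller size scores higher) can be sketched in a few lines of Python. This is only an illustration of that rule, not the code the page uses to draw the chart; the `Model` type and `efficiency_frontier` function are hypothetical names, and the data is a subset of the open models from the leaderboard, with sizes in billions of parameters.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    size_b: float  # parameter count, in billions
    score: float   # GPQA Diamond accuracy, in percent


def efficiency_frontier(models):
    """Return the models that outscore every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first
    # so that only the best model of a given size can join the frontier.
    for m in sorted(models, key=lambda m: (m.size_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


# Subset of the open models in the leaderboard above.
OPEN_MODELS = [
    Model("GLM 5", 753.9, 87.8),
    Model("GLM 5.1", 753.9, 85.5),
    Model("GLM 4.7", 358.3, 83.3),
    Model("Qwen3 235B A22B Thinking 2507", 235.1, 80.0),
    Model("DeepSeek R1 0528", 684.5, 76.3),
    Model("Phi 4", 14.7, 56.1),
    Model("DeepSeek R1 Distill Llama 70B", 70.0, 55.7),
    Model("Gemma 3 27B IT", 27.4, 48.9),
    Model("DeepSeek R1 Distill Qwen 14B", 14.8, 44.7),
    Model("Gemma 2 9B IT", 9.2, 27.5),
    Model("Meta Llama 3 8B Instruct", 8.0, 26.1),
    Model("Llama 3.1 8B Instruct", 8.0, 25.9),
    Model("Mistral 7B Instruct v0.3", 7.2, 15.2),
]

for m in efficiency_frontier(OPEN_MODELS):
    print(f"{m.name}, {m.size_b}B, score {m.score}%")
```

On this subset the sketch selects the same seven models as the list above. DeepSeek R1 Distill Llama 70B, for example, is left out because the much smaller Phi 4 already scores higher.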

GPQA Diamond: frequently asked questions

What is the best open LLM on GPQA Diamond?
GLM 5 is the top open model on GPQA Diamond, scoring 87.8%. Among all models tested — including proprietary ones — it ranks #18.
What's the best GPQA Diamond model you can run on a 24 GB GPU?
Phi 4 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 56.1% on GPQA Diamond.
What's the best GPQA Diamond model you can run on a 12 GB GPU?
Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 56.1% on GPQA Diamond.
Can open models match proprietary models on GPQA Diamond?
Not quite: the strongest proprietary model (gpt-5.4-pro-2026-03-05_xhigh) scores 94.6%, 6.8 points ahead of the best open model (GLM 5) at 87.8%. The open model, however, is one you can run yourself.
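
The 24 GB and 12 GB answers above come down to simple arithmetic: at 4-bit quantization each parameter takes half a byte, so Phi 4's 14.7B parameters need roughly 7.35 GB for weights alone. The page's "about 8 GB" figure presumably adds some runtime overhead, but how it is computed is not stated. The sketch below therefore treats the overhead as an assumption: `overhead_gb` is a hypothetical allowance (default 1 GB), and `quantized_weight_gb` and `best_fit` are illustrative names, not part of any tool mentioned here.

```python
def quantized_weight_gb(params_b, bits=4):
    """Weights-only footprint in GB: billions of parameters x bits / 8."""
    return params_b * bits / 8


def best_fit(models, vram_gb, bits=4, overhead_gb=1.0):
    """Return the highest-scoring (name, size_b, score) tuple that fits, or None.

    overhead_gb is an assumed allowance for KV cache and runtime buffers;
    real overhead depends on context length and the inference engine.
    """
    fitting = [
        m for m in models
        if quantized_weight_gb(m[1], bits) + overhead_gb <= vram_gb
    ]
    return max(fitting, key=lambda m: m[2], default=None)


# (name, size in billions of parameters, GPQA Diamond score in percent),
# taken from the open models in the leaderboard above.
OPEN_MODELS = [
    ("GLM 5", 753.9, 87.8),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 55.7),
    ("Phi 4", 14.7, 56.1),
    ("Gemma 3 27B IT", 27.4, 48.9),
    ("Magistral Small 2506", 23.6, 48.4),
    ("Qwen2.5 32B Instruct", 32.0, 46.1),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 44.7),
    ("Gemma 2 9B IT", 9.2, 27.5),
    ("Llama 3.1 8B Instruct", 8.0, 25.9),
]

for vram in (24, 12):
    name, size_b, score = best_fit(OPEN_MODELS, vram)
    print(f"{vram} GB: {name} ({quantized_weight_gb(size_b):.2f} GB of weights, {score}%)")
```

Under these assumptions both budgets select Phi 4. Larger models such as Gemma 3 27B IT also fit in 24 GB at 4-bit, but they score lower on this benchmark.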

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.