Reasoning

GPQA Diamond Leaderboard

GPQA Diamond is a set of extremely hard, graduate-level science questions (physics, chemistry, biology) written by domain experts and filtered so that skilled non-experts with web access still answer most of them incorrectly. The questions are designed to test reasoning rather than recall of memorized facts.

Source: epoch. 28 open models ranked, plus 141 proprietary. Data through May 2026.

Open models ranked on GPQA Diamond

# shows rank among open models / rank overall (including proprietary).

#         Model                           Size    Score
1 / 18    GLM 5                           753.9B  87.8%
2 / 27    GLM 5.1                         753.9B  85.5%
3 / 39    GLM 4.7                         358.3B  83.3%
4 / 46    Qwen3 235B A22B Thinking 2507   235.1B  80.0%
5 / 59    DeepSeek R1 0528                684.5B  76.3%
6 / 73    Qwen3 235B A22B                 235.1B  70.7%
7 / 75    DeepSeek R1                     684.5B  69.2%
8 / 78    DeepSeek v3 0324                684.5B  67.6%
9 / 98    Phi 4                           14.7B   56.1%
10 / 99   DeepSeek R1 Distill Llama 70B   70B     55.7%
11 / 103  Llama 4 Scout 17B 16E Instruct  108.6B  51.8%
12 / 108  Qwen2.5 72B Instruct            72.7B   49.1%
13 / 112  Gemma 3 27B IT                  27.4B   48.9%
14 / 113  Magistral Small 2506            23.6B   48.4%
15 / 117  Llama 3.3 70B Instruct          70.6B   47.4%
16 / 122  Qwen2.5 32B Instruct            32B     46.1%
17 / 125  DeepSeek R1 Distill Qwen 14B    14.8B   44.7%
18 / 126  Llama 3.1 70B Instruct          70.6B   44.2%
19 / 134  Meta Llama 3 70B Instruct       70.6B   40.6%
20 / 140  Gemma 2 27B IT                  27.2B   36.5%
21 / 150  Yi 1.5 34B Chat                 34.4B   32.0%
22 / 153  Mixtral 8x7B Instruct v0.1      46.7B   30.6%
23 / 159  Gemma 2 9B IT                   9.2B    27.5%
24 / 162  Llama 2 70B Chat HF             69.0B   26.3%
25 / 163  Meta Llama 3 8B Instruct        8.0B    26.1%
26 / 164  Llama 3.1 8B Instruct           8.0B    25.9%
27 / 166  Deepseek Llm 67B Chat           67B     24.6%
28 / 167  Mistral 7B Instruct v0.3        7.2B    15.2%

Score vs model size

This chart shows which models score highest for their size, and so are the most worth running locally.

[Scatter plot: GPQA Diamond score (15.2% to 87.8%) against model size on a log scale, with one labeled dot for each of the 28 ranked open models.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier: the best score you can get at each size or smaller. The following models are on the frontier:
  • Mistral 7B Instruct v0.3, 7B, score 15.2%
  • Meta Llama 3 8B Instruct, 8B, score 26.1%
  • Gemma 2 9B IT, 9B, score 27.5%
  • Phi 4, 15B, score 56.1%
  • Qwen3 235B A22B Thinking 2507, 235B, score 80.0%
  • GLM 4.7, 358B, score 83.3%
  • GLM 5, 754B, score 87.8%
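
The efficiency frontier described above can be reproduced with one pass over the models sorted by size. The following is a minimal sketch, not the site's own code: the `Model` type and the `efficiency_frontier` function are hypothetical names introduced here, and the data is copied from the leaderboard table. A model is kept only if it scores strictly higher than every model of equal or smaller size.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # model size, in billions of parameters
    score: float     # GPQA Diamond score, in percent


# Sizes and scores copied from the leaderboard table.
MODELS = [
    Model("GLM 5", 753.9, 87.8),
    Model("GLM 5.1", 753.9, 85.5),
    Model("GLM 4.7", 358.3, 83.3),
    Model("Qwen3 235B A22B Thinking 2507", 235.1, 80.0),
    Model("DeepSeek R1 0528", 684.5, 76.3),
    Model("Qwen3 235B A22B", 235.1, 70.7),
    Model("DeepSeek R1", 684.5, 69.2),
    Model("DeepSeek v3 0324", 684.5, 67.6),
    Model("Phi 4", 14.7, 56.1),
    Model("DeepSeek R1 Distill Llama 70B", 70.0, 55.7),
    Model("Llama 4 Scout 17B 16E Instruct", 108.6, 51.8),
    Model("Qwen2.5 72B Instruct", 72.7, 49.1),
    Model("Gemma 3 27B IT", 27.4, 48.9),
    Model("Magistral Small 2506", 23.6, 48.4),
    Model("Llama 3.3 70B Instruct", 70.6, 47.4),
    Model("Qwen2.5 32B Instruct", 32.0, 46.1),
    Model("DeepSeek R1 Distill Qwen 14B", 14.8, 44.7),
    Model("Llama 3.1 70B Instruct", 70.6, 44.2),
    Model("Meta Llama 3 70B Instruct", 70.6, 40.6),
    Model("Gemma 2 27B IT", 27.2, 36.5),
    Model("Yi 1.5 34B Chat", 34.4, 32.0),
    Model("Mixtral 8x7B Instruct v0.1", 46.7, 30.6),
    Model("Gemma 2 9B IT", 9.2, 27.5),
    Model("Llama 2 70B Chat HF", 69.0, 26.3),
    Model("Meta Llama 3 8B Instruct", 8.0, 26.1),
    Model("Llama 3.1 8B Instruct", 8.0, 25.9),
    Model("Deepseek Llm 67B Chat", 67.0, 24.6),
    Model("Mistral 7B Instruct v0.3", 7.2, 15.2),
]


def efficiency_frontier(models):
    """Return the models that outscore every model of equal or smaller size."""
    frontier = []
    best_score = float("-inf")
    # Smallest first; at equal size, the higher-scoring model goes first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


if __name__ == "__main__":
    for m in efficiency_frontier(MODELS):
        print(f"{m.name}, {m.params_b}B, score {m.score}%")
```

Sorting ties by descending score matters for equal-sized pairs such as GLM 5 and GLM 5.1: the higher-scoring model is seen first, so the other is correctly left off the frontier.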

GPQA Diamond: frequently asked questions

What is the best open LLM on GPQA Diamond?
GLM 5 is the top open model on GPQA Diamond, scoring 87.8%. Among all models tested, proprietary ones included, it ranks #18.
What's the best GPQA Diamond model you can run on a 24 GB GPU?
Phi 4 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 56.1% on GPQA Diamond.
What's the best GPQA Diamond model you can run on a 12 GB GPU?
Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 56.1% on GPQA Diamond.
Can open models match proprietary models on GPQA Diamond?
Not quite. The strongest proprietary model (gpt-5.4-pro-2026-03-05_xhigh) scores 94.6%, which is 6.8 points ahead of the best open model (GLM 5) at 87.8%. The open model, however, is one you can run yourself.
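
The 24 GB and 12 GB answers above rest on a memory estimate for 4-bit quantized weights. The sketch below shows one way such an estimate could be made; it is an illustration under stated assumptions, not the method the site uses. The function names `estimate_gb` and `best_fit` are hypothetical, the 10% overhead factor is an assumption chosen here, and the candidate list is a subset of the leaderboard table. The estimate ignores context-length-dependent memory such as the KV cache.

```python
# Subset of the leaderboard table:
# (name, size in billions of parameters, GPQA Diamond score in percent).
CANDIDATES = [
    ("Phi 4", 14.7, 56.1),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 55.7),
    ("Gemma 3 27B IT", 27.4, 48.9),
    ("Magistral Small 2506", 23.6, 48.4),
    ("Qwen2.5 32B Instruct", 32.0, 46.1),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 44.7),
    ("Gemma 2 9B IT", 9.2, 27.5),
]

# Assumed allowance for runtime buffers; not a figure from the source.
OVERHEAD = 1.1


def estimate_gb(params_b, bits=4, overhead=OVERHEAD):
    """Rough memory need in GB: parameters x bytes per weight x overhead."""
    return params_b * bits / 8 * overhead


def best_fit(vram_gb, candidates=CANDIDATES):
    """Return the highest-scoring candidate whose estimate fits, or None."""
    fitting = [c for c in candidates if estimate_gb(c[1]) <= vram_gb]
    return max(fitting, key=lambda c: c[2], default=None)


if __name__ == "__main__":
    for vram in (12, 24):
        print(vram, "GB:", best_fit(vram))
```

Under these assumptions Phi 4 needs roughly 8 GB, consistent with the figure quoted in the answers, and it is the best fit for both 12 GB and 24 GB because the larger models that fit in 24 GB score lower.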

Scores aggregated from epoch. llmrun does not run this benchmark itself. See the source for methodology, or the "about benchmarks" page for what it measures.