Reasoning
GPQA Diamond Leaderboard
GPQA Diamond is a set of extremely hard, graduate-level science questions (physics, chemistry, biology) written by domain experts and filtered so that skilled non-experts with web access still fail. It is designed to test reasoning rather than recall of memorized facts.
Source: epoch · 28 open models ranked + 141 proprietary · Data through May 2026
Open models ranked on GPQA Diamond
The # column shows rank among open models / rank overall (including proprietary models).
| # | Model | Score |
|---|---|---|
| 1 / 18 | GLM 5 · 753.9B | 87.8% |
| 2 / 27 | GLM 5.1 · 753.9B | 85.5% |
| 3 / 39 | GLM 4.7 · 358.3B | 83.3% |
| 4 / 46 | Qwen3 235B A22B Thinking 2507 · 235.1B | 80.0% |
| 5 / 59 | DeepSeek R1 0528 · 684.5B | 76.3% |
| 6 / 73 | Qwen3 235B A22B · 235.1B | 70.7% |
| 7 / 75 | DeepSeek R1 · 684.5B | 69.2% |
| 8 / 78 | DeepSeek v3 0324 · 684.5B | 67.6% |
| 9 / 98 | Phi 4 · 14.7B | 56.1% |
| 10 / 99 | DeepSeek R1 Distill Llama 70B · 70B | 55.7% |
| 11 / 103 | Llama 4 Scout 17B 16E Instruct · 108.6B | 51.8% |
| 12 / 108 | Qwen2.5 72B Instruct · 72.7B | 49.1% |
| 13 / 112 | Gemma 3 27B IT · 27.4B | 48.9% |
| 14 / 113 | Magistral Small 2506 · 23.6B | 48.4% |
| 15 / 117 | Llama 3.3 70B Instruct · 70.6B | 47.4% |
| 16 / 122 | Qwen2.5 32B Instruct · 32B | 46.1% |
| 17 / 125 | DeepSeek R1 Distill Qwen 14B · 14.8B | 44.7% |
| 18 / 126 | Llama 3.1 70B Instruct · 70.6B | 44.2% |
| 19 / 134 | Meta Llama 3 70B Instruct · 70.6B | 40.6% |
| 20 / 140 | Gemma 2 27B IT · 27.2B | 36.5% |
| 21 / 150 | Yi 1.5 34B Chat · 34.4B | 32.0% |
| 22 / 153 | Mixtral 8x7B Instruct v0.1 · 46.7B | 30.6% |
| 23 / 159 | Gemma 2 9B IT · 9.2B | 27.5% |
| 24 / 162 | Llama 2 70B Chat HF · 69.0B | 26.3% |
| 25 / 163 | Meta Llama 3 8B Instruct · 8.0B | 26.1% |
| 26 / 164 | Llama 3.1 8B Instruct · 8.0B | 25.9% |
| 27 / 166 | DeepSeek LLM 67B Chat · 67B | 24.6% |
| 28 / 167 | Mistral 7B Instruct v0.3 · 7.2B | 15.2% |
Score vs model size
These models give the most quality for their size, which makes them the ones worth running locally.
- Mistral 7B Instruct v0.3, 7B, score 15.2% — on the efficiency frontier (best score at its size or smaller).
- Meta Llama 3 8B Instruct, 8B, score 26.1% — on the efficiency frontier (best score at its size or smaller).
- Gemma 2 9B IT, 9B, score 27.5% — on the efficiency frontier (best score at its size or smaller).
- Phi 4, 15B, score 56.1% — on the efficiency frontier (best score at its size or smaller).
- Qwen3 235B A22B Thinking 2507, 235B, score 80.0% — on the efficiency frontier (best score at its size or smaller).
- GLM 4.7, 358B, score 83.3% — on the efficiency frontier (best score at its size or smaller).
- GLM 5, 754B, score 87.8% — on the efficiency frontier (best score at its size or smaller).
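The frontier rule stated in the list ("best score at its size or smaller") can be reproduced from the table with a single sorted pass. The following is a minimal sketch, not the site's own implementation: the function name `efficiency_frontier` and the tuple layout are illustrative assumptions, and the sizes and scores are copied from the leaderboard table.

```python
# (name, parameters in billions, GPQA Diamond score in %), copied from the table.
MODELS = [
    ("GLM 5", 753.9, 87.8),
    ("GLM 5.1", 753.9, 85.5),
    ("GLM 4.7", 358.3, 83.3),
    ("Qwen3 235B A22B Thinking 2507", 235.1, 80.0),
    ("DeepSeek R1 0528", 684.5, 76.3),
    ("Qwen3 235B A22B", 235.1, 70.7),
    ("DeepSeek R1", 684.5, 69.2),
    ("DeepSeek v3 0324", 684.5, 67.6),
    ("Phi 4", 14.7, 56.1),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 55.7),
    ("Llama 4 Scout 17B 16E Instruct", 108.6, 51.8),
    ("Qwen2.5 72B Instruct", 72.7, 49.1),
    ("Gemma 3 27B IT", 27.4, 48.9),
    ("Magistral Small 2506", 23.6, 48.4),
    ("Llama 3.3 70B Instruct", 70.6, 47.4),
    ("Qwen2.5 32B Instruct", 32.0, 46.1),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 44.7),
    ("Llama 3.1 70B Instruct", 70.6, 44.2),
    ("Meta Llama 3 70B Instruct", 70.6, 40.6),
    ("Gemma 2 27B IT", 27.2, 36.5),
    ("Yi 1.5 34B Chat", 34.4, 32.0),
    ("Mixtral 8x7B Instruct v0.1", 46.7, 30.6),
    ("Gemma 2 9B IT", 9.2, 27.5),
    ("Llama 2 70B Chat HF", 69.0, 26.3),
    ("Meta Llama 3 8B Instruct", 8.0, 26.1),
    ("Llama 3.1 8B Instruct", 8.0, 25.9),
    ("Deepseek Llm 67B Chat", 67.0, 24.6),
    ("Mistral 7B Instruct v0.3", 7.2, 15.2),
]


def efficiency_frontier(models):
    """Return names of models that outscore every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Sort by size ascending; on a size tie, put the higher score first so the
    # weaker model of a same-size pair is never counted.
    for name, size_b, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append(name)
            best_so_far = score
    return frontier


if __name__ == "__main__":
    for name in efficiency_frontier(MODELS):
        print(name)
```

Run against the 28 rows of the table, this yields the same seven models as the list, ordered from smallest to largest.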
GPQA Diamond: frequently asked questions
- What is the best open LLM on GPQA Diamond?
- GLM 5 is the top open model on GPQA Diamond, scoring 87.8%. Among all models tested — including proprietary ones — it ranks #18.
- What's the best GPQA Diamond model you can run on a 24 GB GPU?
- Phi 4 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 56.1% on GPQA Diamond.
- What's the best GPQA Diamond model you can run on a 12 GB GPU?
- Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 56.1% on GPQA Diamond.
- Can open models match proprietary models on GPQA Diamond?
- Not quite: the strongest proprietary model (gpt-5.4-pro-2026-03-05_xhigh) scores 94.6%, which is 6.8 percentage points ahead of the best open model (GLM 5) at 87.8%. The open model, however, is one you can run yourself.
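The "fits in N GB at 4-bit quantization" answers above rest on simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. The sketch below is a rough estimate under stated assumptions, not llmrun's actual method: the 10% `overhead` factor is a guess chosen to land near the "about 8 GB" quoted for Phi 4, the estimate ignores KV cache and activation memory, and the model list is only a subset of the leaderboard.

```python
# (name, parameters in billions, GPQA Diamond score in %): a subset of the table.
MODELS = [
    ("GLM 5", 753.9, 87.8),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 55.7),
    ("Phi 4", 14.7, 56.1),
    ("Gemma 3 27B IT", 27.4, 48.9),
    ("Qwen2.5 32B Instruct", 32.0, 46.1),
    ("Gemma 2 9B IT", 9.2, 27.5),
    ("Mistral 7B Instruct v0.3", 7.2, 15.2),
]


def est_weights_gb(params_billion, bits=4, overhead=1.1):
    """Rough weight footprint in GB.

    params_billion * bits / 8 gives gigabytes of raw weights; `overhead` is an
    assumed fudge factor for quantization metadata and runtime buffers.
    """
    return params_billion * bits / 8 * overhead


def best_fit(models, vram_gb, bits=4):
    """Return the highest-scoring model whose estimated footprint fits, or None."""
    fitting = [m for m in models if est_weights_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)


if __name__ == "__main__":
    for budget in (12, 24):
        print(budget, "GB ->", best_fit(MODELS, budget))
```

Under these assumptions Phi 4 comes out at roughly 8 GB and is selected for both the 12 GB and 24 GB budgets, since the 27B and 32B models that fit in 24 GB score lower and the 70B-class models do not fit at all.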
Scores aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.