Reasoning
GPQA Diamond Leaderboard
GPQA Diamond is a set of extremely hard, graduate-level science questions (physics, chemistry, biology) written by domain experts and filtered so that skilled non-experts with web access still fail. It is designed to test reasoning rather than memorization.
Source: epoch · 28 open models ranked + 141 proprietary · Data through May 2026
All models ranked on GPQA Diamond
Proprietary / closed models are marked "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.4-pro-2026-03-05_xhigh · proprietary | 94.6% |
| 2 | gemini-3.1-pro-preview · proprietary | 94.1% |
| 3 | gpt-5.5-pre-release_xhigh · proprietary | 94.0% |
| 4 | gpt-5.5-pro-pre-release_xhigh · proprietary | 93.9% |
| 5 | gpt-5.4-2026-03-05_xhigh · proprietary | 93.3% |
| 6 | gemini-3.5-flash_high · proprietary | 92.8% |
| 7 | gemini-3-pro-preview · proprietary | 92.6% |
| 8 | gpt-5.2-2025-12-11_xhigh · proprietary | 91.4% |
| 9 | kimi-k2.6 · proprietary | 90.8% |
| 10 | gpt-5.5_low · proprietary | 90.7% |
| 11 | claude-opus-4-6_32K · proprietary | 90.5% |
| 12 | claude-opus-4-7_xhigh · proprietary | 90.1% |
| 13 | muse-spark · proprietary | 89.8% |
| 14 | qwen3.6-max-preview · proprietary | 89.1% |
| 15 | claude-opus-4-6_64K · proprietary | 88.8% |
| 16 | gpt-5.2-2025-12-11_high · proprietary | 88.2% |
| 17 | gpt-5.2-2025-12-11_medium · proprietary | 87.9% |
| 18 | GLM 5 · 753.9B | 87.8% |
| 19 | gpt-5.1-2025-11-13_high · proprietary | 87.6% |
| 20 | fireworks/kimi-k2p5 · proprietary | 87.6% |
| 21 | claude-sonnet-4-6_32K · proprietary | 87.4% |
| 22 | qwen3.6-plus · proprietary | 87.4% |
| 23 | grok-4-0709 · proprietary | 87.0% |
| 24 | gpt-5-2025-08-07_high · proprietary | 86.2% |
| 25 | claude-opus-4-5-20251101_32K · proprietary | 86.1% |
| 26 | claude-opus-4-5-20251101_16K · proprietary | 85.5% |
| 27 | GLM 5.1 · 753.9B | 85.5% |
| 28 | gpt-5-2025-08-07_medium · proprietary | 85.4% |
| 29 | gemini-2.5-pro · proprietary | 85.3% |
| 30 | gpt-5.1-2025-11-13_medium · proprietary | 85.0% |
| 31 | gemini-2.5-pro-preview-06-05 · proprietary | 84.9% |
| 32 | qwen3.6-flash · proprietary | 84.4% |
| 33 | kimi-k2-thinking-turbo · proprietary | 84.2% |
| 34 | qwen3.5-plus · proprietary | 84.2% |
| 35 | gemini-2.5-pro-exp-03-25 · proprietary | 83.8% |
| 36 | qwen3.5-flash · proprietary | 83.8% |
| 37 | gpt-5.4-mini-2026-03-17_high · proprietary | 83.6% |
| 38 | deepseek-reasoner · proprietary | 83.4% |
| 39 | GLM 4.7 · 358.3B | 83.3% |
| 40 | gemini-3-flash-preview · proprietary | 83.2% |
| 41 | gpt-5.2-2025-12-11_low · proprietary | 82.7% |
| 42 | claude-sonnet-4-5-20250929_59K · proprietary | 82.3% |
| 43 | o3-2025-04-16_high · proprietary | 81.8% |
| 44 | claude-sonnet-4-5-20250929_32K · proprietary | 81.7% |
| 45 | claude-opus-4-5-20251101 · proprietary | 80.7% |
| 46 | Qwen3 235B A22B Thinking 2507 · 235.1B | 80.0% |
| 47 | o4-mini-2025-04-16_high · proprietary | 79.6% |
| 48 | claude-sonnet-4-5-20250929_16K · proprietary | 78.8% |
| 49 | claude-3-7-sonnet-20250219_64K · proprietary | 78.5% |
| 50 | gpt-5.4-nano-2026-03-17_high · proprietary | 78.5% |
| 51 | claude-sonnet-4-20250514_32K · proprietary | 78.3% |
| 52 | claude-sonnet-4-20250514_59K · proprietary | 77.8% |
| 53 | claude-opus-4-1-20250805_16K · proprietary | 77.3% |
| 54 | o3-mini-2025-01-31_high · proprietary | 77.0% |
| 55 | claude-3-7-sonnet-20250219_16K · proprietary | 76.8% |
| 56 | claude-3-7-sonnet-20250219_32K · proprietary | 76.8% |
| 57 | claude-opus-4-1-20250805_27K · proprietary | 76.8% |
| 58 | o1-2024-12-17_high · proprietary | 76.8% |
| 59 | DeepSeek R1 0528 · 684.5B | 76.3% |
| 60 | claude-opus-4-20250514_16K · proprietary | 76.3% |
| 61 | grok-3-mini-beta_low · proprietary | 76.3% |
| 62 | claude-sonnet-4-20250514_16K · proprietary | 75.8% |
| 63 | o1-2024-12-17_medium · proprietary | 75.8% |
| 64 | openai/gpt-oss-120b_high · proprietary | 75.8% |
| 65 | gpt-5-mini-2025-08-07_high · proprietary | 75.0% |
| 66 | grok-3-mini-beta_high · proprietary | 74.6% |
| 67 | o3-mini-2025-01-31_medium · proprietary | 74.3% |
| 68 | claude-sonnet-4-5-20250929 · proprietary | 73.7% |
| 69 | claude-opus-4-1-20250805 · proprietary | 73.2% |
| 70 | qwen3-max-2025-09-23 · proprietary | 72.6% |
| 71 | gpt-5-mini-2025-08-07_medium · proprietary | 71.7% |
| 72 | claude-haiku-4-5-20251001_32K · proprietary | 71.2% |
| 73 | Qwen3 235B A22B · 235.1B | 70.7% |
| 74 | gpt-5-nano-2025-08-07_high · proprietary | 69.4% |
| 75 | DeepSeek R1 · 684.5B | 69.2% |
| 76 | claude-opus-4-20250514 · proprietary | 69.2% |
| 77 | gpt-4.5-preview-2025-02-27 · proprietary | 68.7% |
| 78 | DeepSeek v3 0324 · 684.5B | 67.6% |
| 79 | grok-3-beta · proprietary | 67.6% |
| 80 | gpt-5-nano-2025-08-07_medium · proprietary | 67.4% |
| 81 | Llama-4-Maverick-17B-128E-Instruct-FP8 · proprietary | 67.0% |
| 82 | gpt-4.1-2025-04-14 · proprietary | 66.9% |
| 83 | claude-sonnet-4-20250514 · proprietary | 66.7% |
| 84 | gemini-2.5-pro-preview-05-06 · proprietary | 66.7% |
| 85 | claude-3-7-sonnet-20250219 · proprietary | 66.0% |
| 86 | gpt-4.1-mini-2025-04-14 · proprietary | 65.8% |
| 87 | gemini-2.0-pro-exp-02-05 · proprietary | 65.7% |
| 88 | qwq-plus · proprietary | 65.4% |
| 89 | gemini-2.0-flash-001 · proprietary | 64.1% |
| 90 | o1-mini-2024-09-12_high · proprietary | 62.4% |
| 91 | claude-haiku-4-5-20251001 · proprietary | 60.5% |
| 92 | mistral-medium-2505 · proprietary | 59.5% |
| 93 | o1-mini-2024-09-12_medium · proprietary | 59.5% |
| 94 | gemini-1.5-pro-002 · proprietary | 57.2% |
| 95 | gemini-2.0-flash-thinking-exp-01-21 · proprietary | 57.1% |
| 96 | DeepSeek-V3 · proprietary | 56.5% |
| 97 | qwen-max-2025-01-25 · proprietary | 56.1% |
| 98 | Phi 4 · 14.7B | 56.1% |
| 99 | DeepSeek R1 Distill Llama 70B · 70B | 55.7% |
| 100 | claude-3-5-sonnet-20241022 · proprietary | 55.3% |
| 101 | claude-3-5-sonnet-20240620 · proprietary | 54.0% |
| 102 | grok-2-1212 · proprietary | 53.8% |
| 103 | Llama 4 Scout 17B 16E Instruct · 108.6B | 51.8% |
| 104 | mistral-large-2411 · proprietary | 51.3% |
| 105 | Llama-3.1-405B-Instruct · proprietary | 50.9% |
| 106 | o1-preview-2024-09-12 · proprietary | 50.3% |
| 107 | gpt-4o-2024-08-06 · proprietary | 49.2% |
| 108 | Qwen2.5 72B Instruct · 72.7B | 49.1% |
| 109 | mistral-large-2407 · proprietary | 49.0% |
| 110 | gpt-4.1-nano-2025-04-14 · proprietary | 48.9% |
| 111 | gpt-4o-2024-05-13 · proprietary | 48.9% |
| 112 | Gemma 3 27B IT · 27.4B | 48.9% |
| 113 | Magistral Small 2506 · 23.6B | 48.4% |
| 114 | qwen-plus-2025-01-25 · proprietary | 48.1% |
| 115 | gpt-4o-2024-11-20 · proprietary | 47.9% |
| 116 | mistral-small-2503 · proprietary | 47.5% |
| 117 | Llama 3.3 70B Instruct · 70.6B | 47.4% |
| 118 | gemini-1.5-flash-002 · proprietary | 47.3% |
| 119 | claude-3-opus-20240229 · proprietary | 47.2% |
| 120 | gpt-4-turbo-2024-04-09 · proprietary | 46.6% |
| 121 | Llama-3.1-Tulu-3-70B-DPO · proprietary | 46.3% |
| 122 | Qwen2.5 32B Instruct · 32B | 46.1% |
| 123 | gemini-1.5-pro-001 · proprietary | 45.9% |
| 124 | mistral-small-2501 · proprietary | 45.3% |
| 125 | DeepSeek R1 Distill Qwen 14B · 14.8B | 44.7% |
| 126 | Llama 3.1 70B Instruct · 70.6B | 44.2% |
| 127 | WizardLM-2-8x22B · proprietary | 43.4% |
| 128 | gpt-4-1106-preview · proprietary | 42.4% |
| 129 | gpt-4-0125-preview · proprietary | 42.3% |
| 130 | qwen-turbo-2024-11-01 · proprietary | 41.8% |
| 131 | Llama-3.2-90B-Vision-Instruct · proprietary | 41.0% |
| 132 | qwen2-72b-instruct · proprietary | 40.8% |
| 133 | claude-3-sonnet-20240229 · proprietary | 40.6% |
| 134 | Meta Llama 3 70B Instruct · 70.6B | 40.6% |
| 135 | gemini-1.5-flash-001 · proprietary | 40.4% |
| 136 | mistral-large-2402 · proprietary | 38.8% |
| 137 | claude-3-5-haiku-20241022 · proprietary | 38.1% |
| 138 | gpt-4o-mini-2024-07-18 · proprietary | 37.7% |
| 139 | Hermes-2-Theta-Llama-3-70B · proprietary | 37.5% |
| 140 | Gemma 2 27B IT · 27.2B | 36.5% |
| 141 | claude-3-haiku-20240307 · proprietary | 36.3% |
| 142 | gpt-4-0314 · proprietary | 35.7% |
| 143 | claude-2.0 · proprietary | 34.7% |
| 144 | open-mixtral-8x22b · proprietary | 34.1% |
| 145 | gemini-1.0-pro-001 · proprietary | 34.0% |
| 146 | Eurus-2-7B-PRIME · proprietary | 33.9% |
| 147 | claude-2.1 · proprietary | 33.0% |
| 148 | gemini-1.5-flash-8b-001 · proprietary | 33.0% |
| 149 | dbrx-instruct · proprietary | 32.9% |
| 150 | Yi 1.5 34B Chat · 34.4B | 32.0% |
| 151 | qwen1.5-32b-chat · proprietary | 30.7% |
| 152 | gpt-4-0613 · proprietary | 30.6% |
| 153 | Mixtral 8x7B Instruct v0.1 · 46.7B | 30.6% |
| 154 | open-mistral-nemo-2407 · proprietary | 29.9% |
| 155 | open-mixtral-8x7b · proprietary | 29.8% |
| 156 | qwen1.5-72b-chat · proprietary | 28.8% |
| 157 | gpt-3.5-turbo-1106 · proprietary | 28.0% |
| 158 | Phi-3-medium-128k-instruct · proprietary | 27.6% |
| 159 | Gemma 2 9B IT · 9.2B | 27.5% |
| 160 | gpt-3.5-turbo-0125 · proprietary | 27.2% |
| 161 | ministral-8b-2410 · proprietary | 27.2% |
| 162 | Llama 2 70B Chat HF · 69.0B | 26.3% |
| 163 | Meta Llama 3 8B Instruct · 8.0B | 26.1% |
| 164 | Llama 3.1 8B Instruct · 8.0B | 25.9% |
| 165 | ministral-3b-2410 · proprietary | 25.3% |
| 166 | Deepseek Llm 67B Chat · 67B | 24.6% |
| 167 | Mistral 7B Instruct v0.3 · 7.2B | 15.2% |
| 168 | Yi-34B-Chat · proprietary | 14.7% |
| 169 | open-mistral-7b · proprietary | 13.2% |
Score vs model size
These models give the most quality for their size, which makes them the ones worth running locally. Each one is on the efficiency frontier: it has the best score at its size or smaller.
- Mistral 7B Instruct v0.3, 7B, score 15.2%
- Meta Llama 3 8B Instruct, 8B, score 26.1%
- Gemma 2 9B IT, 9B, score 27.5%
- Phi 4, 15B, score 56.1%
- Qwen3 235B A22B Thinking 2507, 235B, score 80.0%
- GLM 4.7, 358B, score 83.3%
- GLM 5, 754B, score 87.8%
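The frontier rule (best score at its size or smaller) can be sketched in a few lines of Python. This is only an illustration, not the site's own implementation: the function name `efficiency_frontier`, the tuple layout, and the choice to use a subset of the open-model rows from the table are all assumptions made for the example.

```python
from typing import List, Tuple

# (name, size in billions of parameters, GPQA Diamond score in %)
Model = Tuple[str, float, float]

# A subset of the open-model rows from the table above, not the full list.
OPEN_MODELS: List[Model] = [
    ("GLM 5", 753.9, 87.8),
    ("GLM 5.1", 753.9, 85.5),
    ("GLM 4.7", 358.3, 83.3),
    ("Qwen3 235B A22B Thinking 2507", 235.1, 80.0),
    ("DeepSeek R1 0528", 684.5, 76.3),
    ("Qwen3 235B A22B", 235.1, 70.7),
    ("Phi 4", 14.7, 56.1),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 55.7),
    ("Gemma 3 27B IT", 27.4, 48.9),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 44.7),
    ("Gemma 2 9B IT", 9.2, 27.5),
    ("Meta Llama 3 8B Instruct", 8.0, 26.1),
    ("Llama 3.1 8B Instruct", 8.0, 25.9),
    ("Mistral 7B Instruct v0.3", 7.2, 15.2),
]


def efficiency_frontier(models: List[Model]) -> List[Model]:
    """Keep each model that outscores every model of equal or smaller size."""
    frontier: List[Model] = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first.
    for model in sorted(models, key=lambda m: (m[1], -m[2])):
        if model[2] > best_so_far:
            frontier.append(model)
            best_so_far = model[2]
    return frontier


if __name__ == "__main__":
    for name, size, score in efficiency_frontier(OPEN_MODELS):
        print(f"{name}: {size}B, {score}%")
```

On this subset the function returns the same seven models listed above, from Mistral 7B Instruct v0.3 up to GLM 5. Larger models that score lower than a smaller one, such as DeepSeek R1 0528 at 684.5B, are dropped.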
GPQA Diamond: frequently asked questions
- What is the best open LLM on GPQA Diamond?
- GLM 5 is the top open model on GPQA Diamond, scoring 87.8%. Among all models tested — including proprietary ones — it ranks #18.
- What's the best GPQA Diamond model you can run on a 24 GB GPU?
- Phi 4 is the highest-scoring open model that fits in 24 GB: at 4-bit quantization it needs about 8 GB, and it scores 56.1% on GPQA Diamond.
- What's the best GPQA Diamond model you can run on a 12 GB GPU?
- Phi 4 is the highest-scoring open model that fits in 12 GB: at 4-bit quantization it needs about 8 GB, and it scores 56.1% on GPQA Diamond.
- Can open models match proprietary models on GPQA Diamond?
- Not quite: the strongest proprietary model (gpt-5.4-pro-2026-03-05_xhigh) scores 94.6%, which is 6.8 percentage points ahead of the best open model (GLM 5) at 87.8%. The open model, however, is one you can run yourself.
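The GPU answers above rest on simple arithmetic: weight memory is roughly the parameter count times the bits per weight, divided by 8 to get bytes. The sketch below makes that explicit under stated assumptions. It counts 1 GB as 10^9 bytes, adds a flat 1 GB allowance for everything besides the weights (a made-up placeholder, since real overhead depends on context length and runtime), and uses hypothetical function names. For Phi 4 the weights alone come to 7.35 GB, which is in line with the "about 8 GB" quoted above once some overhead is added.

```python
from typing import List, Optional, Tuple

# (name, size in billions of parameters, GPQA Diamond score in %)
Model = Tuple[str, float, float]

# A few open-model rows from the table above, not the full list.
CANDIDATES: List[Model] = [
    ("Phi 4", 14.7, 56.1),
    ("DeepSeek R1 Distill Llama 70B", 70.0, 55.7),
    ("Gemma 3 27B IT", 27.4, 48.9),
    ("Magistral Small 2506", 23.6, 48.4),
    ("Gemma 2 9B IT", 9.2, 27.5),
]


def weight_memory_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def best_model_for_vram(
    models: List[Model],
    vram_gb: float,
    bits_per_weight: int = 4,
    overhead_gb: float = 1.0,  # assumed allowance for KV cache, activations, etc.
) -> Optional[Model]:
    """Return the highest-scoring model whose estimated footprint fits in vram_gb."""
    fitting = [
        m for m in models
        if weight_memory_gb(m[1], bits_per_weight) + overhead_gb <= vram_gb
    ]
    return max(fitting, key=lambda m: m[2]) if fitting else None


if __name__ == "__main__":
    for budget in (12, 24):
        print(budget, "GB:", best_model_for_vram(CANDIDATES, budget))
```

With these rows, both the 12 GB and 24 GB budgets select Phi 4. The 27B model fits in 24 GB but scores lower, and the 70B model needs about 35 GB for weights alone at 4 bits.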
Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.