MMLU-Pro Leaderboard
MMLU-Pro is a harder, cleaned-up successor to MMLU with ten answer choices and more reasoning-heavy questions across 14 subjects, measuring broad knowledge and reasoning together.
Source: tigerlab · 97 open models ranked + 163 proprietary
All models ranked on MMLU-Pro
Proprietary / closed models are marked "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | Gemini-3.1-Pro · proprietary | 91.2% |
| 2 | Gemini-3-Pro(11/25) · proprietary | 90.1% |
| 3 | GPT-o1 · proprietary | 89.3% |
| 4 | Claude-4.6-Opus(Thinking) · proprietary | 89.1% |
| 5 | Gemini-3-Flash(12/25) · proprietary | 88.6% |
| 6 | MiniMax M2.1 · 228.7B | 88.0% |
| 7 | Qwen3.5-397B-A17B · proprietary | 87.8% |
| 8 | Seed2.0-Lite · proprietary | 87.7% |
| 9 | GPT-5.4 · proprietary | 87.5% |
| 10 | Claude-4.5-Sonnet(Thinking) · proprietary | 87.4% |
| 11 | GPT-5.2 · proprietary | 87.4% |
| 12 | Claude-4-Opus-Thinking · proprietary | 87.3% |
| 13 | Claude-4.5-Opus(Thinking) · proprietary | 87.3% |
| 14 | Claude-4.6-Sonnet(Thinking) · proprietary | 87.3% |
| 15 | Hunyuan-T1 · proprietary | 87.2% |
| 16 | GPT-5(high) · proprietary | 87.1% |
| 17 | K2.5-1T-A32B · proprietary | 87.1% |
| 18 | Grok-4 · proprietary | 87.0% |
| 19 | Seed-Thinking-v1.5 · proprietary | 87.0% |
| 20 | Seed2.0-Pro · proprietary | 87.0% |
| 21 | Qwen3.5 122B A10B · 125.1B | 86.7% |
| 22 | Seed1.6-Base · proprietary | 86.6% |
| 23 | Seed1.6-Thinking · proprietary | 86.6% |
| 24 | GPT-5.1 · proprietary | 86.4% |
| 25 | Seed1.6-Ada-Thinking · proprietary | 86.4% |
| 26 | Gemini-3.1-Flash-Lite-Preview · proprietary | 86.2% |
| 27 | GPT-4.5 · proprietary | 86.1% |
| 28 | Qwen3.5-27B · proprietary | 86.1% |
| 29 | Gemini-2.5-Pro · proprietary | 86.0% |
| 30 | GLM 5 · 753.9B | 86.0% |
| 31 | Qwen3-Max-Thinking · proprietary | 85.7% |
| 32 | Qwen3.5-35B-A3B · proprietary | 85.3% |
| 33 | DeepSeek-V3.2-Thinking · proprietary | 85.0% |
| 34 | GPT-o3-high · proprietary | 85.0% |
| 35 | DeepSeek-V3.1-Thinking · proprietary | 84.8% |
| 36 | GLM 4.5 · 358.3B | 84.6% |
| 37 | Gemini-2.5-Pro-Exp-03-25 · proprietary | 84.5% |
| 38 | Qwen3 235B A22B Thinking 2507 · 235.1B | 84.5% |
| 39 | Grok-4.1-Fast(Reasoning) · proprietary | 84.2% |
| 40 | Claude-3.7-Sonnet-Thinking · proprietary | 84.0% |
| 41 | DeepSeek R1 · 684.5B | 84.0% |
| 42 | Claude-4-Sonnet · proprietary | 83.7% |
| 43 | DeepSeek-V3.1-NonThinking · proprietary | 83.7% |
| 44 | Seed2.0-Mini · proprietary | 83.6% |
| 45 | Intern-S1 · proprietary | 83.5% |
| 46 | DeepSeek R1 0528 · 684.5B | 83.4% |
| 47 | GPT-4-mini (high) · proprietary | 83.0% |
| 48 | Grok-3-mini · proprietary | 83.0% |
| 49 | Qwen3 235B A22B Instruct 2507 · 235.1B | 83.0% |
| 50 | Llama4-Behemoth · proprietary | 82.8% |
| 51 | LongCat Flash Chat · 561.9B | 82.7% |
| 52 | Seed OSS 36B Instruct · 36.2B | 82.7% |
| 53 | Qwen3.5 9B · 9.7B | 82.5% |
| 54 | MiniMax M2 · 228.7B | 82.0% |
| 55 | GPT-4.1 · proprietary | 81.8% |
| 56 | GLM 4.5 Air · 110.5B | 81.4% |
| 57 | DeepSeek v3 0324 · 684.5B | 81.3% |
| 58 | MiniMax-M1 · proprietary | 81.1% |
| 59 | Kimi K2 Instruct · 1026.5B | 81.0% |
| 60 | Qwen3 30B A3B Thinking 2507 · 30.5B | 80.9% |
| 61 | GPT-oss-120B(high) · proprietary | 80.8% |
| 62 | Llama4-Maverick · proprietary | 80.5% |
| 63 | GPT-o1-mini · proprietary | 80.3% |
| 64 | Doubao-1.5-Pro · proprietary | 80.1% |
| 65 | MiniMax M2.5 · 228.7B | 80.1% |
| 66 | Grok3-Beta · proprietary | 79.9% |
| 67 | GPT-o3-mini · proprietary | 79.4% |
| 68 | Gemini-2.0-Pro · proprietary | 79.1% |
| 69 | Qwen3.5 4B · 4.7B | 79.1% |
| 70 | HunyuanTurboS · proprietary | 79.0% |
| 71 | Grok3-mini-Beta · proprietary | 78.9% |
| 72 | Qwen3-30B-A3B-Thinking · proprietary | 78.5% |
| 73 | ERNIE-4.5-300B-A47B · proprietary | 78.4% |
| 74 | Nemotron-3-Nano-30B-A3B(BF16) · proprietary | 78.3% |
| 75 | Nemotron-3-Nano-30B-A3B(FP8) · proprietary | 78.1% |
| 76 | Claude-3.5-Sonnet (2024-10-22) · proprietary | 78.0% |
| 77 | GPT-4o (2024-11-20) · proprietary | 77.9% |
| 78 | Gemini-2.0-Flash · proprietary | 77.6% |
| 79 | Gemini-2.0-Flash-exp · proprietary | 76.2% |
| 80 | Claude-3.5-Sonnet (2024-06-20) · proprietary | 76.1% |
| 81 | Qwen2.5-Max · proprietary | 76.1% |
| 82 | Phi 4 Reasoning Plus · 14.7B | 76.0% |
| 83 | DeepSeek v3 · 684.5B | 75.9% |
| 84 | MiniMax Text 01 · 456.1B | 75.7% |
| 85 | Grok-2 · proprietary | 75.5% |
| 86 | Grok-4.1-Fast(Non-Reasoning) · proprietary | 75.2% |
| 87 | GPT-4o (2024-08-06) · proprietary | 74.7% |
| 88 | Llama4-Scout · proprietary | 74.3% |
| 89 | Phi 4 Reasoning · 14.7B | 74.3% |
| 90 | GPT-oss-20B(high) · proprietary | 73.6% |
| 91 | Llama 3.1 405B Instruct · 405.9B | 73.3% |
| 92 | GPT-oss-20B(medium) · proprietary | 73.1% |
| 93 | Athene-V2-Chat (0-shot) · proprietary | 73.1% |
| 94 | GPT-4o (2024-05-13) · proprietary | 72.5% |
| 95 | Grok-2-mini · proprietary | 71.9% |
| 96 | Gemini-2.0-Flash-Lite · proprietary | 71.6% |
| 97 | Qwen2.5 72B · 72.7B | 71.6% |
| 98 | ECHO_Ego_v2_14B · proprietary | 71.2% |
| 99 | QwQ 32B Preview · 32.8B | 71.0% |
| 100 | Phi 4 · 14.7B | 70.4% |
| 101 | Gemini-1.5-Pro-002 · proprietary | 70.3% |
| 102 | Athene-V2-Chat · proprietary | 70.2% |
| 103 | ERNIE-4.5-300B-A47B-Base · proprietary | 69.5% |
| 104 | Qwen2.5 32B · 32.8B | 69.2% |
| 105 | SkyThought-T1 · proprietary | 69.2% |
| 106 | QwQ 32B · 32.8B | 69.1% |
| 107 | Gemini-1.5-Pro · proprietary | 69.0% |
| 108 | Claude-3-Opus · proprietary | 68.5% |
| 109 | Qwen3 235B A22B · 235.1B | 68.2% |
| 110 | Mistral-Large-Instruct-2411 · proprietary | 67.9% |
| 111 | Gemma 3 27B IT · 27.4B | 67.5% |
| 112 | Hunyuan-A13B · proprietary | 67.3% |
| 113 | Mistral-3.1-Small · proprietary | 66.8% |
| 114 | General-Reasoner-14B · proprietary | 66.6% |
| 115 | Mistral-Small-instruct · proprietary | 66.3% |
| 116 | Llama 3.3 70B Instruct · 70.6B | 65.9% |
| 117 | Mistral-Large-Instruct-2407 · proprietary | 65.9% |
| 118 | DeepSeek-Chat-V2_5 · proprietary | 65.8% |
| 119 | Nemotron-3-Nano-30B-A3B-Base · proprietary | 65.1% |
| 120 | Seed-OSS-36B-Base(w/ syn.) · proprietary | 65.1% |
| 121 | Reka 3 · proprietary | 65.0% |
| 122 | Qwen2 72B Instruct · 72.7B | 64.4% |
| 123 | Gemini-1.5-Flash-002 · proprietary | 64.1% |
| 124 | magnum-72b-v1 · proprietary | 63.9% |
| 125 | GPT-4-Turbo · proprietary | 63.7% |
| 126 | Qwen2.5 14B · 14.8B | 63.7% |
| 127 | DeepSeek Coder v2 Instruct · 235.7B | 63.6% |
| 128 | Higgs Llama 3 70B · 70.6B | 63.2% |
| 129 | GPT-4o-mini · proprietary | 63.1% |
| 130 | azerogpt · proprietary | 63.1% |
| 131 | Llama 3.1 70B Instruct · 70.6B | 62.8% |
| 132 | Llama 3.1 Nemotron 70B Instruct HF · 70.6B | 62.8% |
| 133 | Yi-Lightning · proprietary | 62.4% |
| 134 | Claude-3-5-Haiku-20241022 · proprietary | 62.1% |
| 135 | RRD2.5-9B · proprietary | 61.8% |
| 136 | Qwen3 30B A3B Base · 30.5B | 61.7% |
| 137 | Llama 3.1 405B · 405.9B | 61.6% |
| 138 | Gemma 3 12B IT · 12.2B | 60.6% |
| 139 | Nemotron-H-56B-Base · proprietary | 60.5% |
| 140 | Seed-OSS-36B-Base(w/o syn.) · proprietary | 60.4% |
| 141 | Reflection Llama 3.1 70B · 70B | 60.4% |
| 142 | Hunyuan-Large · proprietary | 60.2% |
| 143 | Gemini-1.5-Flash · proprietary | 59.1% |
| 144 | EXAONE-3.5-32B-Instruct · proprietary | 58.9% |
| 145 | General-Reasoner-7B · proprietary | 58.9% |
| 146 | MiMo 7B RL · 7.8B | 58.6% |
| 147 | Yi-large · proprietary | 58.1% |
| 148 | NewenAI/Phi4-sft · proprietary | 57.7% |
| 149 | Internlm3 8B Instruct · 8.8B | 57.6% |
| 150 | Claude-3-Sonnet · proprietary | 56.8% |
| 151 | ERNIE-4.5-21B-A3B-Base · proprietary | 56.7% |
| 152 | Gemma 2 27B IT · 27.2B | 56.5% |
| 153 | Mixtral-8x22B-Instruct-v0.1 · proprietary | 56.3% |
| 154 | Meta Llama 3 70B Instruct · 70.6B | 56.2% |
| 155 | Phi3-medium-4k · proprietary | 55.7% |
| 156 | Qwen2.5-Turbo · proprietary | 55.6% |
| 157 | Qwen2-72B-32k · proprietary | 55.6% |
| 158 | Qwen3.5-2B · proprietary | 55.3% |
| 159 | Deepseek-V2-Chat · proprietary | 54.8% |
| 160 | Mistral-Small-base · proprietary | 54.4% |
| 161 | Phi-4-mini · proprietary | 52.8% |
| 162 | Llama-3-70B · proprietary | 52.8% |
| 163 | Qwen1.5 72B Chat · 72.3B | 52.6% |
| 164 | Llama 3.1 70B · 70.6B | 52.5% |
| 165 | Yi 1.5 34B Chat · 34.4B | 52.3% |
| 166 | Gemma 2 9B IT · 9.2B | 52.1% |
| 167 | Phi3-medium-128k · proprietary | 51.9% |
| 168 | MAmmoTH2-8x7B-Plus · proprietary | 50.4% |
| 169 | Qwen1.5-110B · proprietary | 49.9% |
| 170 | Jamba-1.5-Large · proprietary | 49.5% |
| 171 | Mistral Small Instruct 2409 · 22.2B | 48.4% |
| 172 | GLM-4-9B-Chat · proprietary | 48.0% |
| 173 | GLM-4-9B · proprietary | 47.9% |
| 174 | Phi 3.5 Mini Instruct · 3.8B | 47.9% |
| 175 | Qwen2-7B-Instruct · proprietary | 47.2% |
| 176 | Cohere-Aya-Vision · proprietary | 47.2% |
| 177 | EXAONE-3.5-7.8B-Instruct · proprietary | 46.2% |
| 178 | Yi-1.5-9B-Chat · proprietary | 46.0% |
| 179 | Phi3-mini-4k · proprietary | 45.7% |
| 180 | Aya-Expanse-32B · proprietary | 45.4% |
| 181 | Gemma 2 9B · 9.2B | 45.1% |
| 182 | Qwen2.5 7B · 7.6B | 45.0% |
| 183 | Mistral Nemo Instruct 2407 · 12.2B | 44.8% |
| 184 | Llama 3.1 8B Instruct · 8.0B | 44.3% |
| 185 | Nemotron-H-8B-Base · proprietary | 44.0% |
| 186 | Phi3-mini-128k · proprietary | 43.9% |
| 187 | Qwen2.5 3B · 3.1B | 43.7% |
| 188 | Gemma3 4B IT · 4B | 43.6% |
| 189 | MAmmoTH2-8B-Plus · proprietary | 43.4% |
| 190 | Mixtral 8x7B Instruct v0.1 · 46.7B | 43.3% |
| 191 | Yi 34B · 34.4B | 43.0% |
| 192 | Claude-3-Haiku-20240307 · proprietary | 42.3% |
| 193 | Mathstral-7B-v0.1 · proprietary | 42.0% |
| 194 | MiMo 7B Base · 7.8B | 41.9% |
| 195 | DeepSeek Coder v2 Lite Instruct · 15.7B | 41.6% |
| 196 | Granite-3.1-8B-Instruct · proprietary | 41.0% |
| 197 | Mixtral 8x7B v0.1 · 46.7B | 41.0% |
| 198 | Llama-3-8B-Instruct · proprietary | 41.0% |
| 199 | MAmmoTH2-7B-Plus · proprietary | 40.8% |
| 200 | Qwen2-7B · proprietary | 40.7% |
| 201 | Mistral-Nemo-Base-2407 · proprietary | 39.8% |
| 202 | WizardLM 2 8x22B · 140.6B | 39.2% |
| 203 | EXAONE-3.5-2.4B-Instruct · proprietary | 39.1% |
| 204 | Yi 1.5 6B Chat · 6.1B | 38.2% |
| 205 | Qwen1.5 14B Chat · 14.2B | 38.0% |
| 206 | Ministral-8B-Instruct-2410 · proprietary | 37.9% |
| 207 | C4ai Command R V01 · 35.0B | 37.9% |
| 208 | Staring-7B · proprietary | 37.9% |
| 209 | Llama 2 70B HF · 69.0B | 37.5% |
| 210 | OpenChat-3.5-8B · proprietary | 37.2% |
| 211 | InternMath-20B-Plus · proprietary | 37.1% |
| 212 | LLaDA · proprietary | 37.0% |
| 213 | Llama3-Smaug-8B · proprietary | 36.9% |
| 214 | Llama 3.1 8B · 8.0B | 36.6% |
| 215 | Llama-3-8B · proprietary | 35.4% |
| 216 | DeepseekMath-7B-Instruct · proprietary | 35.3% |
| 217 | DeepSeek Coder v2 Lite Base · 15.7B | 34.4% |
| 218 | Aya Expanse 8B · 8.0B | 33.7% |
| 219 | Gemma 7B · 8.5B | 33.7% |
| 220 | InternMath-7B-Plus · proprietary | 33.5% |
| 221 | Granite-3.1-8B-Base · proprietary | 33.1% |
| 222 | Zephyr 7B Beta · 7.2B | 33.0% |
| 223 | Qwen2.5 1.5B · 1.5B | 32.1% |
| 224 | Granite-3.1-2B-Instruct · proprietary | 32.0% |
| 225 | Granite-3.0-8B-Base · proprietary | 31.0% |
| 226 | Mistral 7B v0.1 · 7B | 30.9% |
| 227 | Mistral 7B Instruct v0.2 · 7B | 30.8% |
| 228 | Mistral 7B v0.2 · 7.2B | 30.4% |
| 229 | Qwen3.5 0.8B · 873M | 29.7% |
| 230 | Qwen1.5 7B Chat · 7.7B | 29.1% |
| 231 | Yi 6B Chat · 6.1B | 28.8% |
| 232 | Neo-7B-Instruct · proprietary | 28.7% |
| 233 | Yi 6B · 6.1B | 26.5% |
| 234 | Neo-7B · proprietary | 25.9% |
| 235 | Mistral 7B Instruct v0.1 · 7B | 25.8% |
| 236 | Granite-3.1-3B-A800M-Instruct · proprietary | 25.4% |
| 237 | Llama 2 13B HF · 13.0B | 25.3% |
| 238 | Granite-3.1-2B-Base · proprietary | 23.9% |
| 239 | Llemma 7B · 7B | 23.4% |
| 240 | Qwen2-1.5B-Instruct · proprietary | 22.6% |
| 241 | Qwen2 1.5B · 1.5B | 22.6% |
| 242 | Llama 3.2 3B · 3.2B | 22.2% |
| 243 | Granite-3.0-2B-Base · proprietary | 21.7% |
| 244 | Granite-3.1-3B-A800M-Base · proprietary | 20.4% |
| 245 | Llama 2 7B · 7B | 20.3% |
| 246 | SmolLM2 1.7B · 1.7B | 18.3% |
| 247 | Qwen2-0.5B-Instruct · proprietary | 15.9% |
| 248 | Gemma 2B · 2.5B | 15.8% |
| 249 | Gemma 2 2B IT · 2.6B | 15.6% |
| 250 | Qwen2-0.5B · proprietary | 15.0% |
| 251 | Qwen2.5 0.5B · 494M | 14.9% |
| 252 | Gemma 3 1B IT · 1000M | 14.7% |
| 253 | Granite-3.1-1B-A400M-Instruct · proprietary | 13.3% |
| 254 | Granite 3.1 1B A400m Base · 1.3B | 12.3% |
| 255 | Llama 3.2 1B · 1.2B | 11.9% |
| 256 | SmolLM 1.7B · 1.7B | 11.9% |
| 257 | SmolLM2 360M · 362M | 11.4% |
| 258 | SmolLM 135M · 135M | 11.2% |
| 259 | SmolLM-360M · proprietary | 10.9% |
| 260 | SmolLM2 135M · 135M | 10.8% |
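The table above can also be read programmatically. As a minimal sketch, assuming rows keep the `| rank | name · tag | score% |` shape shown here and that the tag `proprietary` is what separates closed models from open ones, the following Python finds the top open model and its gap to the overall leader. The `parse_row` helper and the `ROWS` sample are illustrative names, not part of llmrun or tigerlab.

```python
import re

# One table row: | rank | name · tag | score% |
# The tag is either "proprietary" or a parameter count such as "228.7B".
ROW = re.compile(r"^\|\s*(\d+)\s*\|\s*(.+?)\s*·\s*(.+?)\s*\|\s*([\d.]+)%\s*\|$")

def parse_row(line):
    """Parse one leaderboard row; return None for header or separator lines."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    rank, name, tag, score = m.groups()
    return {
        "rank": int(rank),
        "name": name,
        "open": tag != "proprietary",  # assumption: any other tag means open
        "score": float(score),
    }

# A few rows copied from the table above, used as sample input.
ROWS = """\
| # | Model | Score |
|---|---|---|
| 1 | Gemini-3.1-Pro · proprietary | 91.2% |
| 5 | Gemini-3-Flash(12/25) · proprietary | 88.6% |
| 6 | MiniMax M2.1 · 228.7B | 88.0% |
| 21 | Qwen3.5 122B A10B · 125.1B | 86.7% |
"""

parsed = [r for r in map(parse_row, ROWS.splitlines()) if r is not None]
top_overall = max(parsed, key=lambda r: r["score"])
top_open = max((r for r in parsed if r["open"]), key=lambda r: r["score"])
gap = round(top_overall["score"] - top_open["score"], 1)

print(top_open["name"], top_open["rank"], gap)
```

On the sample rows this selects MiniMax M2.1 (rank 6), 3.2 points behind Gemini-3.1-Pro.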
Score vs model size
Which models give the most quality for their size, and so are the ones worth running locally. Each model below is on the efficiency frontier: it has the best score among models of its size or smaller.
- SmolLM 135M (135M): 11.2%
- SmolLM2 360M (362M): 11.4%
- Qwen2.5 0.5B (494M): 14.9%
- Qwen3.5 0.8B (873M): 29.7%
- Qwen2.5 1.5B (1.5B): 32.1%
- Qwen2.5 3B (3.1B): 43.7%
- Phi 3.5 Mini Instruct (3.8B): 47.9%
- Qwen3.5 4B (4.7B): 79.1%
- Qwen3.5 9B (9.7B): 82.5%
- Seed OSS 36B Instruct (36.2B): 82.7%
- Qwen3.5 122B A10B (125.1B): 86.7%
- MiniMax M2.1 (228.7B): 88.0%
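The frontier rule (best score at its size or smaller) can be sketched in a few lines of Python. This is an illustration under assumptions, not the site's actual code: sizes are taken from the leaderboard table in billions of parameters, ties in score are treated as not improving the frontier, and the `efficiency_frontier` function name is hypothetical.

```python
def efficiency_frontier(models):
    """Return names of models that beat every model of equal or smaller size.

    `models` is a list of (name, size_in_billions, score) tuples.
    """
    frontier = []
    best = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first.
    for name, size_b, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best:  # assumption: a tie does not join the frontier
            frontier.append(name)
            best = score
    return frontier

# Small open models from the table above (size in billions, score in percent).
small_models = [
    ("SmolLM 135M", 0.135, 11.2),
    ("SmolLM2 135M", 0.135, 10.8),
    ("SmolLM2 360M", 0.362, 11.4),
    ("Qwen2.5 0.5B", 0.494, 14.9),
    ("Qwen3.5 0.8B", 0.873, 29.7),
    ("Gemma 3 1B IT", 1.0, 14.7),
    ("Llama 3.2 1B", 1.2, 11.9),
    ("Qwen2.5 1.5B", 1.5, 32.1),
]

print(efficiency_frontier(small_models))
```

For this subset the result is SmolLM 135M, SmolLM2 360M, Qwen2.5 0.5B, Qwen3.5 0.8B and Qwen2.5 1.5B, matching the first five entries of the list above. Gemma 3 1B IT and Llama 3.2 1B drop out because the smaller Qwen3.5 0.8B already scores higher.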
MMLU-Pro: frequently asked questions
- What is the best open LLM on MMLU-Pro?
  - MiniMax M2.1 is the top open model on MMLU-Pro, scoring 88.0%. Among all models tested, including proprietary ones, it ranks #6. The top model overall is Gemini-3.1-Pro (Google) at 91.2%.
- What's the best MMLU-Pro model you can run on a 24 GB GPU?
  - Seed OSS 36B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 20 GB), scoring 82.7% on MMLU-Pro.
- What's the best MMLU-Pro model you can run on a 12 GB GPU?
  - Qwen3.5 9B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 5 GB), scoring 82.5% on MMLU-Pro.
- Can open models match proprietary models on MMLU-Pro?
  - Not quite on MMLU-Pro: the strongest proprietary model (Gemini-3.1-Pro) scores 91.2%, ahead of the best open model (MiniMax M2.1) at 88.0%. But you can run the open one yourself.
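The page does not state how the 4-bit memory figures in the GPU answers above are computed. A rough sketch, assuming 0.5 bytes per parameter plus a 10% overhead (an assumption chosen because it roughly reproduces the quoted "about 20 GB" and "about 5 GB"), looks like this in Python. The `estimate_4bit_gb` and `best_fit` names are hypothetical, and real memory use also depends on context length and runtime.

```python
def estimate_4bit_gb(params_billion, overhead=0.10):
    """Rough memory estimate in GB for a model quantized to 4 bits.

    Assumption: 4 bits = 0.5 bytes per parameter, plus a flat overhead factor.
    """
    weights_gb = params_billion * 0.5
    return weights_gb * (1 + overhead)

def best_fit(models, vram_gb):
    """Return the highest-scoring (name, size, score) tuple that fits, or None."""
    fitting = [m for m in models if estimate_4bit_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)

# Open models from the leaderboard (size in billions, score in percent).
open_models = [
    ("Qwen3.5 122B A10B", 125.1, 86.7),
    ("GLM 4.5 Air", 110.5, 81.4),
    ("Seed OSS 36B Instruct", 36.2, 82.7),
    ("Qwen3 30B A3B Thinking 2507", 30.5, 80.9),
    ("Phi 4 Reasoning Plus", 14.7, 76.0),
    ("Qwen3.5 9B", 9.7, 82.5),
]

for vram in (24, 12):
    name, size_b, score = best_fit(open_models, vram)
    print(f"{vram} GB: {name} (~{estimate_4bit_gb(size_b):.0f} GB), {score}%")
```

Under these assumptions the sketch picks Seed OSS 36B Instruct for 24 GB and Qwen3.5 9B for 12 GB, in line with the answers above.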
Scores aggregated from tigerlab. llmrun does not run this benchmark; see the source for methodology, or the "about benchmarks" page for what it measures.