Knowledge
MMLU-Pro Leaderboard
MMLU-Pro is a harder, cleaned-up successor to MMLU with ten answer choices and more reasoning-heavy questions across 14 subjects, measuring broad knowledge and reasoning together.
Source: tigerlab · 97 open models ranked + 163 proprietary
Open models ranked on MMLU-Pro
The # column shows rank among open models / rank overall (including proprietary).
| # | Model | Score |
|---|---|---|
| 1 / 6 | MiniMax M2.1 · 228.7B | 88.0% |
| 2 / 21 | Qwen3.5 122B A10B · 125.1B | 86.7% |
| 3 / 30 | GLM 5 · 753.9B | 86.0% |
| 4 / 36 | GLM 4.5 · 358.3B | 84.6% |
| 5 / 38 | Qwen3 235B A22B Thinking 2507 · 235.1B | 84.5% |
| 6 / 41 | DeepSeek R1 · 684.5B | 84.0% |
| 7 / 46 | DeepSeek R1 0528 · 684.5B | 83.4% |
| 8 / 49 | Qwen3 235B A22B Instruct 2507 · 235.1B | 83.0% |
| 9 / 51 | LongCat Flash Chat · 561.9B | 82.7% |
| 10 / 52 | Seed OSS 36B Instruct · 36.2B | 82.7% |
| 11 / 53 | Qwen3.5 9B · 9.7B | 82.5% |
| 12 / 54 | MiniMax M2 · 228.7B | 82.0% |
| 13 / 56 | GLM 4.5 Air · 110.5B | 81.4% |
| 14 / 57 | DeepSeek v3 0324 · 684.5B | 81.3% |
| 15 / 59 | Kimi K2 Instruct · 1026.5B | 81.0% |
| 16 / 60 | Qwen3 30B A3B Thinking 2507 · 30.5B | 80.9% |
| 17 / 65 | MiniMax M2.5 · 228.7B | 80.1% |
| 18 / 69 | Qwen3.5 4B · 4.7B | 79.1% |
| 19 / 82 | Phi 4 Reasoning Plus · 14.7B | 76.0% |
| 20 / 83 | DeepSeek v3 · 684.5B | 75.9% |
| 21 / 84 | MiniMax Text 01 · 456.1B | 75.7% |
| 22 / 89 | Phi 4 Reasoning · 14.7B | 74.3% |
| 23 / 91 | Llama 3.1 405B Instruct · 405.9B | 73.3% |
| 24 / 97 | Qwen2.5 72B · 72.7B | 71.6% |
| 25 / 99 | QwQ 32B Preview · 32.8B | 71.0% |
| 26 / 100 | Phi 4 · 14.7B | 70.4% |
| 27 / 104 | Qwen2.5 32B · 32.8B | 69.2% |
| 28 / 106 | QwQ 32B · 32.8B | 69.1% |
| 29 / 109 | Qwen3 235B A22B · 235.1B | 68.2% |
| 30 / 111 | Gemma 3 27B IT · 27.4B | 67.5% |
| 31 / 116 | Llama 3.3 70B Instruct · 70.6B | 65.9% |
| 32 / 122 | Qwen2 72B Instruct · 72.7B | 64.4% |
| 33 / 126 | Qwen2.5 14B · 14.8B | 63.7% |
| 34 / 127 | DeepSeek Coder v2 Instruct · 235.7B | 63.6% |
| 35 / 128 | Higgs Llama 3 70B · 70.6B | 63.2% |
| 36 / 131 | Llama 3.1 70B Instruct · 70.6B | 62.8% |
| 37 / 132 | Llama 3.1 Nemotron 70B Instruct HF · 70.6B | 62.8% |
| 38 / 136 | Qwen3 30B A3B Base · 30.5B | 61.7% |
| 39 / 137 | Llama 3.1 405B · 405.9B | 61.6% |
| 40 / 138 | Gemma 3 12B IT · 12.2B | 60.6% |
| 41 / 141 | Reflection Llama 3.1 70B · 70B | 60.4% |
| 42 / 146 | MiMo 7B RL · 7.8B | 58.6% |
| 43 / 149 | InternLM3 8B Instruct · 8.8B | 57.6% |
| 44 / 152 | Gemma 2 27B IT · 27.2B | 56.5% |
| 45 / 154 | Meta Llama 3 70B Instruct · 70.6B | 56.2% |
| 46 / 163 | Qwen1.5 72B Chat · 72.3B | 52.6% |
| 47 / 164 | Llama 3.1 70B · 70.6B | 52.5% |
| 48 / 165 | Yi 1.5 34B Chat · 34.4B | 52.3% |
| 49 / 166 | Gemma 2 9B IT · 9.2B | 52.1% |
| 50 / 171 | Mistral Small Instruct 2409 · 22.2B | 48.4% |
| 51 / 174 | Phi 3.5 Mini Instruct · 3.8B | 47.9% |
| 52 / 181 | Gemma 2 9B · 9.2B | 45.1% |
| 53 / 182 | Qwen2.5 7B · 7.6B | 45.0% |
| 54 / 183 | Mistral Nemo Instruct 2407 · 12.2B | 44.8% |
| 55 / 184 | Llama 3.1 8B Instruct · 8.0B | 44.3% |
| 56 / 187 | Qwen2.5 3B · 3.1B | 43.7% |
| 57 / 188 | Gemma 3 4B IT · 4B | 43.6% |
| 58 / 190 | Mixtral 8x7B Instruct v0.1 · 46.7B | 43.3% |
| 59 / 191 | Yi 34B · 34.4B | 43.0% |
| 60 / 194 | MiMo 7B Base · 7.8B | 41.9% |
| 61 / 195 | DeepSeek Coder v2 Lite Instruct · 15.7B | 41.6% |
| 62 / 197 | Mixtral 8x7B v0.1 · 46.7B | 41.0% |
| 63 / 202 | WizardLM 2 8x22B · 140.6B | 39.2% |
| 64 / 204 | Yi 1.5 6B Chat · 6.1B | 38.2% |
| 65 / 205 | Qwen1.5 14B Chat · 14.2B | 38.0% |
| 66 / 207 | C4AI Command R v01 · 35.0B | 37.9% |
| 67 / 209 | Llama 2 70B HF · 69.0B | 37.5% |
| 68 / 214 | Llama 3.1 8B · 8.0B | 36.6% |
| 69 / 217 | DeepSeek Coder v2 Lite Base · 15.7B | 34.4% |
| 70 / 218 | Aya Expanse 8B · 8.0B | 33.7% |
| 71 / 219 | Gemma 7B · 8.5B | 33.7% |
| 72 / 222 | Zephyr 7B Beta · 7.2B | 33.0% |
| 73 / 223 | Qwen2.5 1.5B · 1.5B | 32.1% |
| 74 / 226 | Mistral 7B v0.1 · 7B | 30.9% |
| 75 / 227 | Mistral 7B Instruct v0.2 · 7B | 30.8% |
| 76 / 228 | Mistral 7B v0.2 · 7.2B | 30.4% |
| 77 / 229 | Qwen3.5 0.8B · 873M | 29.7% |
| 78 / 230 | Qwen1.5 7B Chat · 7.7B | 29.1% |
| 79 / 231 | Yi 6B Chat · 6.1B | 28.8% |
| 80 / 233 | Yi 6B · 6.1B | 26.5% |
| 81 / 235 | Mistral 7B Instruct v0.1 · 7B | 25.8% |
| 82 / 237 | Llama 2 13B HF · 13.0B | 25.3% |
| 83 / 239 | Llemma 7B · 7B | 23.4% |
| 84 / 241 | Qwen2 1.5B · 1.5B | 22.6% |
| 85 / 242 | Llama 3.2 3B · 3.2B | 22.2% |
| 86 / 245 | Llama 2 7B · 7B | 20.3% |
| 87 / 246 | SmolLM2 1.7B · 1.7B | 18.3% |
| 88 / 248 | Gemma 2B · 2.5B | 15.8% |
| 89 / 249 | Gemma 2 2B IT · 2.6B | 15.6% |
| 90 / 251 | Qwen2.5 0.5B · 494M | 14.9% |
| 91 / 252 | Gemma 3 1B IT · 1000M | 14.7% |
| 92 / 254 | Granite 3.1 1B A400m Base · 1.3B | 12.3% |
| 93 / 255 | Llama 3.2 1B · 1.2B | 11.9% |
| 94 / 256 | SmolLM 1.7B · 1.7B | 11.9% |
| 95 / 257 | SmolLM2 360M · 362M | 11.4% |
| 96 / 258 | SmolLM 135M · 135M | 11.2% |
| 97 / 260 | SmolLM2 135M · 135M | 10.8% |
Score vs model size
Which models give the most quality for their size — the ones worth running locally.
Every model listed here is on the efficiency frontier: it has the best score among models of its size or smaller. Sizes are the parameter counts from the table above.
- SmolLM 135M · 135M · 11.2%
- SmolLM2 360M · 362M · 11.4%
- Qwen2.5 0.5B · 494M · 14.9%
- Qwen3.5 0.8B · 873M · 29.7%
- Qwen2.5 1.5B · 1.5B · 32.1%
- Qwen2.5 3B · 3.1B · 43.7%
- Phi 3.5 Mini Instruct · 3.8B · 47.9%
- Qwen3.5 4B · 4.7B · 79.1%
- Qwen3.5 9B · 9.7B · 82.5%
- Seed OSS 36B Instruct · 36.2B · 82.7%
- Qwen3.5 122B A10B · 125.1B · 86.7%
- MiniMax M2.1 · 228.7B · 88.0%
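The frontier rule used in this list (best score at its size or smaller) can be sketched in a few lines of Python. This is only an illustration, not the page's actual implementation: the `Entry` type and `efficiency_frontier` function are hypothetical names, and treating a tie in score as "not on the frontier" is an assumption. The sample rows are copied from the table above, with sizes converted to billions of parameters.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    params_b: float  # parameter count in billions
    score: float     # MMLU-Pro score in percent


def efficiency_frontier(entries):
    """Keep entries that score strictly higher than every equal-or-smaller model."""
    frontier = []
    best = float("-inf")
    # Sort by size ascending; at equal size, visit the higher score first
    # so the weaker same-size model is filtered out.
    for e in sorted(entries, key=lambda e: (e.params_b, -e.score)):
        if e.score > best:  # assumption: ties do not count as frontier points
            frontier.append(e)
            best = e.score
    return frontier


# A few rows from the leaderboard table (smallest models only).
rows = [
    Entry("SmolLM 135M", 0.135, 11.2),
    Entry("SmolLM2 135M", 0.135, 10.8),
    Entry("SmolLM2 360M", 0.362, 11.4),
    Entry("Qwen2.5 0.5B", 0.494, 14.9),
    Entry("Qwen3.5 0.8B", 0.873, 29.7),
    Entry("Gemma 3 1B IT", 1.0, 14.7),
    Entry("Llama 3.2 1B", 1.2, 11.9),
]

frontier_names = [e.name for e in efficiency_frontier(rows)]
print(frontier_names)
```

On these seven rows the sketch keeps SmolLM 135M, SmolLM2 360M, Qwen2.5 0.5B and Qwen3.5 0.8B, matching the first four entries of the list. Gemma 3 1B IT and Llama 3.2 1B drop out because a smaller model already scores higher.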
MMLU-Pro: frequently asked questions
- What is the best open LLM on MMLU-Pro?
- MiniMax M2.1 is the top open model on MMLU-Pro, scoring 88.0%. Among all models tested — including proprietary ones — it ranks #6. The top model overall is Gemini-3.1-Pro (Google) at 91.2%.
- What's the best MMLU-Pro model you can run on a 24 GB GPU?
- Seed OSS 36B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 20 GB), scoring 82.7% on MMLU-Pro.
- What's the best MMLU-Pro model you can run on a 12 GB GPU?
- Qwen3.5 9B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 5 GB), scoring 82.5% on MMLU-Pro.
- Can open models match proprietary models on MMLU-Pro?
- Not quite. The strongest proprietary model (Gemini-3.1-Pro) scores 91.2%, 3.2 points ahead of the best open model (MiniMax M2.1) at 88.0%, but you can run the open one yourself.
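The 24 GB and 12 GB answers above rest on an estimate of how much memory a model needs at 4-bit quantization. The page does not say how it computes that estimate, so the following is only a rough rule-of-thumb sketch: weights take `params × bits / 8` bytes, plus an assumed 10% overhead. The overhead factor and the helper names `estimate_vram_gb` and `best_fit` are hypothetical; the factor was picked because it roughly reproduces the "about 20 GB" and "about 5 GB" figures quoted above. Real memory use also depends on context length and runtime.

```python
def estimate_vram_gb(params_b, bits=4, overhead=1.1):
    """Rough memory estimate in GB for a model with params_b billion parameters.

    overhead=1.1 is an assumed fudge factor, not a figure from the source.
    """
    weights_gb = params_b * bits / 8  # 1e9 params * (bits / 8) bytes ~ GB
    return weights_gb * overhead


def best_fit(models, vram_gb, bits=4):
    """Return the highest-scoring (name, params_b, score) tuple that fits, or None."""
    fitting = [m for m in models if estimate_vram_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)


# (name, parameters in billions, MMLU-Pro score) copied from the table above.
models = [
    ("MiniMax M2.1", 228.7, 88.0),
    ("Seed OSS 36B Instruct", 36.2, 82.7),
    ("Qwen3.5 9B", 9.7, 82.5),
    ("Qwen3 30B A3B Thinking 2507", 30.5, 80.9),
    ("Phi 4 Reasoning Plus", 14.7, 76.0),
]

for budget in (24, 12):
    pick = best_fit(models, budget)
    print(budget, "GB:", pick[0], f"~{estimate_vram_gb(pick[1]):.0f} GB")
```

Under these assumptions the sketch picks Seed OSS 36B Instruct for a 24 GB budget and Qwen3.5 9B for a 12 GB budget, in line with the answers above.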
Scores aggregated from tigerlab. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.