MMLU-Pro Leaderboard

MMLU-Pro is a harder, cleaned-up successor to MMLU with ten answer choices and more reasoning-heavy questions across 14 subjects, measuring broad knowledge and reasoning together.

Source: tigerlab · 97 open models ranked + 163 proprietary

Open models ranked on MMLU-Pro

The # column shows rank among open models / rank overall (including proprietary models).

# | Model · size | Score
1 / 6 | MiniMax M2.1 · 228.7B | 88.0%
2 / 21 | Qwen3.5 122B A10B · 125.1B | 86.7%
3 / 30 | GLM 5 · 753.9B | 86.0%
4 / 36 | GLM 4.5 · 358.3B | 84.6%
5 / 38 | Qwen3 235B A22B Thinking 2507 · 235.1B | 84.5%
6 / 41 | DeepSeek R1 · 684.5B | 84.0%
7 / 46 | DeepSeek R1 0528 · 684.5B | 83.4%
8 / 49 | Qwen3 235B A22B Instruct 2507 · 235.1B | 83.0%
9 / 51 | LongCat Flash Chat · 561.9B | 82.7%
10 / 52 | Seed OSS 36B Instruct · 36.2B | 82.7%
11 / 53 | Qwen3.5 9B · 9.7B | 82.5%
12 / 54 | MiniMax M2 · 228.7B | 82.0%
13 / 56 | GLM 4.5 Air · 110.5B | 81.4%
14 / 57 | DeepSeek v3 0324 · 684.5B | 81.3%
15 / 59 | Kimi K2 Instruct · 1026.5B | 81.0%
16 / 60 | Qwen3 30B A3B Thinking 2507 · 30.5B | 80.9%
17 / 65 | MiniMax M2.5 · 228.7B | 80.1%
18 / 69 | Qwen3.5 4B · 4.7B | 79.1%
19 / 82 | Phi 4 Reasoning Plus · 14.7B | 76.0%
20 / 83 | DeepSeek v3 · 684.5B | 75.9%
21 / 84 | MiniMax Text 01 · 456.1B | 75.7%
22 / 89 | Phi 4 Reasoning · 14.7B | 74.3%
23 / 91 | Llama 3.1 405B Instruct · 405.9B | 73.3%
24 / 97 | Qwen2.5 72B · 72.7B | 71.6%
25 / 99 | QwQ 32B Preview · 32.8B | 71.0%
26 / 100 | Phi 4 · 14.7B | 70.4%
27 / 104 | Qwen2.5 32B · 32.8B | 69.2%
28 / 106 | QwQ 32B · 32.8B | 69.1%
29 / 109 | Qwen3 235B A22B · 235.1B | 68.2%
30 / 111 | Gemma 3 27B IT · 27.4B | 67.5%
31 / 116 | Llama 3.3 70B Instruct · 70.6B | 65.9%
32 / 122 | Qwen2 72B Instruct · 72.7B | 64.4%
33 / 126 | Qwen2.5 14B · 14.8B | 63.7%
34 / 127 | DeepSeek Coder v2 Instruct · 235.7B | 63.6%
35 / 128 | Higgs Llama 3 70B · 70.6B | 63.2%
36 / 131 | Llama 3.1 70B Instruct · 70.6B | 62.8%
37 / 132 | Llama 3.1 Nemotron 70B Instruct HF · 70.6B | 62.8%
38 / 136 | Qwen3 30B A3B Base · 30.5B | 61.7%
39 / 137 | Llama 3.1 405B · 405.9B | 61.6%
40 / 138 | Gemma 3 12B IT · 12.2B | 60.6%
41 / 141 | Reflection Llama 3.1 70B · 70B | 60.4%
42 / 146 | MiMo 7B RL · 7.8B | 58.6%
43 / 149 | Internlm3 8B Instruct · 8.8B | 57.6%
44 / 152 | Gemma 2 27B IT · 27.2B | 56.5%
45 / 154 | Meta Llama 3 70B Instruct · 70.6B | 56.2%
46 / 163 | Qwen1.5 72B Chat · 72.3B | 52.6%
47 / 164 | Llama 3.1 70B · 70.6B | 52.5%
48 / 165 | Yi 1.5 34B Chat · 34.4B | 52.3%
49 / 166 | Gemma 2 9B IT · 9.2B | 52.1%
50 / 171 | Mistral Small Instruct 2409 · 22.2B | 48.4%
51 / 174 | Phi 3.5 Mini Instruct · 3.8B | 47.9%
52 / 181 | Gemma 2 9B · 9.2B | 45.1%
53 / 182 | Qwen2.5 7B · 7.6B | 45.0%
54 / 183 | Mistral Nemo Instruct 2407 · 12.2B | 44.8%
55 / 184 | Llama 3.1 8B Instruct · 8.0B | 44.3%
56 / 187 | Qwen2.5 3B · 3.1B | 43.7%
57 / 188 | Gemma3 4B IT · 4B | 43.6%
58 / 190 | Mixtral 8x7B Instruct v0.1 · 46.7B | 43.3%
59 / 191 | Yi 34B · 34.4B | 43.0%
60 / 194 | MiMo 7B Base · 7.8B | 41.9%
61 / 195 | DeepSeek Coder v2 Lite Instruct · 15.7B | 41.6%
62 / 197 | Mixtral 8x7B v0.1 · 46.7B | 41.0%
63 / 202 | WizardLM 2 8x22B · 140.6B | 39.2%
64 / 204 | Yi 1.5 6B Chat · 6.1B | 38.2%
65 / 205 | Qwen1.5 14B Chat · 14.2B | 38.0%
66 / 207 | C4ai Command R V01 · 35.0B | 37.9%
67 / 209 | Llama 2 70B HF · 69.0B | 37.5%
68 / 214 | Llama 3.1 8B · 8.0B | 36.6%
69 / 217 | DeepSeek Coder v2 Lite Base · 15.7B | 34.4%
70 / 218 | Aya Expanse 8B · 8.0B | 33.7%
71 / 219 | Gemma 7B · 8.5B | 33.7%
72 / 222 | Zephyr 7B Beta · 7.2B | 33.0%
73 / 223 | Qwen2.5 1.5B · 1.5B | 32.1%
74 / 226 | Mistral 7B v0.1 · 7B | 30.9%
75 / 227 | Mistral 7B Instruct v0.2 · 7B | 30.8%
76 / 228 | Mistral 7B v0.2 · 7.2B | 30.4%
77 / 229 | Qwen3.5 0.8B · 873M | 29.7%
78 / 230 | Qwen1.5 7B Chat · 7.7B | 29.1%
79 / 231 | Yi 6B Chat · 6.1B | 28.8%
80 / 233 | Yi 6B · 6.1B | 26.5%
81 / 235 | Mistral 7B Instruct v0.1 · 7B | 25.8%
82 / 237 | Llama 2 13B HF · 13.0B | 25.3%
83 / 239 | Llemma 7B · 7B | 23.4%
84 / 241 | Qwen2 1.5B · 1.5B | 22.6%
85 / 242 | Llama 3.2 3B · 3.2B | 22.2%
86 / 245 | Llama 2 7B · 7B | 20.3%
87 / 246 | SmolLM2 1.7B · 1.7B | 18.3%
88 / 248 | Gemma 2B · 2.5B | 15.8%
89 / 249 | Gemma 2 2B IT · 2.6B | 15.6%
90 / 251 | Qwen2.5 0.5B · 494M | 14.9%
91 / 252 | Gemma 3 1B IT · 1000M | 14.7%
92 / 254 | Granite 3.1 1B A400m Base · 1.3B | 12.3%
93 / 255 | Llama 3.2 1B · 1.2B | 11.9%
94 / 256 | SmolLM 1.7B · 1.7B | 11.9%
95 / 257 | SmolLM2 360M · 362M | 11.4%
96 / 258 | SmolLM 135M · 135M | 11.2%
97 / 260 | SmolLM2 135M · 135M | 10.8%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: MMLU-Pro score (y-axis, 10.8% to 88.0%) against model size (x-axis, log scale, ticks at 1B, 10B, 100B, 1T), with one labeled point per open model.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
Models on the efficiency frontier (best score at their size or smaller):
  • SmolLM 135M · 135M · 11.2%
  • SmolLM2 360M · 362M · 11.4%
  • Qwen2.5 0.5B · 494M · 14.9%
  • Qwen3.5 0.8B · 873M · 29.7%
  • Qwen2.5 1.5B · 1.5B · 32.1%
  • Qwen2.5 3B · 3.1B · 43.7%
  • Phi 3.5 Mini Instruct · 3.8B · 47.9%
  • Qwen3.5 4B · 4.7B · 79.1%
  • Qwen3.5 9B · 9.7B · 82.5%
  • Seed OSS 36B Instruct · 36.2B · 82.7%
  • Qwen3.5 122B A10B · 125.1B · 86.7%
  • MiniMax M2.1 · 228.7B · 88.0%
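
The frontier rule ("best score at each size or smaller") can be sketched in a few lines of Python. This is only an illustration, not the site's actual code: the function name `efficiency_frontier` is hypothetical, the sample rows are a small subset of the leaderboard, and the tie handling (a model must strictly beat every smaller or equal-sized model) is an assumption.

```python
def efficiency_frontier(models):
    """Return the models that score strictly higher than every model
    of equal or smaller size.

    `models` is a list of (name, size_in_billions, score_percent) tuples.
    """
    frontier = []
    best_score = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first
    # so that only the best model of that size can enter the frontier.
    for name, size_b, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:  # assumption: ties do not join the frontier
            frontier.append((name, size_b, score))
            best_score = score
    return frontier


# A small subset of the leaderboard rows (sizes in billions of parameters).
models = [
    ("SmolLM 135M", 0.135, 11.2),
    ("SmolLM2 135M", 0.135, 10.8),
    ("SmolLM2 360M", 0.362, 11.4),
    ("Qwen2.5 0.5B", 0.494, 14.9),
    ("Qwen3.5 0.8B", 0.873, 29.7),
    ("Gemma 3 1B IT", 1.0, 14.7),
    ("Llama 3.2 1B", 1.2, 11.9),
    ("Qwen2.5 1.5B", 1.5, 32.1),
]

for name, size_b, score in efficiency_frontier(models):
    print(f"{name} · {size_b}B · {score}%")
```

On this subset, Gemma 3 1B IT and Llama 3.2 1B drop out because the smaller Qwen3.5 0.8B already scores higher than both.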

MMLU-Pro: frequently asked questions

What is the best open LLM on MMLU-Pro?
MiniMax M2.1 is the top open model on MMLU-Pro, scoring 88.0%. Among all models tested — including proprietary ones — it ranks #6. The top model overall is Gemini-3.1-Pro (Google) at 91.2%.
What's the best MMLU-Pro model you can run on a 24 GB GPU?
Seed OSS 36B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 20 GB), scoring 82.7% on MMLU-Pro.
What's the best MMLU-Pro model you can run on a 12 GB GPU?
Qwen3.5 9B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 5 GB), scoring 82.5% on MMLU-Pro.
Can open models match proprietary models on MMLU-Pro?
Not quite: the strongest proprietary model (Gemini-3.1-Pro) scores 91.2%, 3.2 points ahead of the best open model (MiniMax M2.1) at 88.0%. The open model, however, is one you can run yourself.
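
The GPU answers above depend on how much memory a model needs at 4-bit quantization. The page does not state its formula, so the following is a rough sketch under stated assumptions: weights take `bits / 8` bytes per parameter, a hypothetical 10% overhead factor covers everything else, and 1 GB means 10^9 bytes. The function names `estimate_gb` and `best_fit` are illustrative only.

```python
def estimate_gb(params_billion, bits=4, overhead=1.1):
    """Rough memory estimate in GB for a quantized model.

    Assumption: weights dominate, at bits/8 bytes per parameter, and
    `overhead` (a guessed 10%) covers runtime buffers. Real usage also
    depends on context length and the inference runtime.
    """
    return params_billion * bits / 8 * overhead


def best_fit(models, vram_gb, bits=4):
    """Return the highest-scoring (name, size, score) tuple that fits, or None."""
    fitting = [m for m in models if estimate_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)


# A few leaderboard rows: (name, size in billions of parameters, score in %).
sample = [
    ("Qwen3.5 122B A10B", 125.1, 86.7),
    ("Seed OSS 36B Instruct", 36.2, 82.7),
    ("Qwen3.5 9B", 9.7, 82.5),
    ("Phi 4 Reasoning Plus", 14.7, 76.0),
]

for vram in (24, 12):
    name, size_b, score = best_fit(sample, vram)
    print(f"{vram} GB: {name} (~{estimate_gb(size_b):.0f} GB at 4-bit, {score}%)")
```

Under these assumptions the estimates land close to the figures quoted in the answers (about 20 GB for Seed OSS 36B Instruct and about 5 GB for Qwen3.5 9B), but the overhead factor is a guess and should be checked against your own runtime.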

Scores are aggregated from tigerlab. llmrun does not run this benchmark; see the source for methodology, or the about-benchmarks page for what it measures.