
MMLU-Pro Leaderboard

MMLU-Pro is a harder, cleaned-up successor to MMLU with ten answer choices and more reasoning-heavy questions across 14 subjects, measuring broad knowledge and reasoning together.

Source: tigerlab · 97 open models ranked + 163 proprietary

All models ranked on MMLU-Pro

Proprietary / closed models are shown dimmed — you can't run them locally, but they show where the open field stands.

# · Model · Size or proprietary · Score
1 · Gemini-3.1-Pro · proprietary · 91.2%
2 · Gemini-3-Pro(11/25) · proprietary · 90.1%
3 · GPT-o1 · proprietary · 89.3%
4 · Claude-4.6-Opus(Thinking) · proprietary · 89.1%
5 · Gemini-3-Flash(12/25) · proprietary · 88.6%
6 · MiniMax M2.1 · 228.7B · 88.0%
7 · Qwen3.5-397B-A17B · proprietary · 87.8%
8 · Seed2.0-Lite · proprietary · 87.7%
9 · GPT-5.4 · proprietary · 87.5%
10 · Claude-4.5-Sonnet(Thinking) · proprietary · 87.4%
11 · GPT-5.2 · proprietary · 87.4%
12 · Claude-4-Opus-Thinking · proprietary · 87.3%
13 · Claude-4.5-Opus(Thinking) · proprietary · 87.3%
14 · Claude-4.6-Sonnet(Thinking) · proprietary · 87.3%
15 · Hunyuan-T1 · proprietary · 87.2%
16 · GPT-5(high) · proprietary · 87.1%
17 · K2.5-1T-A32B · proprietary · 87.1%
18 · Grok-4 · proprietary · 87.0%
19 · Seed-Thinking-v1.5 · proprietary · 87.0%
20 · Seed2.0-Pro · proprietary · 87.0%
21 · Qwen3.5 122B A10B · 125.1B · 86.7%
22 · Seed1.6-Base · proprietary · 86.6%
23 · Seed1.6-Thinking · proprietary · 86.6%
24 · GPT-5.1 · proprietary · 86.4%
25 · Seed1.6-Ada-Thinking · proprietary · 86.4%
26 · Gemini-3.1-Flash-Lite-Preview · proprietary · 86.2%
27 · GPT-4.5 · proprietary · 86.1%
28 · Qwen3.5-27B · proprietary · 86.1%
29 · Gemini-2.5-Pro · proprietary · 86.0%
30 · GLM 5 · 753.9B · 86.0%
31 · Qwen3-Max-Thinking · proprietary · 85.7%
32 · Qwen3.5-35B-A3B · proprietary · 85.3%
33 · DeepSeek-V3.2-Thinking · proprietary · 85.0%
34 · GPT-o3-high · proprietary · 85.0%
35 · DeepSeek-V3.1-Thinking · proprietary · 84.8%
36 · GLM 4.5 · 358.3B · 84.6%
37 · Gemini-2.5-Pro-Exp-03-25 · proprietary · 84.5%
38 · Qwen3 235B A22B Thinking 2507 · 235.1B · 84.5%
39 · Grok-4.1-Fast(Reasoning) · proprietary · 84.2%
40 · Claude-3.7-Sonnet-Thinking · proprietary · 84.0%
41 · DeepSeek R1 · 684.5B · 84.0%
42 · Claude-4-Sonnet · proprietary · 83.7%
43 · DeepSeek-V3.1-NonThinking · proprietary · 83.7%
44 · Seed2.0-Mini · proprietary · 83.6%
45 · Intern-S1 · proprietary · 83.5%
46 · DeepSeek R1 0528 · 684.5B · 83.4%
47 · GPT-4-mini (high) · proprietary · 83.0%
48 · Grok-3-mini · proprietary · 83.0%
49 · Qwen3 235B A22B Instruct 2507 · 235.1B · 83.0%
50 · Llama4-Behemoth · proprietary · 82.8%
51 · LongCat Flash Chat · 561.9B · 82.7%
52 · Seed OSS 36B Instruct · 36.2B · 82.7%
53 · Qwen3.5 9B · 9.7B · 82.5%
54 · MiniMax M2 · 228.7B · 82.0%
55 · GPT-4.1 · proprietary · 81.8%
56 · GLM 4.5 Air · 110.5B · 81.4%
57 · DeepSeek v3 0324 · 684.5B · 81.3%
58 · MiniMax-M1 · proprietary · 81.1%
59 · Kimi K2 Instruct · 1026.5B · 81.0%
60 · Qwen3 30B A3B Thinking 2507 · 30.5B · 80.9%
61 · GPT-oss-120B(high) · proprietary · 80.8%
62 · Llama4-Maverick · proprietary · 80.5%
63 · GPT-o1-mini · proprietary · 80.3%
64 · Doubao-1.5-Pro · proprietary · 80.1%
65 · MiniMax M2.5 · 228.7B · 80.1%
66 · Grok3-Beta · proprietary · 79.9%
67 · GPT-o3-mini · proprietary · 79.4%
68 · Gemini-2.0-Pro · proprietary · 79.1%
69 · Qwen3.5 4B · 4.7B · 79.1%
70 · HunyuanTurboS · proprietary · 79.0%
71 · Grok3-mini-Beta · proprietary · 78.9%
72 · Qwen3-30B-A3B-Thinking · proprietary · 78.5%
73 · ERNIE-4.5-300B-A47B · proprietary · 78.4%
74 · Nemotron-3-Nano-30B-A3B(BF16) · proprietary · 78.3%
75 · Nemotron-3-Nano-30B-A3B(FP8) · proprietary · 78.1%
76 · Claude-3.5-Sonnet (2024-10-22) · proprietary · 78.0%
77 · GPT-4o (2024-11-20) · proprietary · 77.9%
78 · Gemini-2.0-Flash · proprietary · 77.6%
79 · Gemini-2.0-Flash-exp · proprietary · 76.2%
80 · Claude-3.5-Sonnet (2024-06-20) · proprietary · 76.1%
81 · Qwen2.5-Max · proprietary · 76.1%
82 · Phi 4 Reasoning Plus · 14.7B · 76.0%
83 · DeepSeek v3 · 684.5B · 75.9%
84 · MiniMax Text 01 · 456.1B · 75.7%
85 · Grok-2 · proprietary · 75.5%
86 · Grok-4.1-Fast(Non-Reasoning) · proprietary · 75.2%
87 · GPT-4o (2024-08-06) · proprietary · 74.7%
88 · Llama4-Scout · proprietary · 74.3%
89 · Phi 4 Reasoning · 14.7B · 74.3%
90 · GPT-oss-20B(high) · proprietary · 73.6%
91 · Llama 3.1 405B Instruct · 405.9B · 73.3%
92 · GPT-oss-20B(medium) · proprietary · 73.1%
93 · Athene-V2-Chat (0-shot) · proprietary · 73.1%
94 · GPT-4o (2024-05-13) · proprietary · 72.5%
95 · Grok-2-mini · proprietary · 71.9%
96 · Gemini-2.0-Flash-Lite · proprietary · 71.6%
97 · Qwen2.5 72B · 72.7B · 71.6%
98 · ECHO_Ego_v2_14B · proprietary · 71.2%
99 · QwQ 32B Preview · 32.8B · 71.0%
100 · Phi 4 · 14.7B · 70.4%
101 · Gemini-1.5-Pro-002 · proprietary · 70.3%
102 · Athene-V2-Chat · proprietary · 70.2%
103 · ERNIE-4.5-300B-A47B-Base · proprietary · 69.5%
104 · Qwen2.5 32B · 32.8B · 69.2%
105 · SkyThought-T1 · proprietary · 69.2%
106 · QwQ 32B · 32.8B · 69.1%
107 · Gemini-1.5-Pro · proprietary · 69.0%
108 · Claude-3-Opus · proprietary · 68.5%
109 · Qwen3 235B A22B · 235.1B · 68.2%
110 · Mistral-Large-Instruct-2411 · proprietary · 67.9%
111 · Gemma 3 27B IT · 27.4B · 67.5%
112 · Hunyuan-A13B · proprietary · 67.3%
113 · Mistral-3.1-Small · proprietary · 66.8%
114 · General-Reasoner-14B · proprietary · 66.6%
115 · Mistral-Small-instruct · proprietary · 66.3%
116 · Llama 3.3 70B Instruct · 70.6B · 65.9%
117 · Mistral-Large-Instruct-2407 · proprietary · 65.9%
118 · DeepSeek-Chat-V2_5 · proprietary · 65.8%
119 · Nemotron-3-Nano-30B-A3B-Base · proprietary · 65.1%
120 · Seed-OSS-36B-Base(w/ syn.) · proprietary · 65.1%
121 · Reka 3 · proprietary · 65.0%
122 · Qwen2 72B Instruct · 72.7B · 64.4%
123 · Gemini-1.5-Flash-002 · proprietary · 64.1%
124 · magnum-72b-v1 · proprietary · 63.9%
125 · GPT-4-Turbo · proprietary · 63.7%
126 · Qwen2.5 14B · 14.8B · 63.7%
127 · DeepSeek Coder v2 Instruct · 235.7B · 63.6%
128 · Higgs Llama 3 70B · 70.6B · 63.2%
129 · GPT-4o-mini · proprietary · 63.1%
130 · azerogpt · proprietary · 63.1%
131 · Llama 3.1 70B Instruct · 70.6B · 62.8%
132 · Llama 3.1 Nemotron 70B Instruct HF · 70.6B · 62.8%
133 · Yi-Lightning · proprietary · 62.4%
134 · Claude-3-5-Haiku-20241022 · proprietary · 62.1%
135 · RRD2.5-9B · proprietary · 61.8%
136 · Qwen3 30B A3B Base · 30.5B · 61.7%
137 · Llama 3.1 405B · 405.9B · 61.6%
138 · Gemma 3 12B IT · 12.2B · 60.6%
139 · Nemotron-H-56B-Base · proprietary · 60.5%
140 · Seed-OSS-36B-Base(w/o syn.) · proprietary · 60.4%
141 · Reflection Llama 3.1 70B · 70B · 60.4%
142 · Hunyuan-Large · proprietary · 60.2%
143 · Gemini-1.5-Flash · proprietary · 59.1%
144 · EXAONE-3.5-32B-Instruct · proprietary · 58.9%
145 · General-Reasoner-7B · proprietary · 58.9%
146 · MiMo 7B RL · 7.8B · 58.6%
147 · Yi-large · proprietary · 58.1%
148 · NewenAI/Phi4-sft · proprietary · 57.7%
149 · Internlm3 8B Instruct · 8.8B · 57.6%
150 · Claude-3-Sonnet · proprietary · 56.8%
151 · ERNIE-4.5-21B-A3B-Base · proprietary · 56.7%
152 · Gemma 2 27B IT · 27.2B · 56.5%
153 · Mixtral-8x22B-Instruct-v0.1 · proprietary · 56.3%
154 · Meta Llama 3 70B Instruct · 70.6B · 56.2%
155 · Phi3-medium-4k · proprietary · 55.7%
156 · Qwen2.5-Turbo · proprietary · 55.6%
157 · Qwen2-72B-32k · proprietary · 55.6%
158 · Qwen3.5-2B · proprietary · 55.3%
159 · Deepseek-V2-Chat · proprietary · 54.8%
160 · Mistral-Small-base · proprietary · 54.4%
161 · Phi-4-mini · proprietary · 52.8%
162 · Llama-3-70B · proprietary · 52.8%
163 · Qwen1.5 72B Chat · 72.3B · 52.6%
164 · Llama 3.1 70B · 70.6B · 52.5%
165 · Yi 1.5 34B Chat · 34.4B · 52.3%
166 · Gemma 2 9B IT · 9.2B · 52.1%
167 · Phi3-medium-128k · proprietary · 51.9%
168 · MAmmoTH2-8x7B-Plus · proprietary · 50.4%
169 · Qwen1.5-110B · proprietary · 49.9%
170 · Jamba-1.5-Large · proprietary · 49.5%
171 · Mistral Small Instruct 2409 · 22.2B · 48.4%
172 · GLM-4-9B-Chat · proprietary · 48.0%
173 · GLM-4-9B · proprietary · 47.9%
174 · Phi 3.5 Mini Instruct · 3.8B · 47.9%
175 · Qwen2-7B-Instruct · proprietary · 47.2%
176 · Cohere-Aya-Vision · proprietary · 47.2%
177 · EXAONE-3.5-7.8B-Instruct · proprietary · 46.2%
178 · Yi-1.5-9B-Chat · proprietary · 46.0%
179 · Phi3-mini-4k · proprietary · 45.7%
180 · Aya-Expanse-32B · proprietary · 45.4%
181 · Gemma 2 9B · 9.2B · 45.1%
182 · Qwen2.5 7B · 7.6B · 45.0%
183 · Mistral Nemo Instruct 2407 · 12.2B · 44.8%
184 · Llama 3.1 8B Instruct · 8.0B · 44.3%
185 · Nemotron-H-8B-Base · proprietary · 44.0%
186 · Phi3-mini-128k · proprietary · 43.9%
187 · Qwen2.5 3B · 3.1B · 43.7%
188 · Gemma3 4B IT · 4B · 43.6%
189 · MAmmoTH2-8B-Plus · proprietary · 43.4%
190 · Mixtral 8x7B Instruct v0.1 · 46.7B · 43.3%
191 · Yi 34B · 34.4B · 43.0%
192 · Claude-3-Haiku-20240307 · proprietary · 42.3%
193 · Mathstral-7B-v0.1 · proprietary · 42.0%
194 · MiMo 7B Base · 7.8B · 41.9%
195 · DeepSeek Coder v2 Lite Instruct · 15.7B · 41.6%
196 · Granite-3.1-8B-Instruct · proprietary · 41.0%
197 · Mixtral 8x7B v0.1 · 46.7B · 41.0%
198 · Llama-3-8B-Instruct · proprietary · 41.0%
199 · MAmmoTH2-7B-Plus · proprietary · 40.8%
200 · Qwen2-7B · proprietary · 40.7%
201 · Mistral-Nemo-Base-2407 · proprietary · 39.8%
202 · WizardLM 2 8x22B · 140.6B · 39.2%
203 · EXAONE-3.5-2.4B-Instruct · proprietary · 39.1%
204 · Yi 1.5 6B Chat · 6.1B · 38.2%
205 · Qwen1.5 14B Chat · 14.2B · 38.0%
206 · Ministral-8B-Instruct-2410 · proprietary · 37.9%
207 · C4ai Command R V01 · 35.0B · 37.9%
208 · Staring-7B · proprietary · 37.9%
209 · Llama 2 70B HF · 69.0B · 37.5%
210 · OpenChat-3.5-8B · proprietary · 37.2%
211 · InternMath-20B-Plus · proprietary · 37.1%
212 · LLaDA · proprietary · 37.0%
213 · Llama3-Smaug-8B · proprietary · 36.9%
214 · Llama 3.1 8B · 8.0B · 36.6%
215 · Llama-3-8B · proprietary · 35.4%
216 · DeepseekMath-7B-Instruct · proprietary · 35.3%
217 · DeepSeek Coder v2 Lite Base · 15.7B · 34.4%
218 · Aya Expanse 8B · 8.0B · 33.7%
219 · Gemma 7B · 8.5B · 33.7%
220 · InternMath-7B-Plus · proprietary · 33.5%
221 · Granite-3.1-8B-Base · proprietary · 33.1%
222 · Zephyr 7B Beta · 7.2B · 33.0%
223 · Qwen2.5 1.5B · 1.5B · 32.1%
224 · Granite-3.1-2B-Instruct · proprietary · 32.0%
225 · Granite-3.0-8B-Base · proprietary · 31.0%
226 · Mistral 7B v0.1 · 7B · 30.9%
227 · Mistral 7B Instruct v0.2 · 7B · 30.8%
228 · Mistral 7B v0.2 · 7.2B · 30.4%
229 · Qwen3.5 0.8B · 873M · 29.7%
230 · Qwen1.5 7B Chat · 7.7B · 29.1%
231 · Yi 6B Chat · 6.1B · 28.8%
232 · Neo-7B-Instruct · proprietary · 28.7%
233 · Yi 6B · 6.1B · 26.5%
234 · Neo-7B · proprietary · 25.9%
235 · Mistral 7B Instruct v0.1 · 7B · 25.8%
236 · Granite-3.1-3B-A800M-Instruct · proprietary · 25.4%
237 · Llama 2 13B HF · 13.0B · 25.3%
238 · Granite-3.1-2B-Base · proprietary · 23.9%
239 · Llemma 7B · 7B · 23.4%
240 · Qwen2-1.5B-Instruct · proprietary · 22.6%
241 · Qwen2 1.5B · 1.5B · 22.6%
242 · Llama 3.2 3B · 3.2B · 22.2%
243 · Granite-3.0-2B-Base · proprietary · 21.7%
244 · Granite-3.1-3B-A800M-Base · proprietary · 20.4%
245 · Llama 2 7B · 7B · 20.3%
246 · SmolLM2 1.7B · 1.7B · 18.3%
247 · Qwen2-0.5B-Instruct · proprietary · 15.9%
248 · Gemma 2B · 2.5B · 15.8%
249 · Gemma 2 2B IT · 2.6B · 15.6%
250 · Qwen2-0.5B · proprietary · 15.0%
251 · Qwen2.5 0.5B · 494M · 14.9%
252 · Gemma 3 1B IT · 1000M · 14.7%
253 · Granite-3.1-1B-A400M-Instruct · proprietary · 13.3%
254 · Granite 3.1 1B A400m Base · 1.3B · 12.3%
255 · Llama 3.2 1B · 1.2B · 11.9%
256 · SmolLM 1.7B · 1.7B · 11.9%
257 · SmolLM2 360M · 362M · 11.4%
258 · SmolLM 135M · 135M · 11.2%
259 · SmolLM-360M · proprietary · 10.9%
260 · SmolLM2 135M · 135M · 10.8%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: MMLU-Pro score (10.8% to 88.0%) against model size (1B to 1T, log scale) for the open models; each point is labeled model · size · score.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
Models on the efficiency frontier (best score at their size or smaller):
  • SmolLM 135M · 135M · 11.2%
  • SmolLM2 360M · 362M · 11.4%
  • Qwen2.5 0.5B · 494M · 14.9%
  • Qwen3.5 0.8B · 873M · 29.7%
  • Qwen2.5 1.5B · 2B · 32.1%
  • Qwen2.5 3B · 3B · 43.7%
  • Phi 3.5 Mini Instruct · 4B · 47.9%
  • Qwen3.5 4B · 5B · 79.1%
  • Qwen3.5 9B · 10B · 82.5%
  • Seed OSS 36B Instruct · 36B · 82.7%
  • Qwen3.5 122B A10B · 125B · 86.7%
  • MiniMax M2.1 · 229B · 88.0%
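
The frontier described above can be reproduced with a single pass over the models sorted by size. The following is a minimal sketch, not the site's actual code: the `Model` type and the `efficiency_frontier` function are hypothetical names, and the sample rows are a subset of the ranked table (sizes in billions of parameters). It assumes a model is on the frontier only if it strictly beats every model of equal or smaller size.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions
    score: float     # MMLU-Pro score in percent


def efficiency_frontier(models):
    """Keep each model that strictly beats every model of equal or smaller size."""
    frontier = []
    best = float("-inf")
    # Smallest first; at equal size the higher score comes first and wins the tie.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best:
            frontier.append(m)
            best = m.score
    return frontier


# Subset of the open models in the table above (illustrative, not the full list).
MODELS = [
    Model("SmolLM2 135M", 0.135, 10.8),
    Model("SmolLM 135M", 0.135, 11.2),
    Model("SmolLM2 360M", 0.362, 11.4),
    Model("Qwen2.5 0.5B", 0.494, 14.9),
    Model("Qwen3.5 0.8B", 0.873, 29.7),
    Model("Gemma 3 1B IT", 1.0, 14.7),
    Model("Llama 3.2 1B", 1.2, 11.9),
    Model("Qwen2.5 1.5B", 1.5, 32.1),
    Model("Qwen2.5 3B", 3.1, 43.7),
    Model("Phi 3.5 Mini Instruct", 3.8, 47.9),
    Model("Qwen3.5 4B", 4.7, 79.1),
    Model("Llama 3.1 8B Instruct", 8.0, 44.3),
    Model("Qwen3.5 9B", 9.7, 82.5),
    Model("Qwen3 30B A3B Thinking 2507", 30.5, 80.9),
    Model("Seed OSS 36B Instruct", 36.2, 82.7),
    Model("GLM 4.5 Air", 110.5, 81.4),
    Model("Qwen3.5 122B A10B", 125.1, 86.7),
    Model("MiniMax M2.1", 228.7, 88.0),
]

FRONTIER = efficiency_frontier(MODELS)
for m in FRONTIER:
    print(f"{m.name} · {m.params_b}B · {m.score}%")
```

Sorting by size and tracking the running best score keeps the pass linear after the sort, and the strict comparison means a larger model that merely ties a smaller one is left off the frontier.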

MMLU-Pro: frequently asked questions

What is the best open LLM on MMLU-Pro?
MiniMax M2.1 is the top open model on MMLU-Pro, scoring 88.0%. Among all models tested — including proprietary ones — it ranks #6. The top model overall is Gemini-3.1-Pro (Google) at 91.2%.
What's the best MMLU-Pro model you can run on a 24 GB GPU?
Seed OSS 36B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 20 GB), scoring 82.7% on MMLU-Pro.
What's the best MMLU-Pro model you can run on a 12 GB GPU?
Qwen3.5 9B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 5 GB), scoring 82.5% on MMLU-Pro.
Can open models match proprietary models on MMLU-Pro?
Not quite: the strongest proprietary model (Gemini-3.1-Pro) scores 91.2%, ahead of the best open model (MiniMax M2.1) at 88.0%, a gap of 3.2 points. You can, however, run the open one yourself.
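
The two GPU answers above rest on an estimate of how much memory a model needs at 4-bit quantization. The page does not say how that figure is computed, so the following is only a rough sketch: it assumes half a byte per parameter at 4 bits plus a 10% overhead factor, a value chosen here because it agrees with the page's rounded figures (about 20 GB for 36.2B parameters, about 5 GB for 9.7B). The function names and the `OPEN_MODELS` sample are hypothetical, and real usage also depends on context length and runtime.

```python
def quantized_size_gb(params_b, bits=4, overhead=1.1):
    """Approximate memory in GB: parameters (billions) * bits / 8, times an assumed overhead."""
    return params_b * bits / 8 * overhead


def best_fit(models, vram_gb, bits=4):
    """Return the highest-scoring (name, params_b, score) tuple whose estimate fits in vram_gb."""
    fitting = [m for m in models if quantized_size_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


# (name, parameters in billions, MMLU-Pro score in percent), taken from the table above.
OPEN_MODELS = [
    ("MiniMax M2.1", 228.7, 88.0),
    ("Qwen3.5 122B A10B", 125.1, 86.7),
    ("Seed OSS 36B Instruct", 36.2, 82.7),
    ("Qwen3.5 9B", 9.7, 82.5),
    ("Qwen3 30B A3B Thinking 2507", 30.5, 80.9),
    ("Phi 4 Reasoning Plus", 14.7, 76.0),
]

for budget in (24, 12):
    name, params_b, score = best_fit(OPEN_MODELS, budget)
    size = quantized_size_gb(params_b)
    print(f"{budget} GB: {name} (~{size:.1f} GB at 4-bit, {score}%)")
```

Raising `overhead` or lowering the budget leaves headroom for the KV cache and activations, which this estimate does not model.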

Scores aggregated from tigerlab. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.