What is the best open LLM on MMLU-Pro?

MiniMax M2.1 is the top open model on MMLU-Pro, scoring 88.0%. Among all models tested — including proprietary ones — it ranks #6. The top model overall is Gemini-3.1-Pro (Google) at 91.2%.

What's the best MMLU-Pro model you can run on a 24 GB GPU?

Seed OSS 36B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 20 GB), scoring 82.7% on MMLU-Pro.

What's the best MMLU-Pro model you can run on a 12 GB GPU?

Qwen3.5 9B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 5 GB), scoring 82.5% on MMLU-Pro.
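The "about 20 GB" and "about 5 GB" figures in the two answers above are consistent with simple arithmetic on the parameter counts listed in the table (36.2B and 9.7B). A minimal sketch of that arithmetic follows. The 10% overhead factor and the function names are assumptions for illustration, not the method the site uses, and real memory use also depends on context length and runtime.

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits(params_billion: float, vram_gb: float,
         bits_per_weight: float = 4.0, overhead: float = 1.1) -> bool:
    """True if the weights, scaled by an assumed overhead factor, fit in vram_gb.

    The 1.1 overhead is a hypothetical allowance for runtime buffers,
    not a figure taken from the leaderboard.
    """
    return quantized_weight_gb(params_billion, bits_per_weight) * overhead <= vram_gb


# Parameter counts (in billions) are taken from the ranking table.
for name, size_b, vram in [("Seed OSS 36B Instruct", 36.2, 24),
                           ("Qwen3.5 9B", 9.7, 12)]:
    weights = quantized_weight_gb(size_b)
    print(f"{name}: ~{weights:.1f} GB of weights, fits in {vram} GB: {fits(size_b, vram)}")
```

Under these assumptions, the 36.2B model needs roughly 18 GB for weights alone, which is why it fits a 24 GB card but not a 12 GB one.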

Can open models match proprietary models on MMLU-Pro?

Not quite: the strongest proprietary model (Gemini-3.1-Pro) scores 91.2%, 3.2 points ahead of the best open model (MiniMax M2.1) at 88.0%, but you can run the open one yourself.


MMLU-Pro Leaderboard

Name: MMLU-Pro — open LLM scores
Creator: tigerlab

MMLU-Pro is a harder, cleaned-up successor to MMLU with ten answer choices and more reasoning-heavy questions across 14 subjects, measuring broad knowledge and reasoning together.

Source: tigerlab · 97 open models ranked + 163 proprietary


Open models ranked on MMLU-Pro

The # column shows rank among open models / rank overall (including proprietary models).

#	Model	Score
1 / 6	MiniMax M2.1 · 228.7B	88.0%
2 / 21	Qwen3.5 122B A10B · 125.1B	86.7%
3 / 30	GLM 5 · 753.9B	86.0%
4 / 36	GLM 4.5 · 358.3B	84.6%
5 / 38	Qwen3 235B A22B Thinking 2507 · 235.1B	84.5%
6 / 41	DeepSeek R1 · 684.5B	84.0%
7 / 46	DeepSeek R1 0528 · 684.5B	83.4%
8 / 49	Qwen3 235B A22B Instruct 2507 · 235.1B	83.0%
9 / 51	LongCat Flash Chat · 561.9B	82.7%
10 / 52	Seed OSS 36B Instruct · 36.2B	82.7%
11 / 53	Qwen3.5 9B · 9.7B	82.5%
12 / 54	MiniMax M2 · 228.7B	82.0%
13 / 56	GLM 4.5 Air · 110.5B	81.4%
14 / 57	DeepSeek v3 0324 · 684.5B	81.3%
15 / 59	Kimi K2 Instruct · 1026.5B	81.0%
16 / 60	Qwen3 30B A3B Thinking 2507 · 30.5B	80.9%
17 / 65	MiniMax M2.5 · 228.7B	80.1%
18 / 69	Qwen3.5 4B · 4.7B	79.1%
19 / 82	Phi 4 Reasoning Plus · 14.7B	76.0%
20 / 83	DeepSeek v3 · 684.5B	75.9%
21 / 84	MiniMax Text 01 · 456.1B	75.7%
22 / 89	Phi 4 Reasoning · 14.7B	74.3%
23 / 91	Llama 3.1 405B Instruct · 405.9B	73.3%
24 / 97	Qwen2.5 72B · 72.7B	71.6%
25 / 99	QwQ 32B Preview · 32.8B	71.0%
26 / 100	Phi 4 · 14.7B	70.4%
27 / 104	Qwen2.5 32B · 32.8B	69.2%
28 / 106	QwQ 32B · 32.8B	69.1%
29 / 109	Qwen3 235B A22B · 235.1B	68.2%
30 / 111	Gemma 3 27B IT · 27.4B	67.5%
31 / 116	Llama 3.3 70B Instruct · 70.6B	65.9%
32 / 122	Qwen2 72B Instruct · 72.7B	64.4%
33 / 126	Qwen2.5 14B · 14.8B	63.7%
34 / 127	DeepSeek Coder v2 Instruct · 235.7B	63.6%
35 / 128	Higgs Llama 3 70B · 70.6B	63.2%
36 / 131	Llama 3.1 70B Instruct · 70.6B	62.8%
37 / 132	Llama 3.1 Nemotron 70B Instruct HF · 70.6B	62.8%
38 / 136	Qwen3 30B A3B Base · 30.5B	61.7%
39 / 137	Llama 3.1 405B · 405.9B	61.6%
40 / 138	Gemma 3 12B IT · 12.2B	60.6%
41 / 141	Reflection Llama 3.1 70B · 70B	60.4%
42 / 146	MiMo 7B RL · 7.8B	58.6%
43 / 149	Internlm3 8B Instruct · 8.8B	57.6%
44 / 152	Gemma 2 27B IT · 27.2B	56.5%
45 / 154	Meta Llama 3 70B Instruct · 70.6B	56.2%
46 / 163	Qwen1.5 72B Chat · 72.3B	52.6%
47 / 164	Llama 3.1 70B · 70.6B	52.5%
48 / 165	Yi 1.5 34B Chat · 34.4B	52.3%
49 / 166	Gemma 2 9B IT · 9.2B	52.1%
50 / 171	Mistral Small Instruct 2409 · 22.2B	48.4%
51 / 174	Phi 3.5 Mini Instruct · 3.8B	47.9%
52 / 181	Gemma 2 9B · 9.2B	45.1%
53 / 182	Qwen2.5 7B · 7.6B	45.0%
54 / 183	Mistral Nemo Instruct 2407 · 12.2B	44.8%
55 / 184	Llama 3.1 8B Instruct · 8.0B	44.3%
56 / 187	Qwen2.5 3B · 3.1B	43.7%
57 / 188	Gemma 3 4B IT · 4B	43.6%
58 / 190	Mixtral 8x7B Instruct v0.1 · 46.7B	43.3%
59 / 191	Yi 34B · 34.4B	43.0%
60 / 194	MiMo 7B Base · 7.8B	41.9%
61 / 195	DeepSeek Coder v2 Lite Instruct · 15.7B	41.6%
62 / 197	Mixtral 8x7B v0.1 · 46.7B	41.0%
63 / 202	WizardLM 2 8x22B · 140.6B	39.2%
64 / 204	Yi 1.5 6B Chat · 6.1B	38.2%
65 / 205	Qwen1.5 14B Chat · 14.2B	38.0%
66 / 207	C4ai Command R V01 · 35.0B	37.9%
67 / 209	Llama 2 70B HF · 69.0B	37.5%
68 / 214	Llama 3.1 8B · 8.0B	36.6%
69 / 217	DeepSeek Coder v2 Lite Base · 15.7B	34.4%
70 / 218	Aya Expanse 8B · 8.0B	33.7%
71 / 219	Gemma 7B · 8.5B	33.7%
72 / 222	Zephyr 7B Beta · 7.2B	33.0%
73 / 223	Qwen2.5 1.5B · 1.5B	32.1%
74 / 226	Mistral 7B v0.1 · 7B	30.9%
75 / 227	Mistral 7B Instruct v0.2 · 7B	30.8%
76 / 228	Mistral 7B v0.2 · 7.2B	30.4%
77 / 229	Qwen3.5 0.8B · 873M	29.7%
78 / 230	Qwen1.5 7B Chat · 7.7B	29.1%
79 / 231	Yi 6B Chat · 6.1B	28.8%
80 / 233	Yi 6B · 6.1B	26.5%
81 / 235	Mistral 7B Instruct v0.1 · 7B	25.8%
82 / 237	Llama 2 13B HF · 13.0B	25.3%
83 / 239	Llemma 7B · 7B	23.4%
84 / 241	Qwen2 1.5B · 1.5B	22.6%
85 / 242	Llama 3.2 3B · 3.2B	22.2%
86 / 245	Llama 2 7B · 7B	20.3%
87 / 246	SmolLM2 1.7B · 1.7B	18.3%
88 / 248	Gemma 2B · 2.5B	15.8%
89 / 249	Gemma 2 2B IT · 2.6B	15.6%
90 / 251	Qwen2.5 0.5B · 494M	14.9%
91 / 252	Gemma 3 1B IT · 1000M	14.7%
92 / 254	Granite 3.1 1B A400m Base · 1.3B	12.3%
93 / 255	Llama 3.2 1B · 1.2B	11.9%
94 / 256	SmolLM 1.7B · 1.7B	11.9%
95 / 257	SmolLM2 360M · 362M	11.4%
96 / 258	SmolLM 135M · 135M	11.2%
97 / 260	SmolLM2 135M · 135M	10.8%
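Each row of the table above packs five values into three tab-separated fields: two ranks, a model name, a parameter count, and a score. A minimal sketch of how such a row could be split apart is shown here, assuming the exact layout seen in the table (`open / overall`, then `name · size`, then a percentage). The `Row` type and both function names are hypothetical helpers, not part of any tool the source describes.

```python
from typing import NamedTuple


class Row(NamedTuple):
    open_rank: int      # rank among open models
    overall_rank: int   # rank including proprietary models
    name: str
    size: str           # kept as text, e.g. "228.7B" or "873M"
    score: float        # MMLU-Pro score in percent


def parse_row(line: str) -> Row:
    """Parse one tab-separated leaderboard row."""
    ranks, model, score = line.strip().split("\t")
    open_rank, overall_rank = (int(part) for part in ranks.split("/"))
    # Split on the last "·" so the size is separated from the model name.
    name, size = (part.strip() for part in model.rsplit("·", 1))
    return Row(open_rank, overall_rank, name, size, float(score.rstrip("%")))


def size_in_billions(size: str) -> float:
    """Convert "228.7B" or "873M" to a number of billions of parameters."""
    value, unit = float(size[:-1]), size[-1]
    return value / 1000 if unit == "M" else value


row = parse_row("1 / 6\tMiniMax M2.1 · 228.7B\t88.0%")
print(row.name, row.open_rank, row.overall_rank, size_in_billions(row.size), row.score)
```

Keeping the size as text in `Row` preserves the table's own notation. The conversion helper is only needed when sizes must be compared numerically.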

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
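The efficiency frontier described above can be computed by sorting models by size and keeping each one that beats every smaller-or-equal model seen so far. The sketch below assumes that definition and uses a hand-picked subset of rows from the ranking table, so its result illustrates the method only; the frontier over all 97 models may include others. The function name is hypothetical.

```python
def efficiency_frontier(models):
    """Return the models that score higher than every model of equal or smaller size.

    models: iterable of (name, size_in_billions, score_percent) tuples.
    """
    frontier = []
    best = float("-inf")
    # Sort by size ascending; at equal size, put the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best:
            frontier.append((name, size, score))
            best = score
    return frontier


# A subset of the table: (name, parameters in billions, MMLU-Pro score).
sample = [
    ("SmolLM2 135M", 0.135, 10.8),
    ("SmolLM 135M", 0.135, 11.2),
    ("Qwen3.5 0.8B", 0.873, 29.7),
    ("Qwen3.5 4B", 4.7, 79.1),
    ("Qwen3.5 9B", 9.7, 82.5),
    ("Phi 4", 14.7, 70.4),
    ("Seed OSS 36B Instruct", 36.2, 82.7),
    ("Qwen3.5 122B A10B", 125.1, 86.7),
    ("MiniMax M2.1", 228.7, 88.0),
    ("DeepSeek R1", 684.5, 84.0),
]

for name, size, score in efficiency_frontier(sample):
    print(f"{size:>8.3f}B  {score:5.1f}%  {name}")
```

In this subset, Phi 4 and DeepSeek R1 fall off the frontier because a smaller model already scores higher than each of them.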


Scores aggregated from tigerlab. llmrun does not run this benchmark; see the source for methodology, or the about-benchmarks page for what it measures.