What is the best open LLM on MMLU?

DeepSeek v3 is the top open model on MMLU, scoring 87.2%. Among all models tested, including proprietary ones, it ranks #3. The top model overall is OpenAI's GPT 4o (Nov 20, 2024) at 88.1%.

What's the best MMLU model you can run on a 24 GB GPU?

Phi 4 (14.7B parameters) is the highest-scoring open model that fits in 24 GB at 4-bit quantization, where it needs about 8 GB. It scores 84.8% on MMLU.

What's the best MMLU model you can run on a 12 GB GPU?

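Phi 4 (14.7B parameters) is the highest-scoring open model that fits in 12 GB at 4-bit quantization, where it needs about 8 GB. It scores 84.8% on MMLU. This is the same answer as for a 24 GB GPU, because the four open models that outscore Phi 4 all have 70B parameters or more.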
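
As a rough check on the two GPU answers above, the weights of a quantized model take roughly parameters × bits / 8 bytes. The Python sketch below applies that estimate to a few rows from the leaderboard. The helper names `weight_gb` and `best_fit` and the 10% overhead allowance are assumptions made for illustration. The source does not say how its "about 8 GB" figure was computed, and real memory use also depends on context length and the runtime used.

```python
# A minimal sketch, not the leaderboard's own method.
# Assumption: footprint = weights * overhead, with 1 GB = 1e9 bytes.

def weight_gb(params_billion: float, bits: int = 4) -> float:
    """Approximate size of the weights alone, in GB."""
    return params_billion * bits / 8


def best_fit(models, vram_gb: float, bits: int = 4, overhead: float = 1.1):
    """Return the highest-scoring (name, params_billion, score) tuple whose
    estimated footprint fits in vram_gb, or None if nothing fits.

    overhead=1.1 is a hypothetical allowance for runtime buffers.
    """
    fitting = [m for m in models if weight_gb(m[1], bits) * overhead <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


# (name, parameters in billions, MMLU score in %), copied from the table.
MODELS = [
    ("DeepSeek v3", 684.5, 87.2),
    ("Llama 3.3 70B Instruct", 70.6, 86.3),
    ("Qwen2.5 72B Instruct", 72.7, 85.3),
    ("Qwen2.5 72B", 72.7, 85.0),
    ("Phi 4", 14.7, 84.8),
    ("Qwen2.5 Coder 32B", 32.8, 79.1),
]

if __name__ == "__main__":
    for budget in (24, 12):
        print(budget, "GB:", best_fit(MODELS, budget))
```

Under these assumptions Phi 4 comes out at roughly 7.35 GB of weights, or about 8 GB with the allowance, and it is the pick for both budgets. Qwen2.5 Coder 32B also fits in 24 GB by this estimate but scores lower.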

Can open models match proprietary models on MMLU?

Not quite. The strongest proprietary model, GPT 4o (Nov 20, 2024), scores 88.1%, which is 0.9 percentage points ahead of the best open model, DeepSeek v3, at 87.2%. The open model has the advantage that you can run it yourself.

MMLU Leaderboard

Name: MMLU — open LLM scores
Creator: epoch

MMLU (Massive Multitask Language Understanding) is a set of multiple-choice questions spanning 57 subjects, from history to law to medicine. It is the long-standing default benchmark for broad knowledge and remains the most widely reported general benchmark.

Source: epoch · 76 open models ranked + 60 proprietary · Data through Feb 2025

Open models ranked on MMLU

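The # column shows each model's rank among open models / its rank among all models (including proprietary). The number after each model name is its parameter count.

#	Model · Parameters	Score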
1 / 3	DeepSeek v3 · 684.5B	87.2%
2 / 7	Llama 3.3 70B Instruct · 70.6B	86.3%
3 / 9	Qwen2.5 72B Instruct · 72.7B	85.3%
4 / 10	Qwen2.5 72B · 72.7B	85.0%
5 / 11	Phi 4 · 14.7B	84.8%
6 / 13	Llama 3.1 405B Instruct · 405.9B	84.5%
7 / 14	Llama 3.1 405B · 405.9B	84.4%
8 / 19	Qwen2 72B Instruct · 72.7B	82.4%
9 / 23	Llama 3.2 90B Vision Instruct · 88.6B	80.3%
10 / 24	Llama 3.1 70B Instruct · 70.6B	80.1%
11 / 26	Qwen2.5 14B Instruct · 14.8B	79.9%
12 / 29	Meta Llama 3 70B Instruct · 70.6B	79.3%
13 / 31	Qwen2.5 Coder 32B · 32.8B	79.1%
14 / 33	DeepSeek v2 · 235.7B	78.4%
15 / 37	Mixtral 8x22B v0.1 · 140.6B	77.8%
16 / 40	Yi 34B · 34.4B	76.3%
17 / 42	Gemma 2 27B IT · 27.2B	75.7%
18 / 43	Phi 3 Small 8k Instruct · 7.4B	75.7%
19 / 44	Qwen2.5 Coder 14B · 14.8B	75.2%
20 / 45	Qwen1.5 32B · 32.5B	74.4%
21 / 50	Yi 34B Chat · 34.4B	73.5%
22 / 53	Qwen2.5 7B Instruct · 7.6B	72.9%
23 / 55	Gemma 2 9B IT · 9.2B	72.1%
24 / 58	Falcon 180B · 180B	70.6%
25 / 59	Mixtral 8x7B v0.1 · 46.7B	70.6%
26 / 62	Llama 2 70B HF · 69.0B	69.9%
27 / 66	Meta Llama 3 8B Instruct · 8.0B	68.8%
28 / 68	Phi 3 Mini 4k Instruct · 3.8B	68.8%
29 / 70	Qwen1.5 14B · 14.2B	68.6%
30 / 71	StableBeluga2 · 70B	68.6%
31 / 72	Yi 9B · 8.8B	68.4%
32 / 73	Qwen2.5 Coder 7B · 7.6B	68.0%
33 / 76	Qwen 14B · 14.2B	66.3%
34 / 77	Gemma 7B · 8.5B	66.1%
35 / 79	Qwen 14B Chat · 14.2B	65.0%
36 / 80	Starcoder2 15B · 16.0B	64.1%
37 / 81	Yi 6B · 6.1B	64.0%
38 / 84	Qwen1.5 7B · 7.7B	62.6%
39 / 85	Mistral 7B Instruct v0.2 · 7.2B	62.5%
40 / 86	Mistral 7B v0.1 · 7B	62.5%
41 / 87	Yi 6B Chat · 6.1B	61.0%
42 / 88	DeepSeek Coder v2 Lite Base · 15.7B	60.5%
43 / 90	Llama 2 70B Chat HF · 69.0B	59.9%
44 / 91	Mistral 7B Instruct v0.3 · 7.2B	59.9%
45 / 92	Baichuan2 13B Base · 13B	59.2%
46 / 95	Falcon 11B · 11.1B	58.4%
47 / 96	Phi 2 · 2.8B	58.4%
48 / 97	InternLM Chat 20B · 20B	57.4%
49 / 98	Falcon 40B · 41.8B	56.9%
50 / 99	Llama 3.2 11B Vision Instruct · 10.7B	56.5%
51 / 100	Llama 3.1 8B Instruct · 8.0B	56.1%
52 / 101	Llama 2 13B HF · 13.0B	55.6%
53 / 102	Baichuan2 13B Chat · 13B	55.1%
54 / 103	Baichuan2 7B Base · 7B	54.2%
55 / 105	Qwen2.5 Coder 1.5B · 1.5B	53.6%
56 / 106	Baichuan 13B Base · 13B	51.6%
57 / 107	InternLM 7B · 7B	51.0%
58 / 108	Llama 2 13B Chat HF · 13.0B	50.9%
59 / 109	INTELLECT 1 Instruct · 10.2B	49.9%
60 / 110	ChatGLM2 6B · 6B	47.9%
61 / 113	Llama 2 7B HF · 6.7B	45.8%
62 / 114	Qwen 7B · 7.7B	45.0%
63 / 116	Baichuan 7B · 7B	42.3%
64 / 117	Gemma 2B · 2.5B	42.3%
65 / 118	Qwen2.5 Coder 0.5B · 494M	42.0%
66 / 119	CodeQwen1.5 7B · 7.3B	40.5%
67 / 121	Starcoder2 7B · 7.2B	38.8%
68 / 122	Phi 1.5 · 1.4B	37.6%
69 / 123	Starcoder2 3B · 3.0B	36.6%
70 / 125	XGen 7B 8k Base · 7B	36.3%
71 / 126	Llama 7B · 6.7B	35.6%
72 / 127	Falcon 7B · 7.2B	35.0%
73 / 130	Qwen 1.8B · 1.8B	28.2%
74 / 132	Cerebras GPT 13B · 13B	26.2%
75 / 134	DeepSeek Coder 1.3B Base · 1.3B	25.8%
76 / 135	GPT J 6B · 6B	25.7%
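
To work with the table above programmatically, each row can be split into its five fields. The Python sketch below assumes the rows are tab-separated exactly as they appear here, with the model name and parameter count joined by " · ". The `parse_row` function and its output keys are hypothetical names, not part of any published API.

```python
import re

# Assumed row shape: "<open rank> / <overall rank>\t<name> · <size><B|M>\t<score>%"
ROW = re.compile(
    r"^(?P<open_rank>\d+) / (?P<overall_rank>\d+)\t"
    r"(?P<name>.+?) · (?P<size>[\d.]+)(?P<unit>[BM])\t"
    r"(?P<score>[\d.]+)%$"
)


def parse_row(line: str) -> dict:
    """Parse one leaderboard row into a dict; raise ValueError if it does not match."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    size = float(m["size"])
    if m["unit"] == "M":
        size /= 1000  # normalize millions of parameters to billions
    return {
        "open_rank": int(m["open_rank"]),
        "overall_rank": int(m["overall_rank"]),
        "name": m["name"],
        "params_b": size,
        "score": float(m["score"]),
    }


if __name__ == "__main__":
    print(parse_row("1 / 3\tDeepSeek v3 · 684.5B\t87.2%"))
    print(parse_row("65 / 118\tQwen2.5 Coder 0.5B · 494M\t42.0%"))
```

Normalizing the one row given in millions (494M) to billions keeps every size on the same scale for later comparisons.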

Score vs model size

This chart shows which models deliver the highest score for their size, which makes them the ones most worth running locally.

Each dot is a model. Higher on the chart means a higher score; further left means a smaller model that is easier to run locally. The dashed line marks the efficiency frontier: the best score you can get at each size or smaller.
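
The efficiency frontier described above can be computed as a running maximum over models sorted by size. The Python sketch below does this for a ten-model sample taken from the table. The function name `efficiency_frontier` is an assumption for illustration, and the result is the frontier of this sample only, not of the full leaderboard.

```python
def efficiency_frontier(models):
    """Return the models that beat every model of equal or smaller size.

    models: iterable of (name, params_billion, score) tuples.
    The result is sorted by size, with strictly increasing scores.
    """
    frontier = []
    best = float("-inf")
    # Sort by size ascending; on equal size, put the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best:
            frontier.append((name, size, score))
            best = score
    return frontier


# Sample rows from the table: (name, parameters in billions, MMLU score in %).
SAMPLE = [
    ("DeepSeek v3", 684.5, 87.2),
    ("Llama 3.3 70B Instruct", 70.6, 86.3),
    ("Phi 4", 14.7, 84.8),
    ("Llama 3.1 405B Instruct", 405.9, 84.5),
    ("Qwen2.5 14B Instruct", 14.8, 79.9),
    ("Phi 3 Small 8k Instruct", 7.4, 75.7),
    ("Meta Llama 3 8B Instruct", 8.0, 68.8),
    ("Phi 3 Mini 4k Instruct", 3.8, 68.8),
    ("Phi 2", 2.8, 58.4),
    ("Qwen2.5 Coder 0.5B", 0.494, 42.0),
]

if __name__ == "__main__":
    for name, size, score in efficiency_frontier(SAMPLE):
        print(f"{size:>8.3f}B  {score:5.1f}%  {name}")
```

On this sample, Meta Llama 3 8B Instruct, Qwen2.5 14B Instruct, and Llama 3.1 405B Instruct fall below the frontier because a smaller model in the sample scores higher.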

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.