What is the best open LLM on MMLU?

DeepSeek v3 is the top open model on MMLU, scoring 87.2%. Among all models tested, including proprietary ones, it ranks #3. The top model overall is OpenAI's GPT 4o (Nov 20, 2024) at 88.1%.

What's the best MMLU model you can run on a 24 GB GPU?

Phi 4 (14.7B parameters) scores 84.8% on MMLU, the highest of any open model that fits in 24 GB at 4-bit quantization. At that precision it needs about 8 GB.

What's the best MMLU model you can run on a 12 GB GPU?

Phi 4 (14.7B parameters) is also the highest-scoring open model that fits in 12 GB at 4-bit quantization. It needs about 8 GB and scores 84.8% on MMLU.
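
The "about 8 GB" figure is consistent with a common rule of thumb: 4 bits is half a byte per parameter, plus some headroom for the KV cache and runtime buffers. The sketch below is a rough illustration, not the site's stated method. The 10% overhead factor and the function names are assumptions, and 1 GB is treated as 10^9 bytes. The model list is a small subset copied from the leaderboard.

```python
def estimate_vram_gb(params_billions: float, bits: int = 4, overhead: float = 0.10) -> float:
    """Rough memory estimate in GB: quantized weights plus a flat headroom factor.

    The overhead value is an illustrative assumption; real usage depends on
    context length, runtime, and quantization format.
    """
    weight_gb = params_billions * bits / 8  # billions of params * bytes per param
    return weight_gb * (1 + overhead)


# (name, parameters in billions, MMLU score in %) copied from the leaderboard
OPEN_MODELS = [
    ("DeepSeek v3", 684.5, 87.2),
    ("Llama 3.3 70B Instruct", 70.6, 86.3),
    ("Qwen2.5 72B Instruct", 72.7, 85.3),
    ("Phi 4", 14.7, 84.8),
    ("Qwen2.5 14B Instruct", 14.8, 79.9),
    ("Qwen2.5 Coder 32B", 32.8, 79.1),
]


def best_fit(models, vram_gb: float):
    """Return the highest-scoring model whose estimate fits in vram_gb, or None."""
    fitting = [m for m in models if estimate_vram_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)
```

On this subset, `best_fit(OPEN_MODELS, 24)` and `best_fit(OPEN_MODELS, 12)` both return the Phi 4 entry. The 70B-class models come out well above 24 GB under the same estimate.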

Can open models match proprietary models on MMLU?

Not quite: the strongest proprietary model, GPT 4o (Nov 20, 2024), scores 88.1%, which is 0.9 percentage points ahead of the best open model, DeepSeek v3, at 87.2%. The open model's weights, however, are available to run yourself.

MMLU Leaderboard

Name: MMLU — open LLM scores
Creator: epoch

MMLU (Massive Multitask Language Understanding) covers 57 subjects, from history to law to medicine, as multiple-choice questions. It is the long-standing default benchmark for broad knowledge and remains one of the most widely reported general benchmarks.

Source: epoch · 76 open models ranked + 60 proprietary · Data through Feb 2025

All models ranked on MMLU

Proprietary (closed) models are marked "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.

#	Model · Parameters	Score (MMLU)
1	GPT 4o (Nov 20, 2024) · proprietary	88.1%
2	Claude 3.5 Sonnet (Oct 22, 2024) · proprietary	87.3%
3	DeepSeek v3 · 684.5B	87.2%
4	Gemini 1.5 Pro 002 · proprietary	86.9%
5	Claude 3.5 Sonnet (Jun 20, 2024) · proprietary	86.5%
6	GPT 4 (Mar 14) · proprietary	86.4%
7	Llama 3.3 70B Instruct · 70.6B	86.3%
8	Gemini 1.5 Pro 001 · proprietary	85.9%
9	Qwen2.5 72B Instruct · 72.7B	85.3%
10	Qwen2.5 72B · 72.7B	85.0%
11	Phi 4 · 14.7B	84.8%
12	Claude 3 Opus (Feb 29, 2024) · proprietary	84.6%
13	Llama 3.1 405B Instruct · 405.9B	84.5%
14	Llama 3.1 405B · 405.9B	84.4%
15	GPT 4o (Aug 06, 2024) · proprietary	84.3%
16	GPT 4o (May 13, 2024) · proprietary	84.2%
17	Gemini 1.5 Pro 001 Feb24 · proprietary	82.7%
18	GPT 4 (Jun 13) · proprietary	82.4%
19	Qwen2 72B Instruct · 72.7B	82.4%
20	Amazon.nova Pro v1:0 · proprietary	82.0%
21	GPT 4o Mini (Jul 18, 2024) · proprietary	81.8%
22	GPT 4 Turbo (Apr 09, 2024) · proprietary	81.3%
23	Llama 3.2 90B Vision Instruct · 88.6B	80.3%
24	Llama 3.1 70B Instruct · 70.6B	80.1%
25	Mistral Large 2407 · proprietary	80.0%
26	Qwen2.5 14B Instruct · 14.8B	79.9%
27	Gemini 2.0 Flash Exp · proprietary	79.7%
28	GPT 4 Turbo · proprietary	79.6%
29	Meta Llama 3 70B Instruct · 70.6B	79.3%
30	Yi Large · proprietary	79.3%
31	Qwen2.5 Coder 32B · 32.8B	79.1%
32	Claude 2.0 · proprietary	78.5%
33	DeepSeek v2 · 235.7B	78.4%
34	Phi 3 Medium 128K Instruct · proprietary	78.0%
35	Gemini 1.5 Flash 001 · proprietary	77.9%
36	Gemini 1.5 Flash (May 14) · proprietary	77.8%
37	Mixtral 8x22B v0.1 · 140.6B	77.8%
38	Amazon.nova Lite v1:0 · proprietary	77.0%
39	Claude 1.3 · proprietary	77.0%
40	Yi 34B · 34.4B	76.3%
41	Claude 3 Sonnet (Feb 29, 2024) · proprietary	75.9%
42	Gemma 2 27B IT · 27.2B	75.7%
43	Phi 3 Small 8k Instruct · 7.4B	75.7%
44	Qwen2.5 Coder 14B · 14.8B	75.2%
45	Qwen1.5 32B · 32.5B	74.4%
46	Claude 3.5 Haiku (Oct 22, 2024) · proprietary	74.3%
47	Gemini 1.5 Flash 002 · proprietary	73.9%
48	Claude 3 Haiku (Mar 07, 2024) · proprietary	73.8%
49	Claude 2.1 · proprietary	73.5%
50	Yi 34B Chat · 34.4B	73.5%
51	Claude Instant 1.1 · proprietary	73.4%
52	Claude Instant 1.2 · proprietary	73.2%
53	Qwen2.5 7B Instruct · 7.6B	72.9%
54	Inflection 1 · proprietary	72.7%
55	Gemma 2 9B IT · 9.2B	72.1%
56	GPT 3.5 Turbo (Nov 06) · proprietary	71.4%
57	Amazon.nova Micro v1:0 · proprietary	70.8%
58	Falcon 180B · 180B	70.6%
59	Mixtral 8x7B v0.1 · 46.7B	70.6%
60	Gemini 1.0 Pro 001 · proprietary	70.0%
61	Text Davinci 002 · proprietary	70.0%
62	Llama 2 70B HF · 69.0B	69.9%
63	C4ai Command R Plus (Aug 2024) · proprietary	69.4%
64	PaLM 540B · proprietary	69.3%
65	GPT 3.5 Turbo (Jun 13) · proprietary	68.9%
66	Meta Llama 3 8B Instruct · 8.0B	68.8%
67	Mistral Large 2402 · proprietary	68.8%
68	Phi 3 Mini 4k Instruct · 3.8B	68.8%
69	Mistral Small 2402 · proprietary	68.7%
70	Qwen1.5 14B · 14.2B	68.6%
71	StableBeluga2 · 70B	68.6%
72	Yi 9B · 8.8B	68.4%
73	Qwen2.5 Coder 7B · 7.6B	68.0%
74	Chinchilla (70B) · proprietary	67.5%
75	GPT 3.5 Turbo (Jan 25) · proprietary	67.3%
76	Qwen 14B · 14.2B	66.3%
77	Gemma 7B · 8.5B	66.1%
78	C4ai Command R (Aug 2024) · proprietary	65.2%
79	Qwen 14B Chat · 14.2B	65.0%
80	Starcoder2 15B · 16.0B	64.1%
81	Yi 6B · 6.1B	64.0%
82	Llama 65B · proprietary	63.4%
83	Llama 2 34B · proprietary	62.6%
84	Qwen1.5 7B · 7.7B	62.6%
85	Mistral 7B Instruct v0.2 · 7.2B	62.5%
86	Mistral 7B v0.1 · 7B	62.5%
87	Yi 6B Chat · 6.1B	61.0%
88	DeepSeek Coder v2 Lite Base · 15.7B	60.5%
89	Gopher (280B) · proprietary	60.0%
90	Llama 2 70B Chat HF · 69.0B	59.9%
91	Mistral 7B Instruct v0.3 · 7.2B	59.9%
92	Baichuan2 13B Base · 13B	59.2%
93	Llama 33B · proprietary	58.7%
94	Nemotron 4 15B · proprietary	58.7%
95	Falcon 11B · 11.1B	58.4%
96	Phi 2 · 2.8B	58.4%
97	Internlm Chat 20B · 20B	57.4%
98	Falcon 40B · 41.8B	56.9%
99	Llama 3.2 11B Vision Instruct · 10.7B	56.5%
100	Llama 3.1 8B Instruct · 8.0B	56.1%
101	Llama 2 13B HF · 13.0B	55.6%
102	Baichuan2 13B Chat · 13B	55.1%
103	Baichuan2 7B Base · 7B	54.2%
104	PaLM 62B · proprietary	53.7%
105	Qwen2.5 Coder 1.5B · 1.5B	53.6%
106	Baichuan 13B Base · 13B	51.6%
107	Internlm 7B · 7B	51.0%
108	Llama 2 13B Chat HF · 13.0B	50.9%
109	INTELLECT 1 Instruct · 10.2B	49.9%
110	Chatglm2 6B · 6B	47.9%
111	Mpt 30B · proprietary	47.9%
112	Llama 13B · proprietary	47.7%
113	Llama 2 7B HF · 6.7B	45.8%
114	Qwen 7B · 7.7B	45.0%
115	Text Davinci 001 · proprietary	43.9%
116	Baichuan 7B · 7B	42.3%
117	Gemma 2B · 2.5B	42.3%
118	Qwen2.5 Coder 0.5B · 494M	42.0%
119	CodeQwen1.5 7B · 7.3B	40.5%
120	DeepSeek Coder 33B Base · proprietary	39.4%
121	Starcoder2 7B · 7.2B	38.8%
122	Phi 1 5 · 1.4B	37.6%
123	Starcoder2 3B · 3.0B	36.6%
124	DeepSeek Coder 6.7b Base · proprietary	36.4%
125	Xgen 7B 8k Base · 7B	36.3%
126	Llama 7B · 6.7B	35.6%
127	Falcon 7B · 7.2B	35.0%
128	Mpt 7B · proprietary	30.8%
129	Open Llama 7B · proprietary	29.9%
130	Qwen 1 8B · 1.8B	28.2%
131	RedPajama INCITE 7B Base · proprietary	26.3%
132	Cerebras GPT 13B · 13B	26.2%
133	Dolly v2 12B · proprietary	26.2%
134	Deepseek Coder 1.3B Base · 1.3B	25.8%
135	GPT J 6B · 6B	25.7%
136	Opt 13B · proprietary	25.1%
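
Each row above follows the same pattern: rank, then the model name and size joined by " · ", then the score, separated by tabs. The sketch below shows one way to read such a row into a record. It assumes that layout holds for every row. The `Row` type and `parse_row` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    rank: int
    name: str
    params_b: Optional[float]  # parameters in billions; None when labeled "proprietary"
    score: float               # MMLU score in percent


def parse_row(line: str) -> Row:
    """Parse one tab-separated leaderboard row."""
    rank, model, score = line.rstrip("\n").split("\t")
    # Split on the last " · " so any separator inside a model name is preserved.
    name, _, size = model.rpartition(" · ")
    if size == "proprietary":
        params = None
    elif size.endswith("M"):
        params = float(size[:-1]) / 1000  # millions -> billions
    else:
        params = float(size.rstrip("B"))
    return Row(int(rank), name, params, float(score.rstrip("%")))
```

For example, `parse_row("3\tDeepSeek v3 · 684.5B\t87.2%")` yields a record with rank 3, 684.5 billion parameters, and a score of 87.2.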

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
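
The efficiency frontier described above can be computed as a running maximum over models sorted by size: a model is on the frontier only if it beats every smaller model's score. The sketch below assumes that definition. The function name is hypothetical, and the sample is a small subset of the leaderboard, so the result is the frontier of that subset, not of the full table.

```python
def efficiency_frontier(models):
    """Return the models that set a new best score at their size, smallest first.

    models: iterable of (name, params_billions, score) tuples.
    """
    frontier = []
    best = float("-inf")
    # Sort by size ascending; on equal size, put the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best:
            frontier.append((name, size, score))
            best = score
    return frontier


# Subset of open models copied from the leaderboard: (name, size in B, MMLU %)
SAMPLE = [
    ("Phi 2", 2.8, 58.4),
    ("Phi 3 Mini 4k Instruct", 3.8, 68.8),
    ("Phi 3 Small 8k Instruct", 7.4, 75.7),
    ("Qwen2.5 7B Instruct", 7.6, 72.9),
    ("Phi 4", 14.7, 84.8),
    ("Qwen2.5 14B Instruct", 14.8, 79.9),
    ("Llama 3.3 70B Instruct", 70.6, 86.3),
    ("Qwen2.5 72B Instruct", 72.7, 85.3),
    ("DeepSeek v3", 684.5, 87.2),
]

frontier_names = [name for name, _, _ in efficiency_frontier(SAMPLE)]
```

On this sample, Qwen2.5 7B Instruct, Qwen2.5 14B Instruct, and Qwen2.5 72B Instruct fall below the frontier because a smaller model in the sample already scores higher.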

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.