Knowledge

MMLU Leaderboard

MMLU (Massive Multitask Language Understanding) poses multiple-choice questions across 57 subjects, from history to law to medicine. It is the long-standing default benchmark for broad knowledge and remains the most widely reported general benchmark.

Source: epoch · 36 open models ranked + 100 proprietary · Data through Feb 2025

All models ranked on MMLU

Proprietary / closed models are shown dimmed — you can't run them locally, but they show where the open field stands.

# | Model | Score
1 | gpt-4o-2024-11-20 · proprietary | 88.1%
2 | claude-3-5-sonnet-20241022 · proprietary | 87.3%
3 | DeepSeek-V3 · proprietary | 87.1%
4 | gemini-1.5-pro-002 · proprietary | 86.9%
5 | claude-3-5-sonnet-20240620 · proprietary | 86.5%
6 | gpt-4-0314 · proprietary | 86.4%
7 | Llama 3.3 70B Instruct · 70.6B | 86.3%
8 | gemini-1.5-pro-001 · proprietary | 85.9%
9 | Qwen2.5-72B · proprietary | 85.0%
10 | Phi 4 · 14.7B | 84.8%
11 | claude-3-opus-20240229 · proprietary | 84.6%
12 | Llama-3.1-405B-Instruct · proprietary | 84.5%
13 | Llama 3.1 405B · 405B | 84.4%
14 | gpt-4o-2024-08-06 · proprietary | 84.3%
15 | gpt-4o-2024-05-13 · proprietary | 84.2%
16 | Qwen2.5 72B Instruct · 72.7B | 83.4%
17 | gpt-4-0613 · proprietary | 82.4%
18 | qwen2-72b-instruct · proprietary | 82.4%
19 | amazon.nova-pro-v1:0 · proprietary | 82.0%
20 | gemini-1.5-pro-001-feb24 · proprietary | 81.9%
21 | gpt-4-turbo-2024-04-09 · proprietary | 81.3%
22 | Llama-3.2-90B-Vision-Instruct · proprietary | 80.3%
23 | Llama 3.1 70B Instruct · 70.6B | 80.1%
24 | mistral-large-2407 · proprietary | 80.0%
25 | Qwen2.5 14B Instruct · 14.8B | 79.9%
26 | gemini-2.0-flash-exp · proprietary | 79.7%
27 | gpt-4-turbo · proprietary | 79.6%
28 | Meta Llama 3 70B Instruct · 70.6B | 79.3%
29 | Yi-large · proprietary | 79.3%
30 | Qwen2.5-Coder-32B · proprietary | 79.1%
31 | claude-2.0 · proprietary | 78.5%
32 | DeepSeek-V2 · proprietary | 78.4%
33 | Phi-3-medium-128k-instruct · proprietary | 78.0%
34 | gemini-1.5-flash-001 · proprietary | 77.9%
35 | gemini-1.5-flash-0514 · proprietary | 77.8%
36 | Mixtral-8x22B-v0.1 · proprietary | 77.8%
37 | amazon.nova-lite-v1:0 · proprietary | 77.0%
38 | claude-1.3 · proprietary | 77.0%
39 | gpt-4o-mini-2024-07-18 · proprietary | 76.7%
40 | Yi-34B · proprietary | 76.3%
41 | claude-3-sonnet-20240229 · proprietary | 75.9%
42 | Gemma 2 27B IT · 27.2B | 75.7%
43 | Phi-3-small-8k-instruct · proprietary | 75.7%
44 | Qwen2.5 Coder 14B · 14.8B | 75.2%
45 | qwen1.5-32B · proprietary | 74.4%
46 | claude-3-5-haiku-20241022 · proprietary | 74.3%
47 | gemini-1.5-flash-002 · proprietary | 73.9%
48 | claude-3-haiku-20240307 · proprietary | 73.8%
49 | claude-2.1 · proprietary | 73.5%
50 | Yi-34B-Chat · proprietary | 73.5%
51 | claude-instant-1.1 · proprietary | 73.4%
52 | claude-instant-1.2 · proprietary | 73.2%
53 | Qwen2.5 7B Instruct · 7.6B | 72.9%
54 | Inflection-1 · proprietary | 72.7%
55 | Gemma 2 9B IT · 9.2B | 72.1%
56 | gpt-3.5-turbo-1106 · proprietary | 71.4%
57 | amazon.nova-micro-v1:0 · proprietary | 70.8%
58 | Falcon 180B · 180B | 70.6%
59 | Mixtral-8x7B-v0.1 · proprietary | 70.5%
60 | gemini-1.0-pro-001 · proprietary | 70.0%
61 | text-davinci-002 · proprietary | 70.0%
62 | Llama 2 70B HF · 69.0B | 69.9%
63 | c4ai-command-r-plus-08-2024 · proprietary | 69.4%
64 | PaLM 540B · proprietary | 69.3%
65 | gpt-3.5-turbo-0613 · proprietary | 68.9%
66 | mistral-large-2402 · proprietary | 68.8%
67 | Phi 3 Mini 4k Instruct · 3.8B | 68.8%
68 | mistral-small-2402 · proprietary | 68.7%
69 | qwen1.5-14B · proprietary | 68.6%
70 | StableBeluga2 · proprietary | 68.6%
71 | Yi-9B · proprietary | 68.4%
72 | Qwen2.5 Coder 7B · 7.6B | 68.0%
73 | Chinchilla (70B) · proprietary | 67.5%
74 | gpt-3.5-turbo-0125 · proprietary | 67.3%
75 | Meta Llama 3 8B Instruct · 8.0B | 66.5%
76 | c4ai-command-r-08-2024 · proprietary | 65.2%
77 | Qwen-14B-Chat · proprietary | 65.0%
78 | Starcoder2 15B · 16.0B | 64.1%
79 | Gemma 7B · 8.5B | 63.6%
80 | LLaMA-65B · proprietary | 63.4%
81 | Yi-6B · proprietary | 63.2%
82 | Llama-2-34b · proprietary | 62.6%
83 | Qwen1.5-7B · proprietary | 62.6%
84 | Mistral 7B Instruct v0.2 · 7B | 62.5%
85 | Yi-6B-Chat · proprietary | 61.0%
86 | DeepSeek Coder v2 Lite Base · 15.7B | 60.5%
87 | Gopher (280B) · proprietary | 60.0%
88 | Llama 2 70B Chat · 70B | 59.9%
89 | Mistral 7B Instruct v0.3 · 7.2B | 59.9%
90 | Baichuan-2-13B-Base · proprietary | 59.2%
91 | Nemotron-4 15B · proprietary | 58.7%
92 | falcon-11b · proprietary | 58.4%
93 | LLaMA-33B · proprietary | 57.8%
94 | internlm-chat-20b · proprietary | 57.4%
95 | Mistral 7B v0.1 · 7B | 56.6%
96 | Llama-3.2-11B-Vision-Instruct · proprietary | 56.5%
97 | Phi 2 · 2.8B | 56.3%
98 | Llama 3.1 8B Instruct · 8.0B | 56.1%
99 | falcon-40b · proprietary | 55.4%
100 | Llama-2-13b · proprietary | 54.8%
101 | Baichuan-2-7B-Base · proprietary | 54.2%
102 | PaLM 62B · proprietary | 53.7%
103 | Qwen2.5 Coder 1.5B · 1.5B | 53.6%
104 | Qwen-14B · proprietary | 53.4%
105 | Baichuan-13B-Base · proprietary | 51.6%
106 | internlm-7b · proprietary | 51.0%
107 | Baichuan2-13B-Chat · proprietary | 50.1%
108 | INTELLECT-1-Instruct · proprietary | 49.9%
109 | chatglm2-6b · proprietary | 47.9%
110 | Llama-2-13b-chat · proprietary | 47.3%
111 | mpt-30b · proprietary | 46.9%
112 | LLaMA-13B · proprietary | 46.3%
113 | Llama 2 7B · 7B | 45.3%
114 | Qwen-7B · proprietary | 45.0%
115 | text-davinci-001 · proprietary | 43.9%
116 | Baichuan-7B · proprietary | 42.3%
117 | Gemma 2B · 2.5B | 42.3%
118 | Qwen2.5 Coder 0.5B · 494M | 42.0%
119 | CodeQwen1.5-7B · proprietary | 40.5%
120 | deepseek-coder-33b-base · proprietary | 39.4%
121 | Starcoder2 7B · 7.2B | 38.8%
122 | Phi 1 5 · 1.4B | 37.6%
123 | starcoder2-3b · proprietary | 36.6%
124 | deepseek-coder-6.7b-base · proprietary | 36.4%
125 | Llama 7B · 6.7B | 35.2%
126 | xgen-7b-8k-base · proprietary | 32.1%
127 | open_llama_7b · proprietary | 28.6%
128 | Qwen-1_8B · proprietary | 28.2%
129 | mpt-7b · proprietary | 27.4%
130 | Deepseek Coder 1.3B Base · 1.3B | 25.8%
131 | RedPajama-INCITE-7B-Base · proprietary | 25.8%
132 | GPT J 6B · 6B | 25.7%
133 | dolly-v2-12b · proprietary | 25.4%
134 | Cerebras GPT 13B · 13B | 24.6%
135 | opt-13b · proprietary | 24.4%
136 | Falcon 7B · 7.2B | 23.9%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: MMLU score (vertical axis, 23.9% to 86.3%) against model size (horizontal axis, log scale, ticks at 1B, 10B and 100B) for the 36 open models in the table.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Qwen2.5 Coder 0.5B, 494M, score 42.0% — on the efficiency frontier (best score at its size or smaller).
  • Qwen2.5 Coder 1.5B, 2B, score 53.6% — on the efficiency frontier (best score at its size or smaller).
  • Phi 2, 3B, score 56.3% — on the efficiency frontier (best score at its size or smaller).
  • Phi 3 Mini 4k Instruct, 4B, score 68.8% — on the efficiency frontier (best score at its size or smaller).
  • Qwen2.5 7B Instruct, 8B, score 72.9% — on the efficiency frontier (best score at its size or smaller).
  • Phi 4, 15B, score 84.8% — on the efficiency frontier (best score at its size or smaller).
  • Llama 3.3 70B Instruct, 71B, score 86.3% — on the efficiency frontier (best score at its size or smaller).
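
The frontier rule described above ("the best score you can get at each size or smaller") can be sketched in a few lines. This is a minimal illustration, not the site's actual implementation: the function name `efficiency_frontier` and the tuple layout are assumptions, and the data is a subset of the open models from the table, using their listed parameter counts rather than the rounded chart labels.

```python
# (name, parameters in billions, MMLU score in %): a subset of the open
# models from the leaderboard table. The tuple layout is an assumption
# made for this sketch.
MODELS = [
    ("Llama 3.3 70B Instruct", 70.6, 86.3),
    ("Phi 4", 14.7, 84.8),
    ("Llama 3.1 405B", 405.0, 84.4),
    ("Qwen2.5 72B Instruct", 72.7, 83.4),
    ("Qwen2.5 7B Instruct", 7.6, 72.9),
    ("Gemma 2 9B IT", 9.2, 72.1),
    ("Phi 3 Mini 4k Instruct", 3.8, 68.8),
    ("Mistral 7B v0.1", 7.0, 56.6),
    ("Phi 2", 2.8, 56.3),
    ("Qwen2.5 Coder 1.5B", 1.5, 53.6),
    ("Gemma 2B", 2.5, 42.3),
    ("Qwen2.5 Coder 0.5B", 0.494, 42.0),
    ("Phi 1 5", 1.4, 37.6),
    ("Falcon 7B", 7.2, 23.9),
]


def efficiency_frontier(models):
    """Return the models that outscore every model of equal or smaller size."""
    frontier = []
    best_score = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first.
    for name, size_b, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, size_b, score))
            best_score = score
    return frontier


for name, size_b, score in efficiency_frontier(MODELS):
    print(f"{name}: {size_b}B, {score}%")
```

On this subset the function returns the same seven models listed above, ordered from smallest to largest. Larger models such as Llama 3.1 405B drop out because a smaller model already scores higher.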

MMLU: frequently asked questions

What is the best open LLM on MMLU?
Llama 3.3 70B Instruct is the top open model on MMLU, scoring 86.3%. Among all models tested — including proprietary ones — it ranks #7.
What's the best MMLU model you can run on a 24 GB GPU?
Phi 4 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 84.8% on MMLU.
What's the best MMLU model you can run on a 12 GB GPU?
Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 84.8% on MMLU.
Can open models match proprietary models on MMLU?
Not quite: the strongest proprietary model (gpt-4o-2024-11-20) scores 88.1%, 1.8 percentage points ahead of the best open model (Llama 3.3 70B Instruct) at 86.3%. The open model, however, is one you can run yourself.
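
The GPU-fit answers above rest on simple arithmetic: at 4-bit quantization each parameter takes half a byte, so Phi 4's 14.7B parameters need roughly 7.4 GB for weights alone. The sketch below reproduces that reasoning. The 10% overhead factor, the function names `estimated_gb` and `best_fit`, and the short model list are assumptions for illustration; the page does not say how its "about 8 GB" figure is derived, and real memory use also depends on context length and runtime.

```python
# (name, parameters in billions, MMLU score in %): a few open models from
# the leaderboard table.
MODELS = [
    ("Llama 3.3 70B Instruct", 70.6, 86.3),
    ("Phi 4", 14.7, 84.8),
    ("Llama 3.1 405B", 405.0, 84.4),
    ("Qwen2.5 72B Instruct", 72.7, 83.4),
    ("Qwen2.5 14B Instruct", 14.8, 79.9),
    ("Gemma 2 27B IT", 27.2, 75.7),
    ("Qwen2.5 7B Instruct", 7.6, 72.9),
]

# Assumed allowance for quantization metadata and runtime buffers.
# This value is a guess for the sketch, not a figure from the source.
OVERHEAD = 1.10


def estimated_gb(params_b, bits=4, overhead=OVERHEAD):
    """Rough memory estimate in GB: parameters x bits per weight / 8, plus overhead."""
    return params_b * bits / 8 * overhead


def best_fit(models, vram_gb, bits=4):
    """Return the highest-scoring model whose estimate fits in vram_gb, or None."""
    fitting = [m for m in models if estimated_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


for vram_gb in (12, 24):
    name, params_b, score = best_fit(MODELS, vram_gb)
    print(f"{vram_gb} GB: {name} (~{estimated_gb(params_b):.1f} GB, {score}%)")
```

Under these assumptions Phi 4 is the pick for both 12 GB and 24 GB, matching the answers above: Llama 3.3 70B Instruct would need well over 24 GB even at 4-bit.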

Scores aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.