MMLU Leaderboard

MMLU (Massive Multitask Language Understanding) covers 57 subjects, from history to law to medicine, as multiple-choice questions. It is the long-standing default benchmark for broad knowledge and remains the most widely reported general benchmark.

Source: epoch · 36 open models ranked + 100 proprietary · Data through Feb 2025

Open models ranked on MMLU

# shows rank among open models / rank overall (including proprietary).

#         Model                                Score
1 / 7     Llama 3.3 70B Instruct · 70.6B       86.3%
2 / 10    Phi 4 · 14.7B                        84.8%
3 / 13    Llama 3.1 405B · 405B                84.4%
4 / 16    Qwen2.5 72B Instruct · 72.7B         83.4%
5 / 23    Llama 3.1 70B Instruct · 70.6B       80.1%
6 / 25    Qwen2.5 14B Instruct · 14.8B         79.9%
7 / 28    Meta Llama 3 70B Instruct · 70.6B    79.3%
8 / 42    Gemma 2 27B IT · 27.2B               75.7%
9 / 44    Qwen2.5 Coder 14B · 14.8B            75.2%
10 / 53   Qwen2.5 7B Instruct · 7.6B           72.9%
11 / 55   Gemma 2 9B IT · 9.2B                 72.1%
12 / 58   Falcon 180B · 180B                   70.6%
13 / 62   Llama 2 70B HF · 69.0B               69.9%
14 / 67   Phi 3 Mini 4k Instruct · 3.8B        68.8%
15 / 72   Qwen2.5 Coder 7B · 7.6B              68.0%
16 / 75   Meta Llama 3 8B Instruct · 8.0B      66.5%
17 / 78   Starcoder2 15B · 16.0B               64.1%
18 / 79   Gemma 7B · 8.5B                      63.6%
19 / 84   Mistral 7B Instruct v0.2 · 7B        62.5%
20 / 86   DeepSeek Coder v2 Lite Base · 15.7B  60.5%
21 / 88   Llama 2 70B Chat · 70B               59.9%
22 / 89   Mistral 7B Instruct v0.3 · 7.2B      59.9%
23 / 95   Mistral 7B v0.1 · 7B                 56.6%
24 / 97   Phi 2 · 2.8B                         56.3%
25 / 98   Llama 3.1 8B Instruct · 8.0B         56.1%
26 / 103  Qwen2.5 Coder 1.5B · 1.5B            53.6%
27 / 113  Llama 2 7B · 7B                      45.3%
28 / 117  Gemma 2B · 2.5B                      42.3%
29 / 118  Qwen2.5 Coder 0.5B · 494M            42.0%
30 / 121  Starcoder2 7B · 7.2B                 38.8%
31 / 122  Phi 1 5 · 1.4B                       37.6%
32 / 125  Llama 7B · 6.7B                      35.2%
33 / 130  Deepseek Coder 1.3B Base · 1.3B      25.8%
34 / 132  GPT J 6B · 6B                        25.7%
35 / 134  Cerebras GPT 13B · 13B               24.6%
36 / 136  Falcon 7B · 7.2B                     23.9%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: MMLU score (vertical axis, 23.9% to 86.3%) against model size (horizontal axis, log scale, 1B to 100B), one point per ranked open model. Labelled points: Qwen2.5 Coder 0.5B, Phi 2, Phi 3 Mini 4k Instruct, Qwen2.5 7B Instruct, Phi 4, Llama 3.3 70B Instruct.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Qwen2.5 Coder 0.5B, 494M, score 42.0% — on the efficiency frontier (best score at its size or smaller).
  • Qwen2.5 Coder 1.5B, 2B, score 53.6% — on the efficiency frontier (best score at its size or smaller).
  • Phi 2, 3B, score 56.3% — on the efficiency frontier (best score at its size or smaller).
  • Phi 3 Mini 4k Instruct, 4B, score 68.8% — on the efficiency frontier (best score at its size or smaller).
  • Qwen2.5 7B Instruct, 8B, score 72.9% — on the efficiency frontier (best score at its size or smaller).
  • Phi 4, 15B, score 84.8% — on the efficiency frontier (best score at its size or smaller).
  • Llama 3.3 70B Instruct, 71B, score 86.3% — on the efficiency frontier (best score at its size or smaller).
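
The frontier rule described above (a model is on the frontier if it beats every model of its size or smaller) can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this page: the `Model` tuple and the `efficiency_frontier` function are hypothetical names, and the data is a hand-picked subset of the table above, with sizes taken from the table rather than the rounded chart labels.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions (from the table above)
    score: float     # MMLU score in percent


def efficiency_frontier(models):
    """Return the models whose score beats every model of equal or smaller size."""
    frontier = []
    best = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first
    # so a weaker same-size model cannot be admitted ahead of a stronger one.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best:
            frontier.append(m)
            best = m.score
    return frontier


# Subset of the leaderboard, for illustration only.
MODELS = [
    Model("Qwen2.5 Coder 0.5B", 0.494, 42.0),
    Model("Deepseek Coder 1.3B Base", 1.3, 25.8),
    Model("Phi 1 5", 1.4, 37.6),
    Model("Qwen2.5 Coder 1.5B", 1.5, 53.6),
    Model("Gemma 2B", 2.5, 42.3),
    Model("Phi 2", 2.8, 56.3),
    Model("Phi 3 Mini 4k Instruct", 3.8, 68.8),
    Model("Mistral 7B v0.1", 7.0, 56.6),
    Model("Qwen2.5 7B Instruct", 7.6, 72.9),
    Model("Meta Llama 3 8B Instruct", 8.0, 66.5),
    Model("Phi 4", 14.7, 84.8),
    Model("Qwen2.5 14B Instruct", 14.8, 79.9),
    Model("Llama 3.1 70B Instruct", 70.6, 80.1),
    Model("Llama 3.3 70B Instruct", 70.6, 86.3),
    Model("Llama 3.1 405B", 405.0, 84.4),
]

if __name__ == "__main__":
    for m in efficiency_frontier(MODELS):
        print(f"{m.name}: {m.params_b}B, {m.score}%")
```

On this subset the function returns the same seven models listed above. Larger models such as Llama 3.1 405B drop out because a smaller model already scores higher.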

MMLU: frequently asked questions

What is the best open LLM on MMLU?
Llama 3.3 70B Instruct is the top open model on MMLU, scoring 86.3%. Among all models tested — including proprietary ones — it ranks #7.
What's the best MMLU model you can run on a 24 GB GPU?
Phi 4 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 84.8% on MMLU.
What's the best MMLU model you can run on a 12 GB GPU?
Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 84.8% on MMLU.
Can open models match proprietary models on MMLU?
Not quite on MMLU: the strongest proprietary model (gpt-4o-2024-11-20) scores 88.1%, ahead of the best open model (Llama 3.3 70B Instruct) at 86.3% — but you can run the open one yourself.
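
The two GPU answers above depend on a rough estimate of how much memory a quantized model needs. The sketch below shows one way such an estimate might work. It is an assumption, not this page's documented method: weights are taken as parameters × bits / 8 bytes, and a hypothetical 10% overhead factor is added, which lands near the "about 8 GB" quoted for Phi 4. Real usage also depends on context length, KV cache, and the runtime, none of which are modelled here. The names `estimate_gb` and `best_fit` are illustrative.

```python
def estimate_gb(params_b, bits=4, overhead=1.10):
    """Rough memory need in GB for a model with params_b billion parameters.

    Assumption: weights dominate, and a flat 10% overhead covers the rest.
    """
    return params_b * bits / 8 * overhead


# (name, parameters in billions, MMLU score in percent), from the table above.
OPEN_MODELS = [
    ("Llama 3.3 70B Instruct", 70.6, 86.3),
    ("Phi 4", 14.7, 84.8),
    ("Llama 3.1 405B", 405.0, 84.4),
    ("Qwen2.5 72B Instruct", 72.7, 83.4),
    ("Qwen2.5 14B Instruct", 14.8, 79.9),
    ("Qwen2.5 7B Instruct", 7.6, 72.9),
]


def best_fit(models, vram_gb):
    """Highest-scoring model whose estimated size fits in vram_gb, or None."""
    fitting = [m for m in models if estimate_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)


if __name__ == "__main__":
    for vram in (12, 24):
        print(vram, "GB ->", best_fit(OPEN_MODELS, vram))
```

Under these assumptions a 70B-class model needs well over 24 GB at 4-bit, which is why Phi 4 is the answer for both the 12 GB and the 24 GB card.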

Scores are aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the about benchmarks page for what it measures.