Knowledge
MMLU Leaderboard
MMLU (Massive Multitask Language Understanding) is a multiple-choice benchmark spanning 57 subjects, from history to law to medicine. It is the long-standing default for broad knowledge and remains the most widely reported general benchmark.
Source: epoch · 36 open models ranked + 100 proprietary · Data through Feb 2025
Open models ranked on MMLU
The # column shows each model's rank among open models / rank overall (including proprietary).
| # | Model | Score |
|---|---|---|
| 1 / 7 | Llama 3.3 70B Instruct · 70.6B | 86.3% |
| 2 / 10 | Phi 4 · 14.7B | 84.8% |
| 3 / 13 | Llama 3.1 405B · 405B | 84.4% |
| 4 / 16 | Qwen2.5 72B Instruct · 72.7B | 83.4% |
| 5 / 23 | Llama 3.1 70B Instruct · 70.6B | 80.1% |
| 6 / 25 | Qwen2.5 14B Instruct · 14.8B | 79.9% |
| 7 / 28 | Meta Llama 3 70B Instruct · 70.6B | 79.3% |
| 8 / 42 | Gemma 2 27B IT · 27.2B | 75.7% |
| 9 / 44 | Qwen2.5 Coder 14B · 14.8B | 75.2% |
| 10 / 53 | Qwen2.5 7B Instruct · 7.6B | 72.9% |
| 11 / 55 | Gemma 2 9B IT · 9.2B | 72.1% |
| 12 / 58 | Falcon 180B · 180B | 70.6% |
| 13 / 62 | Llama 2 70B HF · 69.0B | 69.9% |
| 14 / 67 | Phi 3 Mini 4k Instruct · 3.8B | 68.8% |
| 15 / 72 | Qwen2.5 Coder 7B · 7.6B | 68.0% |
| 16 / 75 | Meta Llama 3 8B Instruct · 8.0B | 66.5% |
| 17 / 78 | Starcoder2 15B · 16.0B | 64.1% |
| 18 / 79 | Gemma 7B · 8.5B | 63.6% |
| 19 / 84 | Mistral 7B Instruct v0.2 · 7B | 62.5% |
| 20 / 86 | DeepSeek Coder v2 Lite Base · 15.7B | 60.5% |
| 21 / 88 | Llama 2 70B Chat · 70B | 59.9% |
| 22 / 89 | Mistral 7B Instruct v0.3 · 7.2B | 59.9% |
| 23 / 95 | Mistral 7B v0.1 · 7B | 56.6% |
| 24 / 97 | Phi 2 · 2.8B | 56.3% |
| 25 / 98 | Llama 3.1 8B Instruct · 8.0B | 56.1% |
| 26 / 103 | Qwen2.5 Coder 1.5B · 1.5B | 53.6% |
| 27 / 113 | Llama 2 7B · 7B | 45.3% |
| 28 / 117 | Gemma 2B · 2.5B | 42.3% |
| 29 / 118 | Qwen2.5 Coder 0.5B · 494M | 42.0% |
| 30 / 121 | Starcoder2 7B · 7.2B | 38.8% |
| 31 / 122 | Phi 1 5 · 1.4B | 37.6% |
| 32 / 125 | Llama 7B · 6.7B | 35.2% |
| 33 / 130 | Deepseek Coder 1.3B Base · 1.3B | 25.8% |
| 34 / 132 | GPT J 6B · 6B | 25.7% |
| 35 / 134 | Cerebras GPT 13B · 13B | 24.6% |
| 36 / 136 | Falcon 7B · 7.2B | 23.9% |
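As a small illustration of how to read the # column, the sketch below splits a cell such as `3 / 13` into its two ranks and subtracts them to count how many non-open models sit above that row. It assumes ranks are assigned without ties and that every model not counted as open is proprietary. The function name `proprietary_ahead` is made up for this example.

```python
def proprietary_ahead(rank_cell: str) -> int:
    """Given a '#' cell like '3 / 13', return how many non-open models rank higher.

    Assumes distinct ranks (no ties) and that every model not counted
    as open is proprietary.
    """
    open_rank, overall_rank = (int(part) for part in rank_cell.split("/"))
    return overall_rank - open_rank


# A few cells copied from the table above.
for cell in ["1 / 7", "2 / 10", "36 / 136"]:
    print(cell, "->", proprietary_ahead(cell))
```

For the last row (`36 / 136`) the difference is 100, which matches the 100 proprietary models listed in the source line.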
Score vs model size
Which models give the most quality for their size — the ones worth running locally.
- Qwen2.5 Coder 0.5B, 494M, score 42.0% — on the efficiency frontier (best score at its size or smaller).
- Qwen2.5 Coder 1.5B, 2B, score 53.6% — on the efficiency frontier (best score at its size or smaller).
- Phi 2, 3B, score 56.3% — on the efficiency frontier (best score at its size or smaller).
- Phi 3 Mini 4k Instruct, 4B, score 68.8% — on the efficiency frontier (best score at its size or smaller).
- Qwen2.5 7B Instruct, 8B, score 72.9% — on the efficiency frontier (best score at its size or smaller).
- Phi 4, 15B, score 84.8% — on the efficiency frontier (best score at its size or smaller).
- Llama 3.3 70B Instruct, 71B, score 86.3% — on the efficiency frontier (best score at its size or smaller).
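The frontier rule stated in the list ("best score at its size or smaller") can be sketched as a single pass over the models sorted by parameter count. The code below is a minimal illustration and not the site's own implementation. The `Model` type and the `efficiency_frontier` function are hypothetical names, and the data is a subset of rows copied from the table.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions, as listed in the table
    score: float     # MMLU score in percent


def efficiency_frontier(models):
    """Return models whose score beats every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Sort by size ascending; among equal sizes, consider the highest score first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


# Subset of the leaderboard rows (sizes and scores taken from the table).
MODELS = [
    Model("Llama 3.3 70B Instruct", 70.6, 86.3),
    Model("Phi 4", 14.7, 84.8),
    Model("Llama 3.1 405B", 405.0, 84.4),
    Model("Qwen2.5 14B Instruct", 14.8, 79.9),
    Model("Qwen2.5 7B Instruct", 7.6, 72.9),
    Model("Gemma 2 9B IT", 9.2, 72.1),
    Model("Phi 3 Mini 4k Instruct", 3.8, 68.8),
    Model("Phi 2", 2.8, 56.3),
    Model("Qwen2.5 Coder 1.5B", 1.5, 53.6),
    Model("Gemma 2B", 2.5, 42.3),
    Model("Qwen2.5 Coder 0.5B", 0.494, 42.0),
    Model("Phi 1 5", 1.4, 37.6),
    Model("Falcon 7B", 7.2, 23.9),
]

for m in efficiency_frontier(MODELS):
    print(f"{m.name}: {m.params_b}B, {m.score}%")
```

On this subset the function returns the same seven models as the list above. Larger models that score lower than a smaller one, such as Llama 3.1 405B, drop out.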
MMLU: frequently asked questions
- What is the best open LLM on MMLU?
- Llama 3.3 70B Instruct is the top open model on MMLU, scoring 86.3%. Among all models tested — including proprietary ones — it ranks #7.
- What's the best MMLU model you can run on a 24 GB GPU?
- Phi 4 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 84.8% on MMLU.
- What's the best MMLU model you can run on a 12 GB GPU?
- Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 84.8% on MMLU.
- Can open models match proprietary models on MMLU?
- Not quite on MMLU: the strongest proprietary model (gpt-4o-2024-11-20) scores 88.1%, ahead of the best open model (Llama 3.3 70B Instruct) at 86.3% — but you can run the open one yourself.
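The 24 GB and 12 GB answers above depend on an estimate of how much memory a model needs at 4-bit quantization. The page does not give its formula, so the sketch below uses a common rule of thumb: parameters × bits per weight / 8, padded by an assumed 10% overhead. That overhead factor and the helper names `est_weight_gb` and `best_fit` are assumptions for illustration only. Real usage also depends on context length, KV cache, and the runtime.

```python
def est_weight_gb(params_b: float, bits_per_weight: float = 4.0,
                  overhead: float = 1.1) -> float:
    """Rough memory footprint in GB for a model with params_b billion parameters.

    The overhead factor is an assumed padding, not a figure from the source.
    """
    return params_b * bits_per_weight / 8 * overhead


def best_fit(models, vram_gb: float):
    """Return the highest-scoring (name, params_b, score) tuple that fits, or None."""
    fitting = [m for m in models if est_weight_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)


# (name, parameters in billions, MMLU %) copied from the table above.
TOP_MODELS = [
    ("Llama 3.3 70B Instruct", 70.6, 86.3),
    ("Phi 4", 14.7, 84.8),
    ("Llama 3.1 405B", 405.0, 84.4),
    ("Qwen2.5 72B Instruct", 72.7, 83.4),
    ("Qwen2.5 14B Instruct", 14.8, 79.9),
    ("Gemma 2 27B IT", 27.2, 75.7),
]

for budget in (24, 12):
    print(budget, "GB ->", best_fit(TOP_MODELS, budget))
```

Under these assumptions Phi 4 comes out at roughly 8 GB, in line with the "about 8 GB" quoted above. The 70B-class models are estimated at well over 24 GB, which is why Phi 4 is the answer for both budgets.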
Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.