Knowledge
MMLU Leaderboard
MMLU (Massive Multitask Language Understanding) covers 57 subjects, from history to law to medicine, as multiple-choice questions. It is the long-standing default for broad knowledge and remains the most widely reported general benchmark.
Source: epoch · 36 open models ranked + 100 proprietary · Data through Feb 2025
All models ranked on MMLU
Proprietary / closed models are labeled "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | gpt-4o-2024-11-20 · proprietary | 88.1% |
| 2 | claude-3-5-sonnet-20241022 · proprietary | 87.3% |
| 3 | DeepSeek-V3 · proprietary | 87.1% |
| 4 | gemini-1.5-pro-002 · proprietary | 86.9% |
| 5 | claude-3-5-sonnet-20240620 · proprietary | 86.5% |
| 6 | gpt-4-0314 · proprietary | 86.4% |
| 7 | Llama 3.3 70B Instruct · 70.6B | 86.3% |
| 8 | gemini-1.5-pro-001 · proprietary | 85.9% |
| 9 | Qwen2.5-72B · proprietary | 85.0% |
| 10 | Phi 4 · 14.7B | 84.8% |
| 11 | claude-3-opus-20240229 · proprietary | 84.6% |
| 12 | Llama-3.1-405B-Instruct · proprietary | 84.5% |
| 13 | Llama 3.1 405B · 405B | 84.4% |
| 14 | gpt-4o-2024-08-06 · proprietary | 84.3% |
| 15 | gpt-4o-2024-05-13 · proprietary | 84.2% |
| 16 | Qwen2.5 72B Instruct · 72.7B | 83.4% |
| 17 | gpt-4-0613 · proprietary | 82.4% |
| 18 | qwen2-72b-instruct · proprietary | 82.4% |
| 19 | amazon.nova-pro-v1:0 · proprietary | 82.0% |
| 20 | gemini-1.5-pro-001-feb24 · proprietary | 81.9% |
| 21 | gpt-4-turbo-2024-04-09 · proprietary | 81.3% |
| 22 | Llama-3.2-90B-Vision-Instruct · proprietary | 80.3% |
| 23 | Llama 3.1 70B Instruct · 70.6B | 80.1% |
| 24 | mistral-large-2407 · proprietary | 80.0% |
| 25 | Qwen2.5 14B Instruct · 14.8B | 79.9% |
| 26 | gemini-2.0-flash-exp · proprietary | 79.7% |
| 27 | gpt-4-turbo · proprietary | 79.6% |
| 28 | Meta Llama 3 70B Instruct · 70.6B | 79.3% |
| 29 | Yi-large · proprietary | 79.3% |
| 30 | Qwen2.5-Coder-32B · proprietary | 79.1% |
| 31 | claude-2.0 · proprietary | 78.5% |
| 32 | DeepSeek-V2 · proprietary | 78.4% |
| 33 | Phi-3-medium-128k-instruct · proprietary | 78.0% |
| 34 | gemini-1.5-flash-001 · proprietary | 77.9% |
| 35 | gemini-1.5-flash-0514 · proprietary | 77.8% |
| 36 | Mixtral-8x22B-v0.1 · proprietary | 77.8% |
| 37 | amazon.nova-lite-v1:0 · proprietary | 77.0% |
| 38 | claude-1.3 · proprietary | 77.0% |
| 39 | gpt-4o-mini-2024-07-18 · proprietary | 76.7% |
| 40 | Yi-34B · proprietary | 76.3% |
| 41 | claude-3-sonnet-20240229 · proprietary | 75.9% |
| 42 | Gemma 2 27B IT · 27.2B | 75.7% |
| 43 | Phi-3-small-8k-instruct · proprietary | 75.7% |
| 44 | Qwen2.5 Coder 14B · 14.8B | 75.2% |
| 45 | qwen1.5-32B · proprietary | 74.4% |
| 46 | claude-3-5-haiku-20241022 · proprietary | 74.3% |
| 47 | gemini-1.5-flash-002 · proprietary | 73.9% |
| 48 | claude-3-haiku-20240307 · proprietary | 73.8% |
| 49 | claude-2.1 · proprietary | 73.5% |
| 50 | Yi-34B-Chat · proprietary | 73.5% |
| 51 | claude-instant-1.1 · proprietary | 73.4% |
| 52 | claude-instant-1.2 · proprietary | 73.2% |
| 53 | Qwen2.5 7B Instruct · 7.6B | 72.9% |
| 54 | Inflection-1 · proprietary | 72.7% |
| 55 | Gemma 2 9B IT · 9.2B | 72.1% |
| 56 | gpt-3.5-turbo-1106 · proprietary | 71.4% |
| 57 | amazon.nova-micro-v1:0 · proprietary | 70.8% |
| 58 | Falcon 180B · 180B | 70.6% |
| 59 | Mixtral-8x7B-v0.1 · proprietary | 70.5% |
| 60 | gemini-1.0-pro-001 · proprietary | 70.0% |
| 61 | text-davinci-002 · proprietary | 70.0% |
| 62 | Llama 2 70B HF · 69.0B | 69.9% |
| 63 | c4ai-command-r-plus-08-2024 · proprietary | 69.4% |
| 64 | PaLM 540B · proprietary | 69.3% |
| 65 | gpt-3.5-turbo-0613 · proprietary | 68.9% |
| 66 | mistral-large-2402 · proprietary | 68.8% |
| 67 | Phi 3 Mini 4k Instruct · 3.8B | 68.8% |
| 68 | mistral-small-2402 · proprietary | 68.7% |
| 69 | qwen1.5-14B · proprietary | 68.6% |
| 70 | StableBeluga2 · proprietary | 68.6% |
| 71 | Yi-9B · proprietary | 68.4% |
| 72 | Qwen2.5 Coder 7B · 7.6B | 68.0% |
| 73 | Chinchilla (70B) · proprietary | 67.5% |
| 74 | gpt-3.5-turbo-0125 · proprietary | 67.3% |
| 75 | Meta Llama 3 8B Instruct · 8.0B | 66.5% |
| 76 | c4ai-command-r-08-2024 · proprietary | 65.2% |
| 77 | Qwen-14B-Chat · proprietary | 65.0% |
| 78 | Starcoder2 15B · 16.0B | 64.1% |
| 79 | Gemma 7B · 8.5B | 63.6% |
| 80 | LLaMA-65B · proprietary | 63.4% |
| 81 | Yi-6B · proprietary | 63.2% |
| 82 | Llama-2-34b · proprietary | 62.6% |
| 83 | Qwen1.5-7B · proprietary | 62.6% |
| 84 | Mistral 7B Instruct v0.2 · 7B | 62.5% |
| 85 | Yi-6B-Chat · proprietary | 61.0% |
| 86 | DeepSeek Coder v2 Lite Base · 15.7B | 60.5% |
| 87 | Gopher (280B) · proprietary | 60.0% |
| 88 | Llama 2 70B Chat · 70B | 59.9% |
| 89 | Mistral 7B Instruct v0.3 · 7.2B | 59.9% |
| 90 | Baichuan-2-13B-Base · proprietary | 59.2% |
| 91 | Nemotron-4 15B · proprietary | 58.7% |
| 92 | falcon-11b · proprietary | 58.4% |
| 93 | LLaMA-33B · proprietary | 57.8% |
| 94 | internlm-chat-20b · proprietary | 57.4% |
| 95 | Mistral 7B v0.1 · 7B | 56.6% |
| 96 | Llama-3.2-11B-Vision-Instruct · proprietary | 56.5% |
| 97 | Phi 2 · 2.8B | 56.3% |
| 98 | Llama 3.1 8B Instruct · 8.0B | 56.1% |
| 99 | falcon-40b · proprietary | 55.4% |
| 100 | Llama-2-13b · proprietary | 54.8% |
| 101 | Baichuan-2-7B-Base · proprietary | 54.2% |
| 102 | PaLM 62B · proprietary | 53.7% |
| 103 | Qwen2.5 Coder 1.5B · 1.5B | 53.6% |
| 104 | Qwen-14B · proprietary | 53.4% |
| 105 | Baichuan-13B-Base · proprietary | 51.6% |
| 106 | internlm-7b · proprietary | 51.0% |
| 107 | Baichuan2-13B-Chat · proprietary | 50.1% |
| 108 | INTELLECT-1-Instruct · proprietary | 49.9% |
| 109 | chatglm2-6b · proprietary | 47.9% |
| 110 | Llama-2-13b-chat · proprietary | 47.3% |
| 111 | mpt-30b · proprietary | 46.9% |
| 112 | LLaMA-13B · proprietary | 46.3% |
| 113 | Llama 2 7B · 7B | 45.3% |
| 114 | Qwen-7B · proprietary | 45.0% |
| 115 | text-davinci-001 · proprietary | 43.9% |
| 116 | Baichuan-7B · proprietary | 42.3% |
| 117 | Gemma 2B · 2.5B | 42.3% |
| 118 | Qwen2.5 Coder 0.5B · 494M | 42.0% |
| 119 | CodeQwen1.5-7B · proprietary | 40.5% |
| 120 | deepseek-coder-33b-base · proprietary | 39.4% |
| 121 | Starcoder2 7B · 7.2B | 38.8% |
| 122 | Phi 1 5 · 1.4B | 37.6% |
| 123 | starcoder2-3b · proprietary | 36.6% |
| 124 | deepseek-coder-6.7b-base · proprietary | 36.4% |
| 125 | Llama 7B · 6.7B | 35.2% |
| 126 | xgen-7b-8k-base · proprietary | 32.1% |
| 127 | open_llama_7b · proprietary | 28.6% |
| 128 | Qwen-1_8B · proprietary | 28.2% |
| 129 | mpt-7b · proprietary | 27.4% |
| 130 | Deepseek Coder 1.3B Base · 1.3B | 25.8% |
| 131 | RedPajama-INCITE-7B-Base · proprietary | 25.8% |
| 132 | GPT J 6B · 6B | 25.7% |
| 133 | dolly-v2-12b · proprietary | 25.4% |
| 134 | Cerebras GPT 13B · 13B | 24.6% |
| 135 | opt-13b · proprietary | 24.4% |
| 136 | Falcon 7B · 7.2B | 23.9% |
Score vs model size
These models give the most quality for their size, which makes them the ones worth running locally. Each is on the efficiency frontier: it has the best score among ranked open models of its size or smaller. Sizes are the parameter counts from the ranking table.
- Qwen2.5 Coder 0.5B (494M): 42.0%
- Qwen2.5 Coder 1.5B (1.5B): 53.6%
- Phi 2 (2.8B): 56.3%
- Phi 3 Mini 4k Instruct (3.8B): 68.8%
- Qwen2.5 7B Instruct (7.6B): 72.9%
- Phi 4 (14.7B): 84.8%
- Llama 3.3 70B Instruct (70.6B): 86.3%
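The frontier rule above (best score at its size or smaller) can be sketched as a single pass over models sorted by size. This is a minimal illustration, not the site's actual code: the `Model` type, the `efficiency_frontier` function, and the tie handling (a model must strictly beat every smaller one) are assumptions, and the data is a subset of the open models in the ranking table.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions, from the ranking table
    score: float     # MMLU score in percent


def efficiency_frontier(models: list[Model]) -> list[Model]:
    """Keep models that strictly beat every model of the same size or smaller."""
    frontier: list[Model] = []
    best = float("-inf")
    # Sort by size ascending; at equal size, visit the higher score first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best:
            frontier.append(m)
            best = m.score
    return frontier


# A subset of the open models listed in the table (assumed sufficient for illustration).
MODELS = [
    Model("Qwen2.5 Coder 0.5B", 0.494, 42.0),
    Model("Deepseek Coder 1.3B Base", 1.3, 25.8),
    Model("Phi 1 5", 1.4, 37.6),
    Model("Qwen2.5 Coder 1.5B", 1.5, 53.6),
    Model("Gemma 2B", 2.5, 42.3),
    Model("Phi 2", 2.8, 56.3),
    Model("Phi 3 Mini 4k Instruct", 3.8, 68.8),
    Model("Qwen2.5 7B Instruct", 7.6, 72.9),
    Model("Gemma 2 9B IT", 9.2, 72.1),
    Model("Phi 4", 14.7, 84.8),
    Model("Gemma 2 27B IT", 27.2, 75.7),
    Model("Llama 3.3 70B Instruct", 70.6, 86.3),
    Model("Qwen2.5 72B Instruct", 72.7, 83.4),
    Model("Llama 3.1 405B", 405.0, 84.4),
]

frontier_names = [m.name for m in efficiency_frontier(MODELS)]
print(frontier_names)
```

Larger models such as Gemma 2 27B IT or Llama 3.1 405B drop out because a smaller model in the list already scores higher.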
MMLU: frequently asked questions
- What is the best open LLM on MMLU?
- Llama 3.3 70B Instruct is the top open model on MMLU, scoring 86.3%. Among all models tested — including proprietary ones — it ranks #7.
- What's the best MMLU model you can run on a 24 GB GPU?
- Phi 4 is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 84.8% on MMLU.
- What's the best MMLU model you can run on a 12 GB GPU?
- Phi 4 is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 84.8% on MMLU.
- Can open models match proprietary models on MMLU?
- Not quite on MMLU: the strongest proprietary model (gpt-4o-2024-11-20) scores 88.1%, ahead of the best open model (Llama 3.3 70B Instruct) at 86.3% — but you can run the open one yourself.
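The GPU answers above depend on a size estimate for 4-bit quantized weights. The sketch below shows one plausible way to get a figure like "about 8 GB" for Phi 4 and then pick the best-scoring model that fits. The formula (parameters × bits / 8, times a 10% overhead factor) and the names `quantized_size_gb` and `best_fit` are assumptions for illustration, not the site's documented method, and the estimate ignores context-length memory such as the KV cache.

```python
def quantized_size_gb(params_b: float, bits: int = 4, overhead: float = 1.1) -> float:
    """Rough weight footprint in GB for a model with params_b billion parameters.

    The 1.1 overhead factor is an assumed allowance for quantization metadata
    and runtime buffers; real usage varies by format and runtime.
    """
    return params_b * bits / 8 * overhead


def best_fit(models, vram_gb: float, bits: int = 4):
    """Return the highest-scoring (name, params_b, score) tuple that fits, or None."""
    fitting = [m for m in models if quantized_size_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)


# (name, parameters in billions, MMLU %) for a few open models from the table.
OPEN_MODELS = [
    ("Llama 3.3 70B Instruct", 70.6, 86.3),
    ("Phi 4", 14.7, 84.8),
    ("Qwen2.5 14B Instruct", 14.8, 79.9),
    ("Gemma 2 27B IT", 27.2, 75.7),
    ("Qwen2.5 7B Instruct", 7.6, 72.9),
    ("Phi 3 Mini 4k Instruct", 3.8, 68.8),
]

for vram in (24, 12):
    name, params_b, score = best_fit(OPEN_MODELS, vram)
    print(f"{vram} GB: {name} (~{quantized_size_gb(params_b):.1f} GB, {score}%)")
```

Under these assumptions, Llama 3.3 70B Instruct needs roughly 39 GB at 4-bit and so misses the 24 GB budget, which is why Phi 4 comes out on top for both GPU sizes.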
Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the "about benchmarks" page for what it measures.