HellaSwag Leaderboard
HellaSwag is a commonsense benchmark that asks a model to pick the most plausible continuation of an everyday situation from four candidate endings. The wrong endings are adversarially chosen to fool models, so a high score requires grounded commonsense understanding.
Source: epoch · 22 open models ranked + 54 proprietary · Data through Dec 2024
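To make the task format concrete, here is a minimal sketch of how a multiple-choice benchmark of this kind is scored: the model rates each candidate ending, the highest-rated ending is its answer, and the reported score is the percentage of items answered correctly. Everything in the block is illustrative. The two items are invented (they are not from the HellaSwag dataset), and `overlap_score` is a hypothetical stand-in for a real model, which would typically rate endings by likelihood. It is not the methodology used by the source.

```python
from typing import Callable, Sequence, Tuple

# One item: (context, candidate endings, index of the correct ending).
Item = Tuple[str, Sequence[str], int]
Scorer = Callable[[str, str], float]


def pick_ending(context: str, endings: Sequence[str], score: Scorer) -> int:
    """Return the index of the ending the scorer rates most plausible."""
    scores = [score(context, ending) for ending in endings]
    return max(range(len(endings)), key=scores.__getitem__)


def accuracy(items: Sequence[Item], score: Scorer) -> float:
    """Percentage of items where the top-rated ending is the correct one."""
    correct = sum(pick_ending(ctx, ends, score) == label for ctx, ends, label in items)
    return 100.0 * correct / len(items)


def overlap_score(context: str, ending: str) -> float:
    """Hypothetical shallow scorer: counts distinct words shared with the context."""
    return len(set(context.lower().split()) & set(ending.lower().split()))


# Invented toy items, for illustration only.
ITEMS = [
    (
        "A man pours water into a kettle and",
        [
            "switches the kettle on to boil the water",
            "throws a piano out of the window",
            "paints the fence bright red",
            "rides a horse across the field",
        ],
        0,
    ),
    (
        "A woman ties her running shoes and",
        [
            "dives into the swimming pool",
            "starts running down the street",
            "her shoes and her shoes and running",  # adversarial-style distractor
            "bakes a loaf of bread",
        ],
        1,
    ),
]

print(accuracy(ITEMS, overlap_score))  # 50.0
```

The second item shows why adversarially chosen endings matter: the shallow word-overlap scorer prefers the nonsensical distractor because it repeats words from the context, so it gets that item wrong.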
All models ranked on HellaSwag
Proprietary / closed models are marked "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | gpt-4-0314 · proprietary | 95.3% |
| 2 | gpt-4-32k-0314 · proprietary | 95.3% |
| 3 | Llama 3.1 405B · 405B | 89.2% |
| 4 | Falcon 180B · 180B | 89.0% |
| 5 | DeepSeek-V3 · proprietary | 88.9% |
| 6 | DeepSeek-V2 · proprietary | 87.1% |
| 7 | PaLM 2-L · proprietary | 86.8% |
| 8 | Llama 2 70B HF · 69.0B | 85.3% |
| 9 | falcon-40b · proprietary | 85.3% |
| 10 | Qwen2.5-72B · proprietary | 84.8% |
| 11 | Mixtral-8x7B-v0.1 · proprietary | 84.4% |
| 12 | LLaMA-65B · proprietary | 84.2% |
| 13 | StableBeluga2 · proprietary | 84.1% |
| 14 | PaLM 2-M · proprietary | 84.0% |
| 15 | PaLM 540B · proprietary | 83.6% |
| 16 | Qwen2.5-Coder-32B · proprietary | 83.0% |
| 17 | falcon-11b · proprietary | 82.9% |
| 18 | LLaMA-33B · proprietary | 82.8% |
| 19 | Nemotron-4 15B · proprietary | 82.4% |
| 20 | Phi-3-medium-128k-instruct · proprietary | 82.4% |
| 21 | text-davinci-003 · proprietary | 82.2% |
| 22 | PaLM 2-S · proprietary | 82.0% |
| 23 | text-davinci-002 · proprietary | 81.5% |
| 24 | Gemma 7B · 8.5B | 81.2% |
| 25 | Mistral 7B v0.1 · 7B | 81.0% |
| 26 | Llama-2-13b · proprietary | 80.7% |
| 27 | Megatron-Turing NLG 530B · proprietary | 80.2% |
| 28 | Qwen2.5 Coder 14B · 14.8B | 80.2% |
| 29 | Gopher (280B) · proprietary | 79.2% |
| 30 | LLaMA-13B · proprietary | 79.2% |
| 31 | opt-175b · proprietary | 79.1% |
| 32 | text-davinci-001 · proprietary | 78.9% |
| 33 | internlm-20b · proprietary | 78.1% |
| 34 | davinci · proprietary | 77.5% |
| 35 | GLaM (MoE) · proprietary | 77.2% |
| 36 | Phi-3-small-8k-instruct · proprietary | 77.0% |
| 37 | Qwen2.5 Coder 7B · 7.6B | 76.8% |
| 38 | Phi 3 Mini 4k Instruct · 3.8B | 76.7% |
| 39 | Falcon 7B · 7.2B | 76.4% |
| 40 | Yi-9B · proprietary | 76.4% |
| 41 | mpt-7b · proprietary | 76.1% |
| 42 | opt-66b · proprietary | 74.5% |
| 43 | Bloom · 176.2B | 74.4% |
| 44 | Yi-6B · proprietary | 74.4% |
| 45 | xgen-7b-8k-base · proprietary | 74.2% |
| 46 | open_llama_7b · proprietary | 71.8% |
| 47 | INTELLECT-1-Instruct · proprietary | 71.4% |
| 48 | Gemma 2B · 2.5B | 71.4% |
| 49 | Qwen2.5-Coder-3B · proprietary | 70.9% |
| 50 | Baichuan-2-13B-Base · proprietary | 70.8% |
| 51 | dolly-v2-12b · proprietary | 70.8% |
| 52 | internlm-7b · proprietary | 70.6% |
| 53 | GPT Neox 20B · 20.7B | 70.5% |
| 54 | RedPajama-INCITE-7B-Base · proprietary | 70.3% |
| 55 | opt-13b · proprietary | 69.9% |
| 56 | curie · proprietary | 68.2% |
| 57 | Baichuan-2-7B-Base · proprietary | 68.0% |
| 58 | text-curie-001 · proprietary | 67.6% |
| 59 | GPT J 6B · 6B | 66.2% |
| 60 | Qwen2.5 Coder 1.5B · 1.5B | 61.8% |
| 61 | Cerebras GPT 13B · 13B | 59.4% |
| 62 | vicuna-13b-v1.1 · proprietary | 57.8% |
| 63 | Llama 2 7B · 7B | 57.1% |
| 64 | chatglm2-6b · proprietary | 57.0% |
| 65 | Llama 7B · 6.7B | 56.2% |
| 66 | text-babbage-001 · proprietary | 56.1% |
| 67 | babbage · proprietary | 55.5% |
| 68 | Phi 2 · 2.8B | 53.6% |
| 69 | Qwen2.5 Coder 0.5B · 494M | 48.4% |
| 70 | Phi 1 5 · 1.4B | 47.6% |
| 71 | ada · proprietary | 43.5% |
| 72 | text-ada-001 · proprietary | 42.9% |
| 73 | gpt-neo-2.7B · proprietary | 42.7% |
| 74 | opt-1.3b · proprietary | 41.5% |
| 75 | Stablelm Tuned Alpha 7B · 7B | 40.7% |
| 76 | Gpt2 Xl · 1.6B | 40.0% |
Score vs model size
Which models give the most quality for their size, and so are the ones worth running locally? Every model below is on the efficiency frontier: no open model of the same size or smaller scores higher.
| Model | Size | Score |
|---|---|---|
| Qwen2.5 Coder 0.5B | 494M | 48.4% |
| Qwen2.5 Coder 1.5B | 1.5B | 61.8% |
| Gemma 2B | 2.5B | 71.4% |
| Phi 3 Mini 4k Instruct | 3.8B | 76.7% |
| Mistral 7B v0.1 | 7B | 81.0% |
| Gemma 7B | 8.5B | 81.2% |
| Llama 2 70B HF | 69.0B | 85.3% |
| Falcon 180B | 180B | 89.0% |
| Llama 3.1 405B | 405B | 89.2% |
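The frontier rule (best score at its size or smaller) can be reproduced from the leaderboard with one pass over the open models sorted by size. The sketch below is one possible reading of that rule, not the site's actual code. The `Model` type and `efficiency_frontier` function are hypothetical names, and sizes and scores are copied from the ranking table (494M is written as 0.494 billion).

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions
    score: float     # HellaSwag score in percent


def efficiency_frontier(models: List[Model]) -> List[Model]:
    """Keep each model that outscores every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Smallest first; among equal sizes, the highest score comes first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


# The 22 open models from the ranking table.
OPEN_MODELS = [
    Model("Llama 3.1 405B", 405, 89.2),
    Model("Falcon 180B", 180, 89.0),
    Model("Llama 2 70B HF", 69.0, 85.3),
    Model("Gemma 7B", 8.5, 81.2),
    Model("Mistral 7B v0.1", 7, 81.0),
    Model("Qwen2.5 Coder 14B", 14.8, 80.2),
    Model("Qwen2.5 Coder 7B", 7.6, 76.8),
    Model("Phi 3 Mini 4k Instruct", 3.8, 76.7),
    Model("Falcon 7B", 7.2, 76.4),
    Model("Bloom", 176.2, 74.4),
    Model("Gemma 2B", 2.5, 71.4),
    Model("GPT Neox 20B", 20.7, 70.5),
    Model("GPT J 6B", 6, 66.2),
    Model("Qwen2.5 Coder 1.5B", 1.5, 61.8),
    Model("Cerebras GPT 13B", 13, 59.4),
    Model("Llama 2 7B", 7, 57.1),
    Model("Llama 7B", 6.7, 56.2),
    Model("Phi 2", 2.8, 53.6),
    Model("Qwen2.5 Coder 0.5B", 0.494, 48.4),
    Model("Phi 1 5", 1.4, 47.6),
    Model("Stablelm Tuned Alpha 7B", 7, 40.7),
    Model("Gpt2 Xl", 1.6, 40.0),
]

for m in efficiency_frontier(OPEN_MODELS):
    print(f"{m.name}: {m.params_b}B, {m.score}%")
```

With this data the function returns the same nine models listed above, from Qwen2.5 Coder 0.5B up to Llama 3.1 405B. The strict `>` comparison means a larger model that merely ties a smaller one is left off the frontier, which is an assumption about how ties should be handled.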
HellaSwag: frequently asked questions
- What is the best open LLM on HellaSwag?
  - Llama 3.1 405B is the top open model on HellaSwag, scoring 89.2%. Among all models tested, including proprietary ones, it ranks #3.
- What's the best HellaSwag model you can run on a 24 GB GPU?
  - Gemma 7B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 5 GB), scoring 81.2% on HellaSwag.
- What's the best HellaSwag model you can run on a 12 GB GPU?
  - Gemma 7B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 5 GB), scoring 81.2% on HellaSwag.
- Can open models match proprietary models on HellaSwag?
  - Not quite: the strongest proprietary model (gpt-4-0314) scores 95.3%, 6.1 points ahead of the best open model (Llama 3.1 405B) at 89.2%. The open model, however, is one you can run yourself.
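The GPU answers above rest on simple arithmetic: at 4-bit quantization each parameter takes half a byte, so weight size in GB is roughly the parameter count in billions times 0.5. The sketch below works through that estimate for a few open models from the table. The function names and the 1.2x overhead factor (a rough allowance for runtime buffers and context) are assumptions for illustration, not the site's published method, and real memory use varies with the quantization format and context length.

```python
from typing import List, Optional, Tuple

# (name, parameters in billions, HellaSwag score in percent), from the table.
Candidate = Tuple[str, float, float]


def quantized_weight_gb(params_billions: float, bits: int = 4) -> float:
    """Approximate weight size in GB (10^9 bytes) at the given bit width."""
    return params_billions * bits / 8


def fits(params_billions: float, vram_gb: float, bits: int = 4,
         overhead: float = 1.2) -> bool:
    """True if the weights plus an assumed overhead factor fit in VRAM."""
    return quantized_weight_gb(params_billions, bits) * overhead <= vram_gb


def best_fit(models: List[Candidate], vram_gb: float,
             bits: int = 4) -> Optional[Candidate]:
    """Highest-scoring model that fits, or None if nothing fits."""
    fitting = [m for m in models if fits(m[1], vram_gb, bits)]
    return max(fitting, key=lambda m: m[2], default=None)


CANDIDATES = [
    ("Llama 2 70B HF", 69.0, 85.3),
    ("Gemma 7B", 8.5, 81.2),
    ("Mistral 7B v0.1", 7.0, 81.0),
    ("Qwen2.5 Coder 14B", 14.8, 80.2),
]

print(quantized_weight_gb(8.5))      # 4.25
print(best_fit(CANDIDATES, 24)[0])   # Gemma 7B
print(best_fit(CANDIDATES, 12)[0])   # Gemma 7B
```

Under these assumptions Gemma 7B needs about 4.25 GB for weights, or roughly 5 GB with overhead, while Llama 2 70B HF needs about 34.5 GB for weights alone. That is why the same model is the answer for both the 12 GB and 24 GB cards: the models that fit only in 24 GB do not score higher.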
Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.