HellaSwag Leaderboard
HellaSwag is a commonsense benchmark that asks a model to pick the most plausible continuation of an everyday situation. The wrong options are adversarially chosen to fool models, testing grounded commonsense understanding.
Source: epoch · 22 open models ranked + 54 proprietary · Data through Dec 2024
Open models ranked on HellaSwag
The # column shows rank among open models / rank overall (including proprietary).
| # | Model | Score |
|---|---|---|
| 1 / 3 | Llama 3.1 405B · 405B | 89.2% |
| 2 / 4 | Falcon 180B · 180B | 89.0% |
| 3 / 8 | Llama 2 70B HF · 69.0B | 85.3% |
| 4 / 24 | Gemma 7B · 8.5B | 81.2% |
| 5 / 25 | Mistral 7B v0.1 · 7B | 81.0% |
| 6 / 28 | Qwen2.5 Coder 14B · 14.8B | 80.2% |
| 7 / 37 | Qwen2.5 Coder 7B · 7.6B | 76.8% |
| 8 / 38 | Phi 3 Mini 4k Instruct · 3.8B | 76.7% |
| 9 / 39 | Falcon 7B · 7.2B | 76.4% |
| 10 / 43 | Bloom · 176.2B | 74.4% |
| 11 / 48 | Gemma 2B · 2.5B | 71.4% |
| 12 / 53 | GPT-NeoX 20B · 20.7B | 70.5% |
| 13 / 59 | GPT-J 6B · 6B | 66.2% |
| 14 / 60 | Qwen2.5 Coder 1.5B · 1.5B | 61.8% |
| 15 / 61 | Cerebras GPT 13B · 13B | 59.4% |
| 16 / 63 | Llama 2 7B · 7B | 57.1% |
| 17 / 65 | Llama 7B · 6.7B | 56.2% |
| 18 / 68 | Phi 2 · 2.8B | 53.6% |
| 19 / 69 | Qwen2.5 Coder 0.5B · 494M | 48.4% |
| 20 / 70 | Phi 1.5 · 1.4B | 47.6% |
| 21 / 75 | StableLM Tuned Alpha 7B · 7B | 40.7% |
| 22 / 76 | GPT-2 XL · 1.6B | 40.0% |
Score vs model size
Which models give the most quality for their size — the ones worth running locally.
Models on the efficiency frontier (best score at its size or smaller):
- Qwen2.5 Coder 0.5B, 494M, score 48.4%
- Qwen2.5 Coder 1.5B, 1.5B, score 61.8%
- Gemma 2B, 2.5B, score 71.4%
- Phi 3 Mini 4k Instruct, 3.8B, score 76.7%
- Mistral 7B v0.1, 7B, score 81.0%
- Gemma 7B, 8.5B, score 81.2%
- Llama 2 70B HF, 69.0B, score 85.3%
- Falcon 180B, 180B, score 89.0%
- Llama 3.1 405B, 405B, score 89.2%
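The frontier rule used here (best score at its size or smaller) can be reproduced from the table. The following is a minimal sketch, not the site's own implementation: the `MODELS` list and the `efficiency_frontier` function are illustrative names, and the sizes and scores are copied from the table of open models.

```python
# (name, parameters in billions, HellaSwag score in %), copied from the table
MODELS = [
    ("Llama 3.1 405B", 405.0, 89.2),
    ("Falcon 180B", 180.0, 89.0),
    ("Llama 2 70B HF", 69.0, 85.3),
    ("Gemma 7B", 8.5, 81.2),
    ("Mistral 7B v0.1", 7.0, 81.0),
    ("Qwen2.5 Coder 14B", 14.8, 80.2),
    ("Qwen2.5 Coder 7B", 7.6, 76.8),
    ("Phi 3 Mini 4k Instruct", 3.8, 76.7),
    ("Falcon 7B", 7.2, 76.4),
    ("Bloom", 176.2, 74.4),
    ("Gemma 2B", 2.5, 71.4),
    ("GPT-NeoX 20B", 20.7, 70.5),
    ("GPT-J 6B", 6.0, 66.2),
    ("Qwen2.5 Coder 1.5B", 1.5, 61.8),
    ("Cerebras GPT 13B", 13.0, 59.4),
    ("Llama 2 7B", 7.0, 57.1),
    ("Llama 7B", 6.7, 56.2),
    ("Phi 2", 2.8, 53.6),
    ("Qwen2.5 Coder 0.5B", 0.494, 48.4),
    ("Phi 1.5", 1.4, 47.6),
    ("StableLM Tuned Alpha 7B", 7.0, 40.7),
    ("GPT-2 XL", 1.6, 40.0),
]


def efficiency_frontier(models):
    """Return the models that outscore every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append((name, size, score))
            best_so_far = score
    return frontier


for name, size, score in efficiency_frontier(MODELS):
    print(f"{name}: {size}B, {score}%")
```

Run against the 22 rows above, this yields the same nine models listed in this section, from Qwen2.5 Coder 0.5B up to Llama 3.1 405B.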
HellaSwag: frequently asked questions
- What is the best open LLM on HellaSwag?
- Llama 3.1 405B is the top open model on HellaSwag, scoring 89.2%. Among all models tested — including proprietary ones — it ranks #3.
- What's the best HellaSwag model you can run on a 24 GB GPU?
- Gemma 7B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 5 GB), scoring 81.2% on HellaSwag.
- What's the best HellaSwag model you can run on a 12 GB GPU?
- Gemma 7B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 5 GB), scoring 81.2% on HellaSwag.
- Can open models match proprietary models on HellaSwag?
- Not quite on HellaSwag: the strongest proprietary model (gpt-4-0314) scores 95.3%, ahead of the best open model (Llama 3.1 405B) at 89.2% — but you can run the open one yourself.
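The GPU answers above rest on a memory estimate for 4-bit weights. A rough sketch of that arithmetic follows, under stated assumptions: weights take parameters × bits / 8 bytes, and a hypothetical overhead factor of 1.2 covers the KV cache and runtime buffers. That factor is an assumption for illustration, not a figure from the source, and `estimated_gb` and `best_fit` are illustrative names.

```python
# Top six open models from the table: (name, parameters in billions, score in %)
TOP_MODELS = [
    ("Llama 3.1 405B", 405.0, 89.2),
    ("Falcon 180B", 180.0, 89.0),
    ("Llama 2 70B HF", 69.0, 85.3),
    ("Gemma 7B", 8.5, 81.2),
    ("Mistral 7B v0.1", 7.0, 81.0),
    ("Qwen2.5 Coder 14B", 14.8, 80.2),
]

# Assumed allowance for KV cache and runtime buffers (not from the source).
OVERHEAD = 1.2


def estimated_gb(params_billions, bits=4, overhead=OVERHEAD):
    """Estimate memory in GB: billions of params * bytes per param * overhead."""
    return params_billions * bits / 8 * overhead


def best_fit(models, vram_gb, bits=4):
    """Return the highest-scoring model whose estimate fits in vram_gb, or None."""
    fitting = [m for m in models if estimated_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


for vram in (12, 24):
    name, size, score = best_fit(TOP_MODELS, vram)
    print(f"{vram} GB: {name} (~{estimated_gb(size):.1f} GB at 4-bit, {score}%)")
```

Under these assumptions Gemma 7B comes out at roughly 5 GB and is the pick for both 12 GB and 24 GB, while Llama 2 70B HF needs about 41 GB and fits in neither.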
Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the about benchmarks page for what it measures.