
HellaSwag Leaderboard

HellaSwag is a commonsense benchmark that asks a model to pick the most plausible continuation of an everyday situation. The wrong options are adversarially chosen to fool models, testing grounded commonsense understanding.

Source: epoch · 22 open models ranked + 54 proprietary · Data through Dec 2024

Open models ranked on HellaSwag

# shows rank among open models / rank overall (including proprietary).

#         Model                      Size      Score
1 / 3     Llama 3.1 405B             405B      89.2%
2 / 4     Falcon 180B                180B      89.0%
3 / 8     Llama 2 70B HF             69.0B     85.3%
4 / 24    Gemma 7B                   8.5B      81.2%
5 / 25    Mistral 7B v0.1            7B        81.0%
6 / 28    Qwen2.5 Coder 14B          14.8B     80.2%
7 / 37    Qwen2.5 Coder 7B           7.6B      76.8%
8 / 38    Phi 3 Mini 4k Instruct     3.8B      76.7%
9 / 39    Falcon 7B                  7.2B      76.4%
10 / 43   Bloom                      176.2B    74.4%
11 / 48   Gemma 2B                   2.5B      71.4%
12 / 53   GPT Neox 20B               20.7B     70.5%
13 / 59   GPT J 6B                   6B        66.2%
14 / 60   Qwen2.5 Coder 1.5B         1.5B      61.8%
15 / 61   Cerebras GPT 13B           13B       59.4%
16 / 63   Llama 2 7B                 7B        57.1%
17 / 65   Llama 7B                   6.7B      56.2%
18 / 68   Phi 2                      2.8B      53.6%
19 / 69   Qwen2.5 Coder 0.5B         494M      48.4%
20 / 70   Phi 1 5                    1.4B      47.6%
21 / 75   Stablelm Tuned Alpha 7B    7B        40.7%
22 / 76   Gpt2 Xl                    1.6B      40.0%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: HellaSwag score (vertical axis, 40.0% to 89.2%) against model size (horizontal axis, log scale, ticks at 1B, 10B and 100B), with one labeled dot per open model.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
Models on the efficiency frontier (best score at their size or smaller):
  • Qwen2.5 Coder 0.5B · 494M · 48.4%
  • Qwen2.5 Coder 1.5B · 1.5B · 61.8%
  • Gemma 2B · 2.5B · 71.4%
  • Phi 3 Mini 4k Instruct · 3.8B · 76.7%
  • Mistral 7B v0.1 · 7B · 81.0%
  • Gemma 7B · 8.5B · 81.2%
  • Llama 2 70B HF · 69.0B · 85.3%
  • Falcon 180B · 180B · 89.0%
  • Llama 3.1 405B · 405B · 89.2%
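
The frontier list above can be reproduced from the leaderboard table with a single pass over the models sorted by size. The following is a minimal sketch, not the site's actual implementation: the names `Model`, `MODELS` and `efficiency_frontier` are hypothetical, sizes are taken from the table (in billions of parameters), and ties in size are assumed to be broken in favor of the higher score.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    size_b: float  # parameter count in billions (from the table)
    score: float   # HellaSwag score in percent (from the table)


MODELS = [
    Model("Llama 3.1 405B", 405.0, 89.2),
    Model("Falcon 180B", 180.0, 89.0),
    Model("Llama 2 70B HF", 69.0, 85.3),
    Model("Gemma 7B", 8.5, 81.2),
    Model("Mistral 7B v0.1", 7.0, 81.0),
    Model("Qwen2.5 Coder 14B", 14.8, 80.2),
    Model("Qwen2.5 Coder 7B", 7.6, 76.8),
    Model("Phi 3 Mini 4k Instruct", 3.8, 76.7),
    Model("Falcon 7B", 7.2, 76.4),
    Model("Bloom", 176.2, 74.4),
    Model("Gemma 2B", 2.5, 71.4),
    Model("GPT Neox 20B", 20.7, 70.5),
    Model("GPT J 6B", 6.0, 66.2),
    Model("Qwen2.5 Coder 1.5B", 1.5, 61.8),
    Model("Cerebras GPT 13B", 13.0, 59.4),
    Model("Llama 2 7B", 7.0, 57.1),
    Model("Llama 7B", 6.7, 56.2),
    Model("Phi 2", 2.8, 53.6),
    Model("Qwen2.5 Coder 0.5B", 0.494, 48.4),
    Model("Phi 1 5", 1.4, 47.6),
    Model("Stablelm Tuned Alpha 7B", 7.0, 40.7),
    Model("Gpt2 Xl", 1.6, 40.0),
]


def efficiency_frontier(models):
    """Return models that beat every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Smallest first; among equal sizes, the highest score comes first.
    for m in sorted(models, key=lambda m: (m.size_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


if __name__ == "__main__":
    for m in efficiency_frontier(MODELS):
        print(f"{m.name} · {m.size_b}B · {m.score}%")
```

Run on the table's data, this sketch selects the same nine models as the list above. A model such as Qwen2.5 Coder 14B is left out because a smaller model (Gemma 7B) already scores higher.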

HellaSwag: frequently asked questions

What is the best open LLM on HellaSwag?
Llama 3.1 405B is the top open model on HellaSwag, scoring 89.2%. Among all models tested — including proprietary ones — it ranks #3.
What's the best HellaSwag model you can run on a 24 GB GPU?
Gemma 7B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 5 GB), scoring 81.2% on HellaSwag.
What's the best HellaSwag model you can run on a 12 GB GPU?
Gemma 7B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 5 GB), scoring 81.2% on HellaSwag.
Can open models match proprietary models on HellaSwag?
Not quite: the strongest proprietary model (gpt-4-0314) scores 95.3%, ahead of the best open model (Llama 3.1 405B) at 89.2%. The open model, however, is one you can run yourself.
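
The GPU answers above depend on estimating how much memory a model needs at 4-bit quantization. The sketch below shows one rough way to do that arithmetic; it is an illustration under stated assumptions, not the method this page uses. It assumes 0.5 bytes per parameter for 4-bit weights and a hypothetical 20% overhead factor (`OVERHEAD`) for quantization metadata and runtime buffers, which happens to put Gemma 7B (8.5B parameters) at about 5 GB. The model list is a subset of the table, and all names are hypothetical.

```python
BYTES_PER_PARAM_4BIT = 0.5  # 4 bits = half a byte per weight
OVERHEAD = 1.2              # assumed allowance for scales, buffers, etc.

# (name, parameters in billions, HellaSwag score in percent): a subset of the table
MODELS = [
    ("Llama 3.1 405B", 405.0, 89.2),
    ("Falcon 180B", 180.0, 89.0),
    ("Llama 2 70B HF", 69.0, 85.3),
    ("Gemma 7B", 8.5, 81.2),
    ("Mistral 7B v0.1", 7.0, 81.0),
    ("Qwen2.5 Coder 14B", 14.8, 80.2),
    ("Phi 3 Mini 4k Instruct", 3.8, 76.7),
]


def estimated_gb(size_b):
    """Rough 4-bit memory footprint in GB for size_b billion parameters."""
    return size_b * BYTES_PER_PARAM_4BIT * OVERHEAD


def best_fit(models, vram_gb):
    """Highest-scoring model whose estimated footprint fits in vram_gb, or None."""
    fitting = [m for m in models if estimated_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)


if __name__ == "__main__":
    for budget in (24, 12):
        print(budget, "GB:", best_fit(MODELS, budget))
```

Under these assumptions Llama 2 70B HF needs roughly 41 GB and does not fit in 24 GB, so Gemma 7B is the best fit for both the 24 GB and the 12 GB budget, consistent with the answers above. Real memory use also depends on context length and the inference runtime, which this estimate ignores.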

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.