HellaSwag Leaderboard

HellaSwag is a commonsense benchmark that asks a model to pick the most plausible continuation of an everyday situation. The wrong options are adversarially chosen to fool models, testing grounded commonsense understanding.

Source: epoch · 22 open models ranked + 54 proprietary · Data through Dec 2024

All models ranked on HellaSwag

Proprietary / closed models are shown dimmed — you can't run them locally, but they show where the open field stands.

#. Model · size (or proprietary) · score
1. gpt-4-0314 · proprietary · 95.3%
2. gpt-4-32k-0314 · proprietary · 95.3%
3. Llama 3.1 405B · 405B · 89.2%
4. Falcon 180B · 180B · 89.0%
5. DeepSeek-V3 · proprietary · 88.9%
6. DeepSeek-V2 · proprietary · 87.1%
7. PaLM 2-L · proprietary · 86.8%
8. Llama 2 70B HF · 69.0B · 85.3%
9. falcon-40b · proprietary · 85.3%
10. Qwen2.5-72B · proprietary · 84.8%
11. Mixtral-8x7B-v0.1 · proprietary · 84.4%
12. LLaMA-65B · proprietary · 84.2%
13. StableBeluga2 · proprietary · 84.1%
14. PaLM 2-M · proprietary · 84.0%
15. PaLM 540B · proprietary · 83.6%
16. Qwen2.5-Coder-32B · proprietary · 83.0%
17. falcon-11b · proprietary · 82.9%
18. LLaMA-33B · proprietary · 82.8%
19. Nemotron-4 15B · proprietary · 82.4%
20. Phi-3-medium-128k-instruct · proprietary · 82.4%
21. text-davinci-003 · proprietary · 82.2%
22. PaLM 2-S · proprietary · 82.0%
23. text-davinci-002 · proprietary · 81.5%
24. Gemma 7B · 8.5B · 81.2%
25. Mistral 7B v0.1 · 7B · 81.0%
26. Llama-2-13b · proprietary · 80.7%
27. Megatron-Turing NLG 530B · proprietary · 80.2%
28. Qwen2.5 Coder 14B · 14.8B · 80.2%
29. Gopher (280B) · proprietary · 79.2%
30. LLaMA-13B · proprietary · 79.2%
31. opt-175b · proprietary · 79.1%
32. text-davinci-001 · proprietary · 78.9%
33. internlm-20b · proprietary · 78.1%
34. davinci · proprietary · 77.5%
35. GLaM (MoE) · proprietary · 77.2%
36. Phi-3-small-8k-instruct · proprietary · 77.0%
37. Qwen2.5 Coder 7B · 7.6B · 76.8%
38. Phi 3 Mini 4k Instruct · 3.8B · 76.7%
39. Falcon 7B · 7.2B · 76.4%
40. Yi-9B · proprietary · 76.4%
41. mpt-7b · proprietary · 76.1%
42. opt-66b · proprietary · 74.5%
43. Bloom · 176.2B · 74.4%
44. Yi-6B · proprietary · 74.4%
45. xgen-7b-8k-base · proprietary · 74.2%
46. open_llama_7b · proprietary · 71.8%
47. INTELLECT-1-Instruct · proprietary · 71.4%
48. Gemma 2B · 2.5B · 71.4%
49. Qwen2.5-Coder-3B · proprietary · 70.9%
50. Baichuan-2-13B-Base · proprietary · 70.8%
51. dolly-v2-12b · proprietary · 70.8%
52. internlm-7b · proprietary · 70.6%
53. GPT Neox 20B · 20.7B · 70.5%
54. RedPajama-INCITE-7B-Base · proprietary · 70.3%
55. opt-13b · proprietary · 69.9%
56. curie · proprietary · 68.2%
57. Baichuan-2-7B-Base · proprietary · 68.0%
58. text-curie-001 · proprietary · 67.6%
59. GPT J 6B · 6B · 66.2%
60. Qwen2.5 Coder 1.5B · 1.5B · 61.8%
61. Cerebras GPT 13B · 13B · 59.4%
62. vicuna-13b-v1.1 · proprietary · 57.8%
63. Llama 2 7B · 7B · 57.1%
64. chatglm2-6b · proprietary · 57.0%
65. Llama 7B · 6.7B · 56.2%
66. text-babbage-001 · proprietary · 56.1%
67. babbage · proprietary · 55.5%
68. Phi 2 · 2.8B · 53.6%
69. Qwen2.5 Coder 0.5B · 494M · 48.4%
70. Phi 1 5 · 1.4B · 47.6%
71. ada · proprietary · 43.5%
72. text-ada-001 · proprietary · 42.9%
73. gpt-neo-2.7B · proprietary · 42.7%
74. opt-1.3b · proprietary · 41.5%
75. Stablelm Tuned Alpha 7B · 7B · 40.7%
76. Gpt2 Xl · 1.6B · 40.0%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Chart: HellaSwag score (40.0% to 89.2%, vertical axis) against model size (log scale, ticks at 1B, 10B and 100B, horizontal axis) for the 22 open models.
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Qwen2.5 Coder 0.5B, 494M, score 48.4% — on the efficiency frontier (best score at its size or smaller).
  • Qwen2.5 Coder 1.5B, 2B, score 61.8% — on the efficiency frontier (best score at its size or smaller).
  • Gemma 2B, 3B, score 71.4% — on the efficiency frontier (best score at its size or smaller).
  • Phi 3 Mini 4k Instruct, 4B, score 76.7% — on the efficiency frontier (best score at its size or smaller).
  • Mistral 7B v0.1, 7B, score 81.0% — on the efficiency frontier (best score at its size or smaller).
  • Gemma 7B, 9B, score 81.2% — on the efficiency frontier (best score at its size or smaller).
  • Llama 2 70B HF, 69B, score 85.3% — on the efficiency frontier (best score at its size or smaller).
  • Falcon 180B, 180B, score 89.0% — on the efficiency frontier (best score at its size or smaller).
  • Llama 3.1 405B, 405B, score 89.2% — on the efficiency frontier (best score at its size or smaller).
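
The frontier rule described above (the best score available at each size or smaller) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the page's own code: the function name `efficiency_frontier` and the tie handling (a model must strictly beat every model of equal or smaller size) are assumptions. The data is the list of open models and sizes from the leaderboard.

```python
# Open models from the leaderboard: (name, parameters in billions, HellaSwag score in %).
MODELS = [
    ("Llama 3.1 405B", 405.0, 89.2),
    ("Falcon 180B", 180.0, 89.0),
    ("Llama 2 70B HF", 69.0, 85.3),
    ("Gemma 7B", 8.5, 81.2),
    ("Mistral 7B v0.1", 7.0, 81.0),
    ("Qwen2.5 Coder 14B", 14.8, 80.2),
    ("Qwen2.5 Coder 7B", 7.6, 76.8),
    ("Phi 3 Mini 4k Instruct", 3.8, 76.7),
    ("Falcon 7B", 7.2, 76.4),
    ("Bloom", 176.2, 74.4),
    ("Gemma 2B", 2.5, 71.4),
    ("GPT Neox 20B", 20.7, 70.5),
    ("GPT J 6B", 6.0, 66.2),
    ("Qwen2.5 Coder 1.5B", 1.5, 61.8),
    ("Cerebras GPT 13B", 13.0, 59.4),
    ("Llama 2 7B", 7.0, 57.1),
    ("Llama 7B", 6.7, 56.2),
    ("Phi 2", 2.8, 53.6),
    ("Qwen2.5 Coder 0.5B", 0.494, 48.4),
    ("Phi 1 5", 1.4, 47.6),
    ("Stablelm Tuned Alpha 7B", 7.0, 40.7),
    ("Gpt2 Xl", 1.6, 40.0),
]


def efficiency_frontier(models):
    """Return names of models that beat every model of equal or smaller size.

    Hypothetical helper: sorts by size (ties broken by higher score first)
    and keeps a model only if it raises the running best score.
    """
    frontier = []
    best_score = float("-inf")
    for name, _size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:  # assumption: strict improvement is required
            frontier.append(name)
            best_score = score
    return frontier


if __name__ == "__main__":
    for name in efficiency_frontier(MODELS):
        print(name)
```

Under these assumptions the function returns the same nine models as the list above, from Qwen2.5 Coder 0.5B up to Llama 3.1 405B.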

HellaSwag: frequently asked questions

What is the best open LLM on HellaSwag?
Llama 3.1 405B is the top open model on HellaSwag, scoring 89.2%. Among all models tested — including proprietary ones — it ranks #3.
What's the best HellaSwag model you can run on a 24 GB GPU?
Gemma 7B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 5 GB), scoring 81.2% on HellaSwag.
What's the best HellaSwag model you can run on a 12 GB GPU?
Gemma 7B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 5 GB), scoring 81.2% on HellaSwag.
Can open models match proprietary models on HellaSwag?
Not quite: the strongest proprietary model (gpt-4-0314) scores 95.3%, 6.1 points ahead of the best open model (Llama 3.1 405B) at 89.2%. The open model, though, is one you can run yourself.
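
The GPU answers above rest on a rough memory estimate for 4-bit quantized weights. The page does not state how it derives its "about 5 GB" figure, so the following is only a sketch under stated assumptions: weights take `bits / 8` bytes per parameter, and a hypothetical 20% overhead factor stands in for runtime buffers. The helper names and the overhead value are illustrative, and the model list is a subset of the open models on the leaderboard.

```python
# Subset of open models from the leaderboard: (name, parameters in billions, score in %).
OPEN_MODELS = [
    ("Llama 3.1 405B", 405.0, 89.2),
    ("Falcon 180B", 180.0, 89.0),
    ("Llama 2 70B HF", 69.0, 85.3),
    ("Gemma 7B", 8.5, 81.2),
    ("Mistral 7B v0.1", 7.0, 81.0),
    ("Qwen2.5 Coder 14B", 14.8, 80.2),
    ("Phi 3 Mini 4k Instruct", 3.8, 76.7),
    ("Gemma 2B", 2.5, 71.4),
]


def quantized_footprint_gb(params_billions, bits_per_weight=4, overhead=1.2):
    """Estimate memory in GB for quantized weights.

    Assumptions (not from the source): 1 GB = 1e9 bytes, and a flat
    `overhead` multiplier approximates buffers beyond the raw weights.
    """
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead


def best_model_for_vram(models, vram_gb):
    """Return the highest-scoring model whose estimate fits in vram_gb, or None."""
    fitting = [m for m in models if quantized_footprint_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


if __name__ == "__main__":
    for vram in (12, 24):
        name, size, score = best_model_for_vram(OPEN_MODELS, vram)
        print(f"{vram} GB: {name} ({quantized_footprint_gb(size):.1f} GB est., {score}%)")
```

With these assumptions Gemma 7B (8.5B parameters) comes out at roughly 5 GB and is the pick for both 12 GB and 24 GB, while Llama 2 70B HF is estimated above 24 GB. Real footprints vary with the quantization format and context length.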

Scores aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.