What is the best open LLM on HellaSwag?

Llama 3.1 405B is the top open model on HellaSwag, scoring 89.2%. Among all models tested, including proprietary ones, it ranks #3. The top model overall is OpenAI's GPT 4 (Mar 14) at 95.3%.

What's the best HellaSwag model you can run on a 24 GB GPU?

Falcon 40B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 23 GB), scoring 85.3% on HellaSwag.

What's the best HellaSwag model you can run on a 12 GB GPU?

Falcon 11B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 6 GB), scoring 82.9% on HellaSwag.
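The two GPU answers above can be roughly reproduced with a back-of-the-envelope memory estimate. The sketch below is not llmrun's actual method. It assumes 0.5 bytes per parameter at 4-bit plus a flat 10% overhead, a factor chosen only because it lands near the quoted "about 23 GB" and "about 6 GB". The names `est_gb_4bit` and `best_fit` are hypothetical, and the model list is a subset of the leaderboard table on this page.

```python
# (name, parameters in billions, HellaSwag score in %), copied from the
# leaderboard table on this page. Only a subset of rows is included.
MODELS = [
    ("Llama 3.1 405B", 405.9, 89.2),
    ("Falcon 180B", 180.0, 89.0),
    ("Mixtral 8x7B v0.1", 46.7, 86.7),
    ("Llama 2 70B HF", 69.0, 85.3),
    ("Falcon 40B", 41.8, 85.3),
    ("Qwen2.5 Coder 32B", 32.8, 83.0),
    ("Falcon 11B", 11.1, 82.9),
    ("Gemma 7B", 8.5, 82.2),
]

BYTES_PER_PARAM_4BIT = 0.5  # 4 bits = half a byte per weight
OVERHEAD = 1.10             # ASSUMPTION: ~10% extra; not a figure from the source


def est_gb_4bit(params_billion):
    """Estimate the 4-bit memory footprint in GB for a given parameter count."""
    return params_billion * BYTES_PER_PARAM_4BIT * OVERHEAD


def best_fit(vram_gb, models=MODELS):
    """Return the highest-scoring model whose estimate fits in vram_gb, or None."""
    fitting = [m for m in models if est_gb_4bit(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)


if __name__ == "__main__":
    for budget in (24, 12):
        print(budget, "GB:", best_fit(budget))
```

Under these assumptions Mixtral 8x7B v0.1 narrowly misses the 24 GB budget, which is why Falcon 40B comes out on top there. Real memory use also depends on context length and the runtime, so treat the estimate as a first filter rather than a guarantee.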

Can open models match proprietary models on HellaSwag?

Not quite. The strongest proprietary model, GPT 4 (Mar 14), scores 95.3%, which is 6.1 points ahead of the best open model, Llama 3.1 405B, at 89.2%. The difference is that you can run the open one yourself.


HellaSwag Leaderboard

Name: HellaSwag — open LLM scores
Creator: epoch

HellaSwag is a commonsense benchmark that asks a model to pick the most plausible continuation of an everyday situation. The wrong options are adversarially chosen to fool models, testing grounded commonsense understanding.

Source: epoch · 42 open models ranked + 34 proprietary · Data through Dec 2024


Open models ranked on HellaSwag

The # column shows rank among open models / rank overall (including proprietary). The figure after each model name is its parameter count.

#	Model	Score
1 / 3	Llama 3.1 405B · 405.9B	89.2%
2 / 4	Falcon 180B · 180B	89.0%
3 / 5	DeepSeek v3 · 684.5B	88.9%
4 / 6	DeepSeek v2 · 235.7B	87.1%
5 / 8	Mixtral 8x7B v0.1 · 46.7B	86.7%
6 / 10	Llama 2 70B HF · 69.0B	85.3%
7 / 11	Falcon 40B · 41.8B	85.3%
8 / 12	Qwen2.5 72B · 72.7B	84.8%
9 / 14	StableBeluga2 · 70B	84.1%
10 / 17	Qwen2.5 Coder 32B · 32.8B	83.0%
11 / 18	Falcon 11B · 11.1B	82.9%
12 / 23	Gemma 7B · 8.5B	82.2%
13 / 26	Mistral 7B v0.1 · 7B	81.0%
14 / 27	Llama 2 13B HF · 13.0B	80.7%
15 / 28	Qwen2.5 Coder 14B · 14.8B	80.2%
16 / 33	Falcon 7B · 7.2B	78.1%
17 / 34	Internlm 20B · 20B	78.1%
18 / 37	Llama 2 7B HF · 6.7B	77.2%
19 / 38	Phi 3 Small 8k Instruct · 7.4B	77.0%
20 / 39	Qwen2.5 Coder 7B · 7.6B	76.8%
21 / 40	Phi 3 Mini 4k Instruct · 3.8B	76.7%
22 / 42	Yi 9B · 8.8B	76.4%
23 / 43	Llama 7B · 6.7B	76.2%
24 / 45	Bloom · 176.2B	74.4%
25 / 46	Yi 6B · 6.1B	74.4%
26 / 47	Xgen 7B 8k Base · 7B	74.2%
27 / 49	INTELLECT 1 Instruct · 10.2B	71.4%
28 / 50	Gemma 2B · 2.5B	71.4%
29 / 51	Qwen2.5 Coder 3B · 3.1B	70.9%
30 / 52	Baichuan2 13B Base · 13B	70.8%
31 / 54	Internlm 7B · 7B	70.6%
32 / 55	GPT Neox 20B · 20.7B	70.5%
33 / 59	Baichuan2 7B Base · 7B	68.0%
34 / 61	GPT J 6B · 6B	66.2%
35 / 62	Qwen2.5 Coder 1.5B · 1.5B	61.8%
36 / 63	Cerebras GPT 13B · 13B	59.4%
37 / 65	Chatglm2 6B · 6B	57.0%
38 / 68	Phi 2 · 2.8B	53.6%
39 / 69	Qwen2.5 Coder 0.5B · 494M	48.4%
40 / 70	Phi 1 5 · 1.4B	47.6%
41 / 75	Stablelm Tuned Alpha 7B · 7B	40.7%
42 / 76	Gpt2 Xl · 1.6B	40.0%

Score vs model size

This chart shows which models give the most quality for their size: the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
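The frontier described above can be computed with a single pass over the models sorted by size. This is a minimal sketch of one way to do it, not necessarily how the chart was built. The function name `efficiency_frontier` is hypothetical, and the sample is a handful of rows from the table on this page, so the result is the frontier of that sample only.

```python
# (name, parameters in billions, HellaSwag score in %): a sample of rows
# from the leaderboard table, not the full list.
SAMPLE = [
    ("Qwen2.5 Coder 0.5B", 0.494, 48.4),
    ("Phi 1 5", 1.4, 47.6),
    ("Gemma 2B", 2.5, 71.4),
    ("Phi 3 Mini 4k Instruct", 3.8, 76.7),
    ("Mistral 7B v0.1", 7.0, 81.0),
    ("Gemma 7B", 8.5, 82.2),
    ("Falcon 11B", 11.1, 82.9),
    ("Llama 2 13B HF", 13.0, 80.7),
    ("Llama 3.1 405B", 405.9, 89.2),
    ("DeepSeek v3", 684.5, 88.9),
]


def efficiency_frontier(models):
    """Return models that score strictly higher than every smaller-or-equal model."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; on equal size, visit the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append((name, size, score))
            best_so_far = score
    return frontier


if __name__ == "__main__":
    for name, size, score in efficiency_frontier(SAMPLE):
        print(f"{name:24s} {size:7.1f}B  {score:.1f}%")
```

In this sample, Phi 1 5, Llama 2 13B HF, and DeepSeek v3 fall below the frontier because a smaller model already scores higher than each of them.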

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.