What is the best open LLM on HellaSwag?

Llama 3.1 405B is the top open model on HellaSwag, scoring 89.2%. Among all models tested, including proprietary ones, it ranks #3. The top model overall is OpenAI's GPT 4 (Mar 14) at 95.3%.

What's the best HellaSwag model you can run on a 24 GB GPU?

Falcon 40B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 23 GB), scoring 85.3% on HellaSwag.

What's the best HellaSwag model you can run on a 12 GB GPU?

Falcon 11B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 6 GB), scoring 82.9% on HellaSwag.
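
The memory figures in the two answers above can be approximated with simple arithmetic. The sketch below is not the site's actual formula: it assumes 4 bits (0.5 bytes) per parameter plus a flat 10% overhead, a value chosen only because it reproduces the "about 23 GB" and "about 6 GB" figures quoted here. The names `estimate_vram_gb`, `best_open_model`, `OVERHEAD`, and `OPEN_MODELS` are hypothetical, and the model list is a small subset of the leaderboard's open entries. Real memory use also depends on context length, KV cache, and the runtime.

```python
# Rough VRAM estimate for quantized models (illustrative assumptions only).
OVERHEAD = 0.10  # assumed flat overhead on top of the weights; not from the source

# (name, parameters in billions, HellaSwag score in %), copied from the leaderboard
OPEN_MODELS = [
    ("Mixtral 8x7B v0.1", 46.7, 86.7),
    ("Llama 2 70B HF", 69.0, 85.3),
    ("Falcon 40B", 41.8, 85.3),
    ("Qwen2.5 Coder 32B", 32.8, 83.0),
    ("Falcon 11B", 11.1, 82.9),
    ("Gemma 7B", 8.5, 82.2),
]


def estimate_vram_gb(params_billion, bits=4, overhead=OVERHEAD):
    """Estimate the memory footprint in GB (1 GB = 1e9 bytes)."""
    weights_gb = params_billion * bits / 8  # billions of params * bytes per param
    return weights_gb * (1 + overhead)


def best_open_model(vram_gb, bits=4):
    """Return the highest-scoring listed model whose estimate fits, or None."""
    candidates = [m for m in OPEN_MODELS if estimate_vram_gb(m[1], bits) <= vram_gb]
    return max(candidates, key=lambda m: m[2]) if candidates else None


if __name__ == "__main__":
    for budget in (24, 12):
        name, size, score = best_open_model(budget)
        print(f"{budget} GB: {name} (~{estimate_vram_gb(size):.0f} GB, {score}%)")
```

Under these assumptions, Mixtral 8x7B v0.1 and Llama 2 70B HF come out above 24 GB, which is why Falcon 40B is the pick at that budget despite Mixtral's higher score.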

Can open models match proprietary models on HellaSwag?

Not quite: the strongest proprietary model, GPT 4 (Mar 14), scores 95.3%, which is 6.1 percentage points ahead of the best open model, Llama 3.1 405B, at 89.2%. The open model, however, is one you can run yourself.


HellaSwag Leaderboard

Name: HellaSwag — open LLM scores
Creator: epoch

HellaSwag is a commonsense benchmark that asks a model to pick the most plausible continuation of an everyday situation. The wrong options are adversarially chosen to fool models, testing grounded commonsense understanding.

Source: epoch · 42 open models ranked + 34 proprietary · Data through Dec 2024


All models ranked on HellaSwag

Proprietary (closed) models are marked "proprietary". You can't run them locally, but they show where the open field stands.

#	Model	Score
1	GPT 4 (Mar 14) · proprietary	95.3%
2	GPT 4 32K (Mar 14) · proprietary	95.3%
3	Llama 3.1 405B · 405.9B	89.2%
4	Falcon 180B · 180B	89.0%
5	DeepSeek v3 · 684.5B	88.9%
6	DeepSeek v2 · 235.7B	87.1%
7	PaLM 2 L · proprietary	86.8%
8	Mixtral 8x7B v0.1 · 46.7B	86.7%
9	Text Davinci 003 · proprietary	85.5%
10	Llama 2 70B HF · 69.0B	85.3%
11	Falcon 40B · 41.8B	85.3%
12	Qwen2.5 72B · 72.7B	84.8%
13	Llama 65B · proprietary	84.2%
14	StableBeluga2 · 70B	84.1%
15	PaLM 2 M · proprietary	84.0%
16	PaLM 540B · proprietary	83.8%
17	Qwen2.5 Coder 32B · 32.8B	83.0%
18	Falcon 11B · 11.1B	82.9%
19	Llama 33B · proprietary	82.8%
20	Megatron Turing NLG 530B · proprietary	82.4%
21	Nemotron 4 15B · proprietary	82.4%
22	Phi 3 Medium 128K Instruct · proprietary	82.4%
23	Gemma 7B · 8.5B	82.2%
24	PaLM 2 S · proprietary	82.0%
25	Text Davinci 002 · proprietary	81.5%
26	Mistral 7B v0.1 · 7B	81.0%
27	Llama 2 13B HF · 13.0B	80.7%
28	Qwen2.5 Coder 14B · 14.8B	80.2%
29	Text Davinci 001 · proprietary	79.3%
30	Gopher (280B) · proprietary	79.2%
31	Llama 13B · proprietary	79.2%
32	Opt 175B · proprietary	79.1%
33	Falcon 7B · 7.2B	78.1%
34	Internlm 20B · 20B	78.1%
35	Davinci · proprietary	77.5%
36	GLaM (MoE) · proprietary	77.2%
37	Llama 2 7B HF · 6.7B	77.2%
38	Phi 3 Small 8k Instruct · 7.4B	77.0%
39	Qwen2.5 Coder 7B · 7.6B	76.8%
40	Phi 3 Mini 4k Instruct · 3.8B	76.7%
41	Mpt 7B · proprietary	76.4%
42	Yi 9B · 8.8B	76.4%
43	Llama 7B · 6.7B	76.2%
44	Opt 66B · proprietary	74.5%
45	Bloom · 176.2B	74.4%
46	Yi 6B · 6.1B	74.4%
47	Xgen 7B 8k Base · 7B	74.2%
48	Open Llama 7B · proprietary	71.8%
49	INTELLECT 1 Instruct · 10.2B	71.4%
50	Gemma 2B · 2.5B	71.4%
51	Qwen2.5 Coder 3B · 3.1B	70.9%
52	Baichuan2 13B Base · 13B	70.8%
53	Dolly v2 12B · proprietary	70.8%
54	Internlm 7B · 7B	70.6%
55	GPT Neox 20B · 20.7B	70.5%
56	RedPajama INCITE 7B Base · proprietary	70.3%
57	Opt 13B · proprietary	69.9%
58	Curie · proprietary	68.2%
59	Baichuan2 7B Base · 7B	68.0%
60	Text Curie 001 · proprietary	67.6%
61	GPT J 6B · 6B	66.2%
62	Qwen2.5 Coder 1.5B · 1.5B	61.8%
63	Cerebras GPT 13B · 13B	59.4%
64	Vicuna 13B v1.1 · proprietary	57.8%
65	Chatglm2 6B · 6B	57.0%
66	Text Babbage 001 · proprietary	56.1%
67	Babbage · proprietary	55.5%
68	Phi 2 · 2.8B	53.6%
69	Qwen2.5 Coder 0.5B · 494M	48.4%
70	Phi 1 5 · 1.4B	47.6%
71	Ada · proprietary	43.5%
72	Text Ada 001 · proprietary	42.9%
73	GPT Neo 2.7B · proprietary	42.7%
74	Opt 1.3b · proprietary	41.5%
75	Stablelm Tuned Alpha 7B · 7B	40.7%
76	Gpt2 Xl · 1.6B	40.0%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
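
The efficiency frontier described above can be computed by sorting models by size and keeping each one that beats every smaller (or equal-sized) model. The following is a minimal sketch of that idea, not the site's implementation. The function name `efficiency_frontier` is hypothetical, and the data is a hand-picked subset of the leaderboard's open models, so the result applies to this subset only.

```python
# Efficiency frontier: best score achievable at each size or smaller.

# (name, parameters in billions, HellaSwag score in %), copied from the leaderboard
MODELS = [
    ("Qwen2.5 Coder 0.5B", 0.494, 48.4),
    ("Phi 1 5", 1.4, 47.6),
    ("Gemma 2B", 2.5, 71.4),
    ("Phi 3 Mini 4k Instruct", 3.8, 76.7),
    ("Mistral 7B v0.1", 7.0, 81.0),
    ("Falcon 7B", 7.2, 78.1),
    ("Falcon 11B", 11.1, 82.9),
    ("Falcon 40B", 41.8, 85.3),
    ("Bloom", 176.2, 74.4),
    ("Llama 3.1 405B", 405.9, 89.2),
]


def efficiency_frontier(models):
    """Return the models that outscore every model of equal or smaller size."""
    frontier = []
    best_score = float("-inf")
    # Walk from smallest to largest; on equal size, consider the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, size, score))
            best_score = score
    return frontier


if __name__ == "__main__":
    for name, size, score in efficiency_frontier(MODELS):
        print(f"{size:>7.1f}B  {score:5.1f}%  {name}")
```

In this subset, Phi 1 5, Falcon 7B, and Bloom fall below the frontier because a smaller model already scores higher.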

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.