BIG-Bench Hard Leaderboard

Name: BIG-Bench Hard — open LLM scores
Creator: epoch

BIG-Bench Hard (BBH) is a 23-task subset of BIG-Bench, chosen because earlier language models failed to beat the average human rater on these tasks. The tasks cover multi-step reasoning, logic, and word problems. It is a long-standing standard for tracking reasoning ability across model generations.

Source: epoch · 37 open models ranked + 13 proprietary · Data through Dec 2024

Open models ranked on BIG-Bench Hard

The # column shows each model's rank among open models / its rank overall (including proprietary models). The figure after each model name is its parameter count.

#	Model	Score
1 / 2	DeepSeek v3 · 684.5B	87.5%
2 / 4	Llama 3.1 405B · 405.9B	82.9%
3 / 6	Qwen2.5 72B · 72.7B	79.8%
4 / 7	Phi 3 Small 8k Instruct · 7.4B	79.1%
5 / 8	DeepSeek v2 · 235.7B	78.8%
6 / 10	Phi 3 Mini 4k Instruct · 3.8B	71.7%
7 / 11	Yi 34B Chat · 34.4B	71.7%
8 / 12	StableBeluga2 · 70B	69.3%
9 / 13	Llama 2 70B HF · 69.0B	64.9%
10 / 15	Phi 2 · 2.8B	59.4%
11 / 17	Llama 2 70B Chat HF · 69.0B	58.5%
12 / 19	Llama 2 13B Chat HF · 13.0B	58.2%
13 / 20	Mistral 7B v0.1 · 7B	56.1%
14 / 21	Gemma 7B · 8.5B	55.1%
15 / 22	Qwen 14B Chat · 14.2B	55.0%
16 / 23	Yi 34B · 34.4B	54.3%
17 / 24	Qwen 14B · 14.2B	53.4%
18 / 25	Internlm 20B · 20B	52.5%
19 / 27	Baichuan2 13B Base · 13B	49.0%
20 / 28	Baichuan2 13B Chat · 13B	47.2%
21 / 29	Yi 6B Chat · 6.1B	47.2%
22 / 30	Llama 2 13B HF · 13.0B	47.0%
23 / 31	Qwen 7B · 7.7B	45.0%
24 / 34	Baichuan 13B Base · 13B	43.0%
25 / 35	Yi 6B · 6.1B	42.8%
26 / 36	Internlm Chat 20B · 20B	42.4%
27 / 37	Baichuan2 7B Base · 7B	41.6%
28 / 38	Llama 2 7B HF · 6.7B	39.2%
29 / 41	Falcon 40B · 41.8B	37.1%
30 / 42	Internlm 7B · 7B	37.0%
31 / 44	Gemma 2B · 2.5B	35.2%
32 / 45	INTELLECT 1 Instruct · 10.2B	34.8%
33 / 46	Chatglm2 6B · 6B	33.7%
34 / 47	Llama 7B · 6.7B	33.5%
35 / 48	Baichuan 7B · 7B	32.5%
36 / 49	Falcon 7B · 7.2B	28.8%
37 / 50	Qwen 1.8B · 1.8B	28.2%
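
The two numbers in the # column can be read together: since every model has a distinct rank, the gap between the overall rank and the open rank is the number of proprietary models that score higher. A minimal sketch of that arithmetic follows; the function name `proprietary_ahead` is hypothetical and chosen only for illustration.

```python
def proprietary_ahead(rank_cell: str) -> int:
    """Given a '#' cell such as '4 / 7', return how many proprietary
    models rank above this open model.

    Assumes ranks are strict (no two models share a rank), as in the table.
    """
    open_rank, overall_rank = (int(part) for part in rank_cell.split("/"))
    return overall_rank - open_rank


if __name__ == "__main__":
    for cell in ("1 / 2", "4 / 7", "37 / 50"):
        print(cell, "->", proprietary_ahead(cell))
```

For the last row the gap is 13, which matches the 13 proprietary models listed in the source line.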

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
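
The frontier described above can be sketched in a few lines: sort the models by size and keep each one that beats every smaller (or equally sized) model. This is an illustration under stated assumptions, not the site's own plotting code. It uses only the open models from the table, and the name `efficiency_frontier` is hypothetical.

```python
# (name, parameters in billions, BBH score in %) copied from the leaderboard table
MODELS = [
    ("DeepSeek v3", 684.5, 87.5),
    ("Llama 3.1 405B", 405.9, 82.9),
    ("Qwen2.5 72B", 72.7, 79.8),
    ("Phi 3 Small 8k Instruct", 7.4, 79.1),
    ("DeepSeek v2", 235.7, 78.8),
    ("Phi 3 Mini 4k Instruct", 3.8, 71.7),
    ("Yi 34B Chat", 34.4, 71.7),
    ("StableBeluga2", 70.0, 69.3),
    ("Llama 2 70B HF", 69.0, 64.9),
    ("Phi 2", 2.8, 59.4),
    ("Llama 2 70B Chat HF", 69.0, 58.5),
    ("Llama 2 13B Chat HF", 13.0, 58.2),
    ("Mistral 7B v0.1", 7.0, 56.1),
    ("Gemma 7B", 8.5, 55.1),
    ("Qwen 14B Chat", 14.2, 55.0),
    ("Yi 34B", 34.4, 54.3),
    ("Qwen 14B", 14.2, 53.4),
    ("Internlm 20B", 20.0, 52.5),
    ("Baichuan2 13B Base", 13.0, 49.0),
    ("Baichuan2 13B Chat", 13.0, 47.2),
    ("Yi 6B Chat", 6.1, 47.2),
    ("Llama 2 13B HF", 13.0, 47.0),
    ("Qwen 7B", 7.7, 45.0),
    ("Baichuan 13B Base", 13.0, 43.0),
    ("Yi 6B", 6.1, 42.8),
    ("Internlm Chat 20B", 20.0, 42.4),
    ("Baichuan2 7B Base", 7.0, 41.6),
    ("Llama 2 7B HF", 6.7, 39.2),
    ("Falcon 40B", 41.8, 37.1),
    ("Internlm 7B", 7.0, 37.0),
    ("Gemma 2B", 2.5, 35.2),
    ("INTELLECT 1 Instruct", 10.2, 34.8),
    ("Chatglm2 6B", 6.0, 33.7),
    ("Llama 7B", 6.7, 33.5),
    ("Baichuan 7B", 7.0, 32.5),
    ("Falcon 7B", 7.2, 28.8),
    ("Qwen 1.8B", 1.8, 28.2),
]


def efficiency_frontier(models):
    """Return the models that score higher than every model of equal or
    smaller size, ordered from smallest to largest."""
    frontier = []
    best_so_far = float("-inf")
    # Smallest first; at equal size, the higher score is considered first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append((name, size, score))
            best_so_far = score
    return frontier


if __name__ == "__main__":
    for name, size, score in efficiency_frontier(MODELS):
        print(f"{size:>6.1f}B  {score:>5.1f}%  {name}")
```

On the table's data this keeps eight models, from Qwen 1.8B at the small end to DeepSeek v3 at the large end. Phi 3 Mini and Phi 3 Small sit on the frontier well below 10B parameters.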

BIG-Bench Hard: frequently asked questions

What is the best open LLM on BIG-Bench Hard?

DeepSeek v3 is the top open model on BIG-Bench Hard, scoring 87.5%. Among all models tested, including proprietary ones, it ranks #2. The top model overall is Gemini 1.5 Pro 001 (Google DeepMind) at 89.2%.

What's the best BIG-Bench Hard model you can run on a 24 GB GPU?

Phi 3 Small 8k Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (its weights take about 4 GB), scoring 79.1% on BIG-Bench Hard.

What's the best BIG-Bench Hard model you can run on a 12 GB GPU?

Phi 3 Small 8k Instruct is the highest-scoring open model that fits in 12 GB at 4-bit quantization (its weights take about 4 GB), scoring 79.1% on BIG-Bench Hard.

Can open models match proprietary models on BIG-Bench Hard?

Not quite: the strongest proprietary model (Gemini 1.5 Pro 001) scores 89.2%, ahead of the best open model (DeepSeek v3) at 87.5%. The open model, however, is one you can run yourself.
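
The GPU answers above rest on a rough size estimate: at 4-bit quantization each parameter takes half a byte, so a 7.4B-parameter model needs about 3.7 GB for its weights. The sketch below reproduces that arithmetic and picks the highest-scoring model that fits a given budget. It makes a simplifying assumption: it counts weights only and ignores the KV cache, activations, and quantization metadata, so real memory use will be somewhat higher. The helper names `weight_gb` and `best_fit` are hypothetical.

```python
# (name, parameters in billions, BBH score in %): the seven highest-scoring
# open models from the leaderboard table. Lower-ranked rows score below 71.7%,
# so they cannot win whenever one of these fits.
TOP_MODELS = [
    ("DeepSeek v3", 684.5, 87.5),
    ("Llama 3.1 405B", 405.9, 82.9),
    ("Qwen2.5 72B", 72.7, 79.8),
    ("Phi 3 Small 8k Instruct", 7.4, 79.1),
    ("DeepSeek v2", 235.7, 78.8),
    ("Phi 3 Mini 4k Instruct", 3.8, 71.7),
    ("Yi 34B Chat", 34.4, 71.7),
]


def weight_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


def best_fit(vram_gb: float, models=TOP_MODELS, bits_per_weight: float = 4.0):
    """Return the highest-scoring (name, params, score) tuple whose weights
    fit in vram_gb, or None if nothing fits."""
    fitting = [m for m in models if weight_gb(m[1], bits_per_weight) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


if __name__ == "__main__":
    for budget in (24, 12):
        name, params, score = best_fit(budget)
        print(f"{budget} GB: {name} ({weight_gb(params):.1f} GB weights, {score}%)")
```

Under this estimate Yi 34B Chat (about 17 GB) also fits in 24 GB, but it scores lower than Phi 3 Small, while Qwen2.5 72B (about 36 GB) does not fit.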

Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the "about benchmarks" page for what it measures.