What is the best open LLM on BIG-Bench Hard?

DeepSeek v3 is the top open model on BIG-Bench Hard, scoring 87.5%. Among all models tested — including proprietary ones — it ranks #2. The top model overall is Gemini 1.5 Pro 001 (Google DeepMind) at 89.2%.

What's the best BIG-Bench Hard model you can run on a 24 GB GPU?

Phi 3 Small 8k Instruct (7.4B parameters) is the highest-scoring open model that fits in 24 GB at 4-bit quantization, where its weights take about 4 GB. It scores 79.1% on BIG-Bench Hard.

What's the best BIG-Bench Hard model you can run on a 12 GB GPU?

The answer is the same as for 24 GB: Phi 3 Small 8k Instruct (7.4B parameters) is the highest-scoring open model that fits in 12 GB at 4-bit quantization, where its weights take about 4 GB. It scores 79.1% on BIG-Bench Hard.
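
The "about 4 GB" figure follows from simple arithmetic: at 4 bits per weight, each parameter takes half a byte, so 7.4B parameters need roughly 3.7 GB for the weights alone. A minimal sketch of that estimate is below. The function names and the 2 GB allowance for KV cache and runtime overhead are illustrative assumptions, not the method this leaderboard uses.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    1 billion parameters at 8 bits each is 1 GB, so scale by bits / 8.
    """
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, vram_gb: float,
         bits_per_weight: int = 4, overhead_gb: float = 2.0) -> bool:
    """Rough check: weights plus an assumed fixed overhead must fit in VRAM.

    overhead_gb is a hypothetical allowance for KV cache and runtime buffers;
    real usage depends on context length and the inference runtime.
    """
    return weight_memory_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


# Parameter counts taken from the leaderboard table.
for name, size in [("Phi 3 Small 8k Instruct", 7.4),
                   ("Yi 34B Chat", 34.4),
                   ("Qwen2.5 72B", 72.7)]:
    print(f"{name}: ~{weight_memory_gb(size):.1f} GB of weights, "
          f"fits 12 GB: {fits(size, 12)}, fits 24 GB: {fits(size, 24)}")
```

Under these assumptions, Qwen2.5 72B (about 36 GB of weights) is too large for a 24 GB card, which is consistent with the much smaller Phi 3 Small being the answer for both GPU sizes.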

Can open models match proprietary models on BIG-Bench Hard?

Not quite: the strongest proprietary model (Gemini 1.5 Pro 001) scores 89.2%, 1.7 points ahead of the best open model (DeepSeek v3) at 87.5%. The open model, however, is one you can run yourself.

BIG-Bench Hard Leaderboard

Name: BIG-Bench Hard — open LLM scores
Creator: epoch

BIG-Bench Hard (BBH) is the 23-task subset of BIG-Bench that earlier models found hardest — multi-step reasoning, logic, and word problems. It's a long-standing standard for tracking reasoning ability across model generations.

Source: epoch · 37 open models ranked + 13 proprietary · Data through Dec 2024

All models ranked on BIG-Bench Hard

Proprietary (closed) models are marked "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.

#	Model	Score
1	Gemini 1.5 Pro 001 · proprietary	89.2%
2	DeepSeek v3 · 684.5B	87.5%
3	Gemini 1.5 Pro 001 Feb24 · proprietary	84.0%
4	Llama 3.1 405B · 405.9B	82.9%
5	Phi 3 Medium 128K Instruct · proprietary	81.4%
6	Qwen2.5 72B · 72.7B	79.8%
7	Phi 3 Small 8k Instruct · 7.4B	79.1%
8	DeepSeek v2 · 235.7B	78.8%
9	GPT 4 (Jun 13) · proprietary	75.1%
10	Phi 3 Mini 4k Instruct · 3.8B	71.7%
11	Yi 34B Chat · 34.4B	71.7%
12	StableBeluga2 · 70B	69.3%
13	Llama 2 70B HF · 69.0B	64.9%
14	GPT 3.5 Turbo (Jun 13) · proprietary	61.6%
15	Phi 2 · 2.8B	59.4%
16	Nemotron 4 15B · proprietary	58.7%
17	Llama 2 70B Chat HF · 69.0B	58.5%
18	Llama 65B · proprietary	58.4%
19	Llama 2 13B Chat HF · 13.0B	58.2%
20	Mistral 7B v0.1 · 7B	56.1%
21	Gemma 7B · 8.5B	55.1%
22	Qwen 14B Chat · 14.2B	55.0%
23	Yi 34B · 34.4B	54.3%
24	Qwen 14B · 14.2B	53.4%
25	Internlm 20B · 20B	52.5%
26	Llama 33B · proprietary	50.0%
27	Baichuan2 13B Base · 13B	49.0%
28	Baichuan2 13B Chat · 13B	47.2%
29	Yi 6B Chat · 6.1B	47.2%
30	Llama 2 13B HF · 13.0B	47.0%
31	Qwen 7B · 7.7B	45.0%
32	Llama 2 34B · proprietary	44.1%
33	Vicuna 13B v1.1 · proprietary	43.0%
34	Baichuan 13B Base · 13B	43.0%
35	Yi 6B · 6.1B	42.8%
36	Internlm Chat 20B · 20B	42.4%
37	Baichuan2 7B Base · 7B	41.6%
38	Llama 2 7B HF · 6.7B	39.2%
39	Mpt 30B · proprietary	38.0%
40	Llama 13B · proprietary	37.9%
41	Falcon 40B · 41.8B	37.1%
42	Internlm 7B · 7B	37.0%
43	Mpt 7B · proprietary	35.6%
44	Gemma 2B · 2.5B	35.2%
45	INTELLECT 1 Instruct · 10.2B	34.8%
46	Chatglm2 6B · 6B	33.7%
47	Llama 7B · 6.7B	33.5%
48	Baichuan 7B · 7B	32.5%
49	Falcon 7B · 7.2B	28.8%
50	Qwen 1 8B · 1.8B	28.2%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
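
The efficiency frontier described above can be sketched in a few lines: sort models by size, then keep each model whose score beats every smaller model. The example below uses a subset of the open-model rows from the table; the function name `efficiency_frontier` and the tuple layout are illustrative assumptions, not the site's actual implementation.

```python
def efficiency_frontier(models):
    """Return the models that set a new best score at their size or smaller.

    models: iterable of (name, params_billions, score_percent) tuples.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by size ascending; among equal sizes, try the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, size, score))
            best_score = score
    return frontier


# A subset of open-model rows from the leaderboard table.
models = [
    ("Qwen 1 8B", 1.8, 28.2),
    ("Gemma 2B", 2.5, 35.2),
    ("Phi 2", 2.8, 59.4),
    ("Phi 3 Mini 4k Instruct", 3.8, 71.7),
    ("Mistral 7B v0.1", 7.0, 56.1),
    ("Phi 3 Small 8k Instruct", 7.4, 79.1),
    ("Yi 34B Chat", 34.4, 71.7),
    ("Qwen2.5 72B", 72.7, 79.8),
    ("DeepSeek v2", 235.7, 78.8),
    ("Llama 3.1 405B", 405.9, 82.9),
    ("DeepSeek v3", 684.5, 87.5),
]

for name, size, score in efficiency_frontier(models):
    print(f"{size:>6.1f}B  {score:>5.1f}%  {name}")
```

In this subset, Mistral 7B v0.1, Yi 34B Chat, and DeepSeek v2 fall below the frontier because a smaller model in the list already scores at least as high.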

Scores are aggregated from epoch. llmrun does not run this benchmark itself. See the source for methodology, or the "about benchmarks" page for what it measures.