
BIG-Bench Hard Leaderboard

BIG-Bench Hard (BBH) is the 23-task subset of BIG-Bench that earlier models found hardest — multi-step reasoning, logic, and word problems. It's a long-standing standard for tracking reasoning ability across model generations.

Source: epoch · 11 open models ranked + 39 proprietary · Data through Dec 2024

All models ranked on BIG-Bench Hard

Proprietary / closed models are shown dimmed — you can't run them locally, but they show where the open field stands.

# | Model | Score
1 | gemini-1.5-pro-001 · proprietary | 89.2%
2 | DeepSeek-V3 · proprietary | 87.5%
3 | gemini-1.5-pro-001-feb24 · proprietary | 84.0%
4 | Llama 3.1 405B · 405B | 82.9%
5 | Phi-3-medium-128k-instruct · proprietary | 81.4%
6 | Qwen2.5-72B · proprietary | 79.8%
7 | Phi-3-small-8k-instruct · proprietary | 79.1%
8 | DeepSeek-V2 · proprietary | 78.8%
9 | gpt-4-0613 · proprietary | 75.1%
10 | Phi 3 Mini 4k Instruct · 3.8B | 71.7%
11 | Yi-34B-Chat · proprietary | 71.7%
12 | StableBeluga2 · proprietary | 69.3%
13 | gpt-3.5-turbo-0613 · proprietary | 61.6%
14 | Phi 2 · 2.8B | 59.4%
15 | Nemotron-4 15B · proprietary | 58.7%
16 | Llama 2 70B Chat · 70B | 58.5%
17 | Llama-2-13b-chat · proprietary | 58.2%
18 | Gemma 7B · 8.5B | 55.1%
19 | Qwen-14B-Chat · proprietary | 55.0%
20 | Yi-34B · proprietary | 54.3%
21 | Qwen-14B · proprietary | 53.4%
22 | internlm-20b · proprietary | 52.5%
23 | Llama 2 70B HF · 69.0B | 51.2%
24 | Baichuan-2-13B-Base · proprietary | 49.0%
25 | Baichuan2-13B-Chat · proprietary | 47.2%
26 | Yi-6B-Chat · proprietary | 47.2%
27 | Qwen-7B · proprietary | 45.0%
28 | Llama-2-34b · proprietary | 44.1%
29 | LLaMA-65B · proprietary | 43.5%
30 | vicuna-13b-v1.1 · proprietary | 43.0%
31 | Baichuan-13B-Base · proprietary | 43.0%
32 | Yi-6B · proprietary | 42.8%
33 | Baichuan-2-7B-Base · proprietary | 41.6%
34 | LLaMA-33B · proprietary | 39.8%
35 | Mistral 7B v0.1 · 7B | 39.5%
36 | Llama-2-13b · proprietary | 39.4%
37 | mpt-30b · proprietary | 38.0%
38 | falcon-40b · proprietary | 37.1%
39 | internlm-7b · proprietary | 37.0%
40 | LLaMA-13B · proprietary | 37.0%
41 | internlm-chat-20b · proprietary | 36.7%
42 | Gemma 2B · 2.5B | 35.2%
43 | INTELLECT-1-Instruct · proprietary | 34.8%
44 | chatglm2-6b · proprietary | 33.7%
45 | Llama 2 7B · 7B | 32.6%
46 | Baichuan-7B · proprietary | 32.5%
47 | mpt-7b · proprietary | 31.0%
48 | Llama 7B · 6.7B | 30.3%
49 | Qwen-1_8B · proprietary | 28.2%
50 | Falcon 7B · 7.2B | 28.0%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: BIG-Bench Hard score (28.0% to 82.9%) against model size (log scale, ticks at 10B and 100B) for the 11 open models.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Gemma 2B, 3B, score 35.2% — on the efficiency frontier (best score at its size or smaller).
  • Phi 2, 3B, score 59.4% — on the efficiency frontier (best score at its size or smaller).
  • Phi 3 Mini 4k Instruct, 4B, score 71.7% — on the efficiency frontier (best score at its size or smaller).
  • Llama 3.1 405B, 405B, score 82.9% — on the efficiency frontier (best score at its size or smaller).
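
The frontier rule described above (best score at each size or smaller) can be sketched in a few lines. This is a minimal illustration, not the site's actual code: the `MODELS` list and the `efficiency_frontier` function are hypothetical names, and the sizes are the parameter counts shown in the ranking table rather than the rounded values on the chart.

```python
# (name, size in billions of parameters, BBH score in %) for the 11 open models.
# Values are copied from the ranking table; the structure itself is an assumption.
MODELS = [
    ("Llama 3.1 405B", 405.0, 82.9),
    ("Phi 3 Mini 4k Instruct", 3.8, 71.7),
    ("Phi 2", 2.8, 59.4),
    ("Llama 2 70B Chat", 70.0, 58.5),
    ("Gemma 7B", 8.5, 55.1),
    ("Llama 2 70B HF", 69.0, 51.2),
    ("Mistral 7B v0.1", 7.0, 39.5),
    ("Gemma 2B", 2.5, 35.2),
    ("Llama 2 7B", 7.0, 32.6),
    ("Llama 7B", 6.7, 30.3),
    ("Falcon 7B", 7.2, 28.0),
]


def efficiency_frontier(models):
    """Return the models that score higher than every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; on a size tie, visit the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append((name, size, score))
            best_so_far = score
    return frontier


for name, size, score in efficiency_frontier(MODELS):
    print(f"{name}: {size}B, {score}%")
```

With the table's data this yields the same four models listed above: Gemma 2B, Phi 2, Phi 3 Mini 4k Instruct, and Llama 3.1 405B.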

BIG-Bench Hard: frequently asked questions

What is the best open LLM on BIG-Bench Hard?
Llama 3.1 405B is the top open model on BIG-Bench Hard, scoring 82.9%. Among all models tested — including proprietary ones — it ranks #4.
What's the best BIG-Bench Hard model you can run on a 24 GB GPU?
Phi 3 Mini 4k Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 2 GB), scoring 71.7% on BIG-Bench Hard.
What's the best BIG-Bench Hard model you can run on a 12 GB GPU?
Phi 3 Mini 4k Instruct is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 2 GB), scoring 71.7% on BIG-Bench Hard.
Can open models match proprietary models on BIG-Bench Hard?
Not quite: the strongest proprietary model (gemini-1.5-pro-001) scores 89.2%, 6.3 points ahead of the best open model (Llama 3.1 405B) at 82.9%. The open model, however, is one you can run yourself.
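
The 24 GB and 12 GB answers above rest on a size estimate at 4-bit quantization. A rough way to reproduce them, assuming memory is dominated by the weights at 0.5 bytes per parameter (KV cache, activations, and runtime overhead are ignored, and the site's exact method is not stated), is sketched below. `OPEN_MODELS`, `weight_memory_gb`, and `best_fit` are hypothetical names used only for illustration.

```python
# (name, size in billions of parameters, BBH score in %), copied from the ranking table.
OPEN_MODELS = [
    ("Llama 3.1 405B", 405.0, 82.9),
    ("Phi 3 Mini 4k Instruct", 3.8, 71.7),
    ("Phi 2", 2.8, 59.4),
    ("Llama 2 70B Chat", 70.0, 58.5),
    ("Gemma 7B", 8.5, 55.1),
    ("Llama 2 70B HF", 69.0, 51.2),
    ("Mistral 7B v0.1", 7.0, 39.5),
    ("Gemma 2B", 2.5, 35.2),
    ("Llama 2 7B", 7.0, 32.6),
    ("Llama 7B", 6.7, 30.3),
    ("Falcon 7B", 7.2, 28.0),
]


def weight_memory_gb(params_billions, bits_per_weight=4):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # 1e9 parameters at `bits_per_weight` bits each is bits_per_weight / 8 GB.
    return params_billions * bits_per_weight / 8


def best_fit(models, vram_gb, bits_per_weight=4):
    """Return the highest-scoring model whose weights fit in `vram_gb`, or None."""
    fitting = [m for m in models if weight_memory_gb(m[1], bits_per_weight) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


for vram in (24, 12):
    name, size, score = best_fit(OPEN_MODELS, vram)
    print(f"{vram} GB: {name} (~{weight_memory_gb(size):.1f} GB of weights, {score}%)")
```

Under this estimate Phi 3 Mini 4k Instruct needs about 1.9 GB, while Llama 2 70B Chat needs about 35 GB and Llama 3.1 405B about 200 GB, which is why the same model wins at both GPU sizes.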

Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.