
BIG-Bench Hard Leaderboard

BIG-Bench Hard (BBH) is the 23-task subset of BIG-Bench that earlier models found hardest — multi-step reasoning, logic, and word problems. It's a long-standing standard for tracking reasoning ability across model generations.

Source: epoch · 11 open models ranked + 39 proprietary · Data through Dec 2024

Open models ranked on BIG-Bench Hard

The # column shows rank among open models / rank overall (including proprietary models).

#          Model                              Score
1 / 4      Llama 3.1 405B · 405B              82.9%
2 / 10     Phi 3 Mini 4k Instruct · 3.8B      71.7%
3 / 14     Phi 2 · 2.8B                       59.4%
4 / 16     Llama 2 70B Chat · 70B             58.5%
5 / 18     Gemma 7B · 8.5B                    55.1%
6 / 23     Llama 2 70B HF · 69.0B             51.2%
7 / 35     Mistral 7B v0.1 · 7B               39.5%
8 / 42     Gemma 2B · 2.5B                    35.2%
9 / 45     Llama 2 7B · 7B                    32.6%
10 / 48    Llama 7B · 6.7B                    30.3%
11 / 50    Falcon 7B · 7.2B                   28.0%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: BIG-Bench Hard score (y-axis, 28.0% to 82.9%) against model size (x-axis, log scale, ticks at 10B and 100B), one labeled dot per model.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Gemma 2B, 3B, score 35.2% — on the efficiency frontier (best score at its size or smaller).
  • Phi 2, 3B, score 59.4% — on the efficiency frontier (best score at its size or smaller).
  • Phi 3 Mini 4k Instruct, 4B, score 71.7% — on the efficiency frontier (best score at its size or smaller).
  • Llama 3.1 405B, 405B, score 82.9% — on the efficiency frontier (best score at its size or smaller).
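The frontier rule described above (best score at each size or smaller) can be sketched in a few lines of Python. This is an illustrative sketch, not the site's actual code: the `MODELS` list and the `efficiency_frontier` function are hypothetical names, and the sizes and scores are copied from the leaderboard.

```python
# (name, size in billions of parameters, BBH score in %), copied from the leaderboard.
MODELS = [
    ("Llama 3.1 405B", 405.0, 82.9),
    ("Phi 3 Mini 4k Instruct", 3.8, 71.7),
    ("Phi 2", 2.8, 59.4),
    ("Llama 2 70B Chat", 70.0, 58.5),
    ("Gemma 7B", 8.5, 55.1),
    ("Llama 2 70B HF", 69.0, 51.2),
    ("Mistral 7B v0.1", 7.0, 39.5),
    ("Gemma 2B", 2.5, 35.2),
    ("Llama 2 7B", 7.0, 32.6),
    ("Llama 7B", 6.7, 30.3),
    ("Falcon 7B", 7.2, 28.0),
]


def efficiency_frontier(models):
    """Return names of models that beat every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; on a size tie, look at the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append(name)
            best_so_far = score
    return frontier


if __name__ == "__main__":
    for name in efficiency_frontier(MODELS):
        print(name)
```

Run on the leaderboard data, this sketch selects the same four models listed above. A model drops off the frontier as soon as any smaller or equally sized model scores at least as high.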

BIG-Bench Hard: frequently asked questions

What is the best open LLM on BIG-Bench Hard?
Llama 3.1 405B is the top open model on BIG-Bench Hard, scoring 82.9%. Among all models tested — including proprietary ones — it ranks #4.
What's the best BIG-Bench Hard model you can run on a 24 GB GPU?
Phi 3 Mini 4k Instruct is the highest-scoring open model that fits in 24 GB. At 4-bit quantization its weights take about 2 GB, and it scores 71.7% on BIG-Bench Hard.
What's the best BIG-Bench Hard model you can run on a 12 GB GPU?
Phi 3 Mini 4k Instruct is the highest-scoring open model that fits in 12 GB. At 4-bit quantization its weights take about 2 GB, and it scores 71.7% on BIG-Bench Hard.
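The two GPU answers above can be reproduced with a rough weights-only estimate: parameters times bits per weight, divided by 8 to get bytes. The sketch below rests on that assumption and ignores the KV cache, activations, and runtime overhead, so real memory use will be higher. The names `weight_gb` and `best_fit` are hypothetical, and this is not necessarily how the site computes its figures.

```python
# (name, size in billions of parameters, BBH score in %), copied from the leaderboard.
MODELS = [
    ("Llama 3.1 405B", 405.0, 82.9),
    ("Phi 3 Mini 4k Instruct", 3.8, 71.7),
    ("Phi 2", 2.8, 59.4),
    ("Llama 2 70B Chat", 70.0, 58.5),
    ("Gemma 7B", 8.5, 55.1),
    ("Llama 2 70B HF", 69.0, 51.2),
    ("Mistral 7B v0.1", 7.0, 39.5),
    ("Gemma 2B", 2.5, 35.2),
    ("Llama 2 7B", 7.0, 32.6),
    ("Llama 7B", 6.7, 30.3),
    ("Falcon 7B", 7.2, 28.0),
]


def weight_gb(params_billions, bits=4):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # 1e9 parameters * (bits / 8) bytes each = (bits / 8) GB per billion parameters.
    return params_billions * bits / 8


def best_fit(models, vram_gb, bits=4):
    """Return the highest-scoring model whose weights fit in vram_gb, or None."""
    fitting = [m for m in models if weight_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


if __name__ == "__main__":
    for vram in (24, 12):
        name, size, score = best_fit(MODELS, vram)
        print(f"{vram} GB: {name} (~{weight_gb(size):.1f} GB at 4-bit, {score}%)")
```

Under this estimate Phi 3 Mini 4k Instruct needs roughly 1.9 GB, which matches the "about 2 GB" figure, and both 70B-class Llama 2 models fall outside a 24 GB budget at around 35 GB.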
Can open models match proprietary models on BIG-Bench Hard?
Not quite. The strongest proprietary model (gemini-1.5-pro-001) scores 89.2%, 6.3 points ahead of the best open model (Llama 3.1 405B) at 82.9%, but you can run the open one yourself.

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.