BIG-Bench Hard Leaderboard
BIG-Bench Hard (BBH) is the 23-task subset of BIG-Bench that earlier models found hardest — multi-step reasoning, logic, and word problems. It's a long-standing standard for tracking reasoning ability across model generations.
Source: epoch · 11 open models ranked + 39 proprietary · Data through Dec 2024
Open models ranked on BIG-Bench Hard
The # column shows rank among open models / rank overall (including proprietary models).
| # | Model | Score |
|---|---|---|
| 1 / 4 | Llama 3.1 405B · 405B | 82.9% |
| 2 / 10 | Phi 3 Mini 4k Instruct · 3.8B | 71.7% |
| 3 / 14 | Phi 2 · 2.8B | 59.4% |
| 4 / 16 | Llama 2 70B Chat · 70B | 58.5% |
| 5 / 18 | Gemma 7B · 8.5B | 55.1% |
| 6 / 23 | Llama 2 70B HF · 69.0B | 51.2% |
| 7 / 35 | Mistral 7B v0.1 · 7B | 39.5% |
| 8 / 42 | Gemma 2B · 2.5B | 35.2% |
| 9 / 45 | Llama 2 7B · 7B | 32.6% |
| 10 / 48 | Llama 7B · 6.7B | 30.3% |
| 11 / 50 | Falcon 7B · 7.2B | 28.0% |
Score vs model size
These models give the highest score for their size, which makes them the ones worth running locally. A model is on the efficiency frontier when no model of the same size or smaller scores higher.
- Gemma 2B (2.5B), score 35.2%: on the efficiency frontier.
- Phi 2 (2.8B), score 59.4%: on the efficiency frontier.
- Phi 3 Mini 4k Instruct (3.8B), score 71.7%: on the efficiency frontier.
- Llama 3.1 405B (405B), score 82.9%: on the efficiency frontier.
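The frontier rule (best score at its size or smaller) can be sketched in a few lines of Python. This is an illustrative sketch, not the site's actual code: the `MODELS` list is copied from the table above, and the function name `efficiency_frontier` is hypothetical. It assumes a model qualifies only if it scores strictly higher than every model of equal or smaller size.

```python
# (name, parameters in billions, BBH score in %), copied from the table above.
MODELS = [
    ("Llama 3.1 405B", 405.0, 82.9),
    ("Phi 3 Mini 4k Instruct", 3.8, 71.7),
    ("Phi 2", 2.8, 59.4),
    ("Llama 2 70B Chat", 70.0, 58.5),
    ("Gemma 7B", 8.5, 55.1),
    ("Llama 2 70B HF", 69.0, 51.2),
    ("Mistral 7B v0.1", 7.0, 39.5),
    ("Gemma 2B", 2.5, 35.2),
    ("Llama 2 7B", 7.0, 32.6),
    ("Llama 7B", 6.7, 30.3),
    ("Falcon 7B", 7.2, 28.0),
]


def efficiency_frontier(models):
    """Return names of models that beat every model of equal or smaller size.

    Hypothetical helper: walks the models from smallest to largest and keeps
    a model only when its score exceeds the best score seen so far.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by size ascending; among equal sizes, put the higher score first.
    for name, size_b, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


if __name__ == "__main__":
    for name in efficiency_frontier(MODELS):
        print(name)
```

Run on the table data, this picks out the same four models listed above. Larger models such as Llama 2 70B Chat drop out because Phi 2 and Phi 3 Mini already score higher at a fraction of the size.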
BIG-Bench Hard: frequently asked questions
- What is the best open LLM on BIG-Bench Hard?
  - Llama 3.1 405B is the top open model on BIG-Bench Hard, scoring 82.9%. Among all models tested, including proprietary ones, it ranks #4.
- What's the best BIG-Bench Hard model you can run on a 24 GB GPU?
  - Phi 3 Mini 4k Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 2 GB), scoring 71.7% on BIG-Bench Hard.
- What's the best BIG-Bench Hard model you can run on a 12 GB GPU?
  - Phi 3 Mini 4k Instruct is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 2 GB), scoring 71.7% on BIG-Bench Hard.
- Can open models match proprietary models on BIG-Bench Hard?
  - Not quite: the strongest proprietary model (gemini-1.5-pro-001) scores 89.2%, ahead of the best open model (Llama 3.1 405B) at 82.9%. The open model, however, is one you can run yourself.
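The "fits in 24 GB at 4-bit quantization (about 2 GB)" figures in the answers above can be approximated with simple arithmetic. The sketch below assumes the weights take `bits / 8` bytes per parameter and ignores the KV cache, activations, and runtime overhead, so real memory use will be higher. The helper names `weight_gb` and `best_fit` are hypothetical, and the data is copied from the table above.

```python
# (name, parameters in billions, BBH score in %), copied from the table above.
MODELS = [
    ("Llama 3.1 405B", 405.0, 82.9),
    ("Phi 3 Mini 4k Instruct", 3.8, 71.7),
    ("Phi 2", 2.8, 59.4),
    ("Llama 2 70B Chat", 70.0, 58.5),
    ("Gemma 7B", 8.5, 55.1),
    ("Llama 2 70B HF", 69.0, 51.2),
    ("Mistral 7B v0.1", 7.0, 39.5),
    ("Gemma 2B", 2.5, 35.2),
    ("Llama 2 7B", 7.0, 32.6),
    ("Llama 7B", 6.7, 30.3),
    ("Falcon 7B", 7.2, 28.0),
]


def weight_gb(params_billion, bits=4):
    """Approximate weight size in GB: parameters * bits / 8 bytes each.

    Assumption: weights only, with 1 GB taken as 10**9 bytes.
    """
    return params_billion * bits / 8


def best_fit(models, vram_gb, bits=4):
    """Return the highest-scoring model whose weights fit in vram_gb, or None."""
    fitting = [m for m in models if weight_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)


if __name__ == "__main__":
    for budget in (12, 24):
        name, size_b, score = best_fit(MODELS, budget)
        print(f"{budget} GB: {name} ({weight_gb(size_b):.1f} GB of weights, {score}%)")
```

Under these assumptions Phi 3 Mini 4k Instruct needs about 1.9 GB of weights. It wins at both budgets because the only higher-scoring open model, Llama 3.1 405B, needs roughly 200 GB at 4 bits.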
Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.