BIG-Bench Hard Leaderboard
BIG-Bench Hard (BBH) is the 23-task subset of BIG-Bench that earlier models found hardest — multi-step reasoning, logic, and word problems. It's a long-standing standard for tracking reasoning ability across model generations.
Source: epoch · 11 open models ranked + 39 proprietary · Data through Dec 2024
All models ranked on BIG-Bench Hard
Proprietary / closed models are marked "proprietary" in the Model column in place of a parameter count. You can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | gemini-1.5-pro-001 · proprietary | 89.2% |
| 2 | DeepSeek-V3 · proprietary | 87.5% |
| 3 | gemini-1.5-pro-001-feb24 · proprietary | 84.0% |
| 4 | Llama 3.1 405B · 405B | 82.9% |
| 5 | Phi-3-medium-128k-instruct · proprietary | 81.4% |
| 6 | Qwen2.5-72B · proprietary | 79.8% |
| 7 | Phi-3-small-8k-instruct · proprietary | 79.1% |
| 8 | DeepSeek-V2 · proprietary | 78.8% |
| 9 | gpt-4-0613 · proprietary | 75.1% |
| 10 | Phi 3 Mini 4k Instruct · 3.8B | 71.7% |
| 11 | Yi-34B-Chat · proprietary | 71.7% |
| 12 | StableBeluga2 · proprietary | 69.3% |
| 13 | gpt-3.5-turbo-0613 · proprietary | 61.6% |
| 14 | Phi 2 · 2.8B | 59.4% |
| 15 | Nemotron-4 15B · proprietary | 58.7% |
| 16 | Llama 2 70B Chat · 70B | 58.5% |
| 17 | Llama-2-13b-chat · proprietary | 58.2% |
| 18 | Gemma 7B · 8.5B | 55.1% |
| 19 | Qwen-14B-Chat · proprietary | 55.0% |
| 20 | Yi-34B · proprietary | 54.3% |
| 21 | Qwen-14B · proprietary | 53.4% |
| 22 | internlm-20b · proprietary | 52.5% |
| 23 | Llama 2 70B HF · 69.0B | 51.2% |
| 24 | Baichuan-2-13B-Base · proprietary | 49.0% |
| 25 | Baichuan2-13B-Chat · proprietary | 47.2% |
| 26 | Yi-6B-Chat · proprietary | 47.2% |
| 27 | Qwen-7B · proprietary | 45.0% |
| 28 | Llama-2-34b · proprietary | 44.1% |
| 29 | LLaMA-65B · proprietary | 43.5% |
| 30 | vicuna-13b-v1.1 · proprietary | 43.0% |
| 31 | Baichuan-13B-Base · proprietary | 43.0% |
| 32 | Yi-6B · proprietary | 42.8% |
| 33 | Baichuan-2-7B-Base · proprietary | 41.6% |
| 34 | LLaMA-33B · proprietary | 39.8% |
| 35 | Mistral 7B v0.1 · 7B | 39.5% |
| 36 | Llama-2-13b · proprietary | 39.4% |
| 37 | mpt-30b · proprietary | 38.0% |
| 38 | falcon-40b · proprietary | 37.1% |
| 39 | internlm-7b · proprietary | 37.0% |
| 40 | LLaMA-13B · proprietary | 37.0% |
| 41 | internlm-chat-20b · proprietary | 36.7% |
| 42 | Gemma 2B · 2.5B | 35.2% |
| 43 | INTELLECT-1-Instruct · proprietary | 34.8% |
| 44 | chatglm2-6b · proprietary | 33.7% |
| 45 | Llama 2 7B · 7B | 32.6% |
| 46 | Baichuan-7B · proprietary | 32.5% |
| 47 | mpt-7b · proprietary | 31.0% |
| 48 | Llama 7B · 6.7B | 30.3% |
| 49 | Qwen-1_8B · proprietary | 28.2% |
| 50 | Falcon 7B · 7.2B | 28.0% |
Score vs model size
Which models give the most quality for their size — the ones worth running locally.
- Gemma 2B, 2.5B parameters, score 35.2%: on the efficiency frontier (best score at its size or smaller).
- Phi 2, 2.8B parameters, score 59.4%: on the efficiency frontier (best score at its size or smaller).
- Phi 3 Mini 4k Instruct, 3.8B parameters, score 71.7%: on the efficiency frontier (best score at its size or smaller).
- Llama 3.1 405B, 405B parameters, score 82.9%: on the efficiency frontier (best score at its size or smaller).
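The "best score at its size or smaller" rule can be checked against the open models in the table. The sketch below is one plausible reading of that rule, not the site's actual method: it takes the parameter counts and scores from the table, sorts by size, and keeps a model only if it strictly beats every smaller or equal-sized model. The `Model` type, the `efficiency_frontier` function, and the strict tie-breaking are assumptions made for illustration.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions, as listed in the table
    score: float     # BIG-Bench Hard score in percent


# The 11 open models that have a parameter count in the table above.
OPEN_MODELS = [
    Model("Llama 3.1 405B", 405.0, 82.9),
    Model("Phi 3 Mini 4k Instruct", 3.8, 71.7),
    Model("Phi 2", 2.8, 59.4),
    Model("Llama 2 70B Chat", 70.0, 58.5),
    Model("Gemma 7B", 8.5, 55.1),
    Model("Llama 2 70B HF", 69.0, 51.2),
    Model("Mistral 7B v0.1", 7.0, 39.5),
    Model("Gemma 2B", 2.5, 35.2),
    Model("Llama 2 7B", 7.0, 32.6),
    Model("Llama 7B", 6.7, 30.3),
    Model("Falcon 7B", 7.2, 28.0),
]


def efficiency_frontier(models):
    """Return models that outscore every model of the same size or smaller.

    Assumption: a tie in score with a smaller model does not count as
    being on the frontier (strict improvement is required).
    """
    frontier = []
    best_so_far = float("-inf")
    # Smallest first; at equal size, the higher score is considered first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


if __name__ == "__main__":
    for m in efficiency_frontier(OPEN_MODELS):
        print(f"{m.name}: {m.params_b}B, {m.score}%")
```

Under these assumptions the function returns the same four models listed above: Gemma 2B, Phi 2, Phi 3 Mini 4k Instruct, and Llama 3.1 405B.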
BIG-Bench Hard: frequently asked questions
- What is the best open LLM on BIG-Bench Hard?
  - Llama 3.1 405B is the top open model on BIG-Bench Hard, scoring 82.9%. Among all models tested, including proprietary ones, it ranks #4.
- What's the best BIG-Bench Hard model you can run on a 24 GB GPU?
  - Phi 3 Mini 4k Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 2 GB of weights), scoring 71.7% on BIG-Bench Hard.
- What's the best BIG-Bench Hard model you can run on a 12 GB GPU?
  - Phi 3 Mini 4k Instruct is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 2 GB of weights), scoring 71.7% on BIG-Bench Hard.
- Can open models match proprietary models on BIG-Bench Hard?
  - Not quite: the strongest proprietary model (gemini-1.5-pro-001) scores 89.2%, 6.3 points ahead of the best open model (Llama 3.1 405B) at 82.9%. The open one, however, is a model you can run yourself.
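The "about 2 GB" figure in the GPU answers above follows from simple arithmetic: 4 bits is half a byte per weight, so a 3.8B-parameter model needs roughly 1.9 GB for its weights. The sketch below shows that estimate. The function names are hypothetical, and the calculation deliberately ignores the KV cache, activations, and runtime overhead, so real memory use will be higher.

```python
def quantized_weight_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    """Rough weight-only memory estimate in GB (1 GB = 1e9 bytes).

    Assumption: every weight is stored at exactly `bits_per_weight` bits,
    with no extra space for quantization scales, KV cache, or activations.
    """
    bytes_per_weight = bits_per_weight / 8
    # (params_billions * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billions * bytes_per_weight


def fits_in_vram(params_billions: float, vram_gb: float, bits_per_weight: int = 4) -> bool:
    """True if the weight-only estimate is within the given VRAM budget."""
    return quantized_weight_gb(params_billions, bits_per_weight) <= vram_gb


if __name__ == "__main__":
    # Sizes taken from the leaderboard table.
    for name, size in [("Phi 3 Mini 4k Instruct", 3.8),
                       ("Llama 2 70B Chat", 70.0),
                       ("Llama 3.1 405B", 405.0)]:
        gb = quantized_weight_gb(size)
        print(f"{name}: ~{gb:.1f} GB at 4-bit, fits in 24 GB: {fits_in_vram(size, 24)}")
```

By this estimate the 70B and 405B models exceed a 24 GB budget on weights alone, which is consistent with Phi 3 Mini 4k Instruct being the answer for both the 12 GB and 24 GB questions.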
Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the "about benchmarks" page for what it measures.