Math
GSM8K Leaderboard
GSM8K is a classic set of grade-school math word problems, each requiring several steps of arithmetic reasoning. It was long a default sanity check for reasoning, but top models now score near the ceiling, so it is most useful for separating smaller models.
Source: epoch · 27 open models ranked + 66 proprietary · Data through Nov 2024
Open models ranked on GSM8K
The # column shows rank among open models / rank overall (including proprietary).
| # | Model | Score |
|---|---|---|
| 1 / 2 | Qwen2.5 Coder 14B Instruct · 14.8B | 94.2% |
| 2 / 3 | Qwen2.5 Coder 32B Instruct · 32.8B | 93.0% |
| 3 / 9 | Qwen2.5 Coder 14B · 14.8B | 88.7% |
| 4 / 10 | DeepSeek Coder v2 Lite Instruct · 15.7B | 87.6% |
| 5 / 12 | Qwen2.5 Coder 7B Instruct · 7.6B | 86.7% |
| 6 / 13 | Phi 3.5 Mini Instruct · 3.8B | 86.2% |
| 7 / 15 | Gemma 2 9B · 9.2B | 84.9% |
| 8 / 17 | Qwen2.5 Coder 7B · 7.6B | 83.9% |
| 9 / 19 | Llama 3.1 8B Instruct · 8.0B | 82.4% |
| 10 / 21 | Qwen2.5 Coder 3B Instruct · 3.1B | 80.7% |
| 11 / 27 | DeepSeek Coder v2 Lite Base · 15.7B | 67.1% |
| 12 / 28 | Qwen2.5 Coder 1.5B · 1.5B | 65.8% |
| 13 / 29 | Llama 2 70B HF · 69.0B | 63.3% |
| 14 / 32 | Llama 2 70B Chat · 70B | 58.7% |
| 15 / 34 | Starcoder2 15B · 16.0B | 57.7% |
| 16 / 38 | Falcon 180B · 180B | 54.4% |
| 17 / 43 | Mistral 7B v0.1 · 7B | 50.0% |
| 18 / 44 | Gemma 7B · 8.5B | 46.4% |
| 19 / 54 | Mistral 7B Instruct v0.2 · 7B | 35.4% |
| 20 / 55 | Qwen2.5 Coder 0.5B · 494M | 34.5% |
| 21 / 59 | Starcoder2 7B · 7.2B | 32.7% |
| 22 / 74 | Gemma 2B · 2.5B | 17.7% |
| 23 / 77 | Llama 2 7B · 7B | 14.6% |
| 24 / 78 | Llama 7B · 6.7B | 11.0% |
| 25 / 79 | Bloom · 176.2B | 9.5% |
| 26 / 83 | Falcon 7B · 7.2B | 4.6% |
| 27 / 84 | Deepseek Coder 1.3B Base · 1.3B | 4.4% |
Score vs model size
These models give the most quality for their size, which makes them the ones worth running locally. Each one is on the efficiency frontier: no model of the same size or smaller scores higher.
- Qwen2.5 Coder 0.5B · 494M · 34.5%
- Qwen2.5 Coder 1.5B · 1.5B · 65.8%
- Qwen2.5 Coder 3B Instruct · 3.1B · 80.7%
- Phi 3.5 Mini Instruct · 3.8B · 86.2%
- Qwen2.5 Coder 7B Instruct · 7.6B · 86.7%
- Qwen2.5 Coder 14B Instruct · 14.8B · 94.2%
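The efficiency frontier can be recomputed from the leaderboard table. The following is a minimal sketch, not the site's actual implementation: it assumes "best score at its size or smaller" means a model's score is strictly higher than that of every model of equal or smaller size, and the function name `efficiency_frontier` is hypothetical. Sizes are in billions of parameters, copied from the table.

```python
# (name, size in billions of parameters, GSM8K score in %), copied from the table.
MODELS = [
    ("Qwen2.5 Coder 14B Instruct", 14.8, 94.2),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 93.0),
    ("Qwen2.5 Coder 14B", 14.8, 88.7),
    ("DeepSeek Coder v2 Lite Instruct", 15.7, 87.6),
    ("Qwen2.5 Coder 7B Instruct", 7.6, 86.7),
    ("Phi 3.5 Mini Instruct", 3.8, 86.2),
    ("Gemma 2 9B", 9.2, 84.9),
    ("Qwen2.5 Coder 7B", 7.6, 83.9),
    ("Llama 3.1 8B Instruct", 8.0, 82.4),
    ("Qwen2.5 Coder 3B Instruct", 3.1, 80.7),
    ("DeepSeek Coder v2 Lite Base", 15.7, 67.1),
    ("Qwen2.5 Coder 1.5B", 1.5, 65.8),
    ("Llama 2 70B HF", 69.0, 63.3),
    ("Llama 2 70B Chat", 70.0, 58.7),
    ("Starcoder2 15B", 16.0, 57.7),
    ("Falcon 180B", 180.0, 54.4),
    ("Mistral 7B v0.1", 7.0, 50.0),
    ("Gemma 7B", 8.5, 46.4),
    ("Mistral 7B Instruct v0.2", 7.0, 35.4),
    ("Qwen2.5 Coder 0.5B", 0.494, 34.5),
    ("Starcoder2 7B", 7.2, 32.7),
    ("Gemma 2B", 2.5, 17.7),
    ("Llama 2 7B", 7.0, 14.6),
    ("Llama 7B", 6.7, 11.0),
    ("Bloom", 176.2, 9.5),
    ("Falcon 7B", 7.2, 4.6),
    ("Deepseek Coder 1.3B Base", 1.3, 4.4),
]


def efficiency_frontier(models):
    """Return the models that no same-size-or-smaller model matches or beats."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; within one size, highest score first,
    # so only the best model at each size can join the frontier.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append((name, size, score))
            best_so_far = score
    return frontier


if __name__ == "__main__":
    for name, size, score in efficiency_frontier(MODELS):
        print(f"{name}: {size}B, {score}%")
```

Sorting once and tracking a running maximum keeps the check linear after the sort. Under the strict-inequality assumption, a larger model that only ties a smaller one is left off the frontier.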
GSM8K: frequently asked questions
- What is the best open LLM on GSM8K?
- Qwen2.5 Coder 14B Instruct is the top open model on GSM8K, scoring 94.2%. Among all models tested — including proprietary ones — it ranks #2.
- What's the best GSM8K model you can run on a 24 GB GPU?
- Qwen2.5 Coder 14B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 94.2% on GSM8K.
- What's the best GSM8K model you can run on a 12 GB GPU?
- Qwen2.5 Coder 14B Instruct is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 94.2% on GSM8K.
- Can open models match proprietary models on GSM8K?
- Not quite: the strongest proprietary model (DeepSeek-Coder-V2-Instruct) scores 94.5%, just ahead of the best open model (Qwen2.5 Coder 14B Instruct) at 94.2%. The open model, however, is one you can run yourself.
Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.
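The GPU-fit answers above rest on simple arithmetic: at 4-bit quantization each parameter takes half a byte, so 14.8B parameters need roughly 7.4 GB for weights alone. The sketch below illustrates that estimate. The 10% `overhead` factor is an assumption chosen only so the result lands near the quoted "about 8 GB"; it is not the site's formula, and real usage also depends on context length and runtime. The helper names are hypothetical.

```python
# (name, size in billions of parameters, GSM8K score in %): a few rows from the table.
MODELS = [
    ("Qwen2.5 Coder 14B Instruct", 14.8, 94.2),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 93.0),
    ("Phi 3.5 Mini Instruct", 3.8, 86.2),
    ("Llama 2 70B HF", 69.0, 63.3),
    ("Falcon 180B", 180.0, 54.4),
]


def estimated_gb(params_billions, bits=4, overhead=1.1):
    """Rough memory estimate in GB for quantized weights.

    bits / 8 is the number of bytes per parameter; overhead is an assumed
    fudge factor for runtime buffers, not a measured value.
    """
    return params_billions * bits / 8 * overhead


def best_fitting(models, vram_gb, bits=4):
    """Highest-scoring model whose estimate fits in vram_gb, or None."""
    fitting = [m for m in models if estimated_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


if __name__ == "__main__":
    for vram in (12, 24):
        print(vram, "GB:", best_fitting(MODELS, vram))
```

Under these assumptions the 32B model also fits in 24 GB, but it scores lower than the 14B model, which is why the same model is the answer for both GPU sizes.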