Math

GSM8K Leaderboard

GSM8K is a classic set of grade-school math word problems, each requiring several steps of arithmetic reasoning. It was long a default sanity check for reasoning, but top models now score near the ceiling, so it is most useful for separating smaller models.

Source: epoch · 27 open models ranked + 66 proprietary · Data through Nov 2024

Open models ranked on GSM8K

# shows rank among open models / rank overall (including proprietary).

#        Model                            Size    Score
1 / 2    Qwen2.5 Coder 14B Instruct       14.8B   94.2%
2 / 3    Qwen2.5 Coder 32B Instruct       32.8B   93.0%
3 / 9    Qwen2.5 Coder 14B                14.8B   88.7%
4 / 10   DeepSeek Coder v2 Lite Instruct  15.7B   87.6%
5 / 12   Qwen2.5 Coder 7B Instruct        7.6B    86.7%
6 / 13   Phi 3.5 Mini Instruct            3.8B    86.2%
7 / 15   Gemma 2 9B                       9.2B    84.9%
8 / 17   Qwen2.5 Coder 7B                 7.6B    83.9%
9 / 19   Llama 3.1 8B Instruct            8.0B    82.4%
10 / 21  Qwen2.5 Coder 3B Instruct        3.1B    80.7%
11 / 27  DeepSeek Coder v2 Lite Base      15.7B   67.1%
12 / 28  Qwen2.5 Coder 1.5B               1.5B    65.8%
13 / 29  Llama 2 70B HF                   69.0B   63.3%
14 / 32  Llama 2 70B Chat                 70B     58.7%
15 / 34  Starcoder2 15B                   16.0B   57.7%
16 / 38  Falcon 180B                      180B    54.4%
17 / 43  Mistral 7B v0.1                  7B      50.0%
18 / 44  Gemma 7B                         8.5B    46.4%
19 / 54  Mistral 7B Instruct v0.2         7B      35.4%
20 / 55  Qwen2.5 Coder 0.5B               494M    34.5%
21 / 59  Starcoder2 7B                    7.2B    32.7%
22 / 74  Gemma 2B                         2.5B    17.7%
23 / 77  Llama 2 7B                       7B      14.6%
24 / 78  Llama 7B                         6.7B    11.0%
25 / 79  Bloom                            176.2B  9.5%
26 / 83  Falcon 7B                        7.2B    4.6%
27 / 84  Deepseek Coder 1.3B Base         1.3B    4.4%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: GSM8K score (4.4% to 94.2%) against model size (log scale, 1B to 100B), one point per ranked open model.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
Models on the efficiency frontier (best score at their size or smaller):
  • Qwen2.5 Coder 0.5B · 494M · 34.5%
  • Qwen2.5 Coder 1.5B · 1.5B · 65.8%
  • Qwen2.5 Coder 3B Instruct · 3.1B · 80.7%
  • Phi 3.5 Mini Instruct · 3.8B · 86.2%
  • Qwen2.5 Coder 7B Instruct · 7.6B · 86.7%
  • Qwen2.5 Coder 14B Instruct · 14.8B · 94.2%
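
The frontier can be reproduced from the leaderboard numbers. The following is a minimal sketch, not the site's actual method: it assumes a model is on the frontier when its score is strictly higher than that of every model of equal or smaller size. The function name `efficiency_frontier` and the tuple layout are illustrative, and only a subset of the ranked models is included.

```python
def efficiency_frontier(models):
    """Return the names of models that beat every model of equal or smaller size.

    `models` is a list of (name, size_in_billions, score_percent) tuples.
    """
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; on a size tie, look at the higher score first.
    for name, _size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append(name)
            best_so_far = score
    return frontier


# Subset of the leaderboard above: (name, parameters in billions, GSM8K score in %).
models = [
    ("Qwen2.5 Coder 0.5B", 0.494, 34.5),
    ("Deepseek Coder 1.3B Base", 1.3, 4.4),
    ("Qwen2.5 Coder 1.5B", 1.5, 65.8),
    ("Gemma 2B", 2.5, 17.7),
    ("Qwen2.5 Coder 3B Instruct", 3.1, 80.7),
    ("Phi 3.5 Mini Instruct", 3.8, 86.2),
    ("Qwen2.5 Coder 7B Instruct", 7.6, 86.7),
    ("Gemma 2 9B", 9.2, 84.9),
    ("Qwen2.5 Coder 14B Instruct", 14.8, 94.2),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 93.0),
]

for name in efficiency_frontier(models):
    print(name)
```

Under that assumption, Qwen2.5 Coder 32B Instruct is left off the frontier because the smaller 14B Instruct model already scores higher.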

GSM8K: frequently asked questions

What is the best open LLM on GSM8K?
Qwen2.5 Coder 14B Instruct is the top open model on GSM8K, scoring 94.2%. Among all models tested — including proprietary ones — it ranks #2.
What's the best GSM8K model you can run on a 24 GB GPU?
Qwen2.5 Coder 14B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 94.2% on GSM8K.
What's the best GSM8K model you can run on a 12 GB GPU?
Qwen2.5 Coder 14B Instruct is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 94.2% on GSM8K.
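
The memory figures in these answers follow from simple arithmetic: at 4 bits per weight, each parameter takes half a byte, so 14.8B parameters need about 7.4 GB for weights alone. The page does not state how it arrives at "about 8 GB", so the sketch below is only an approximation; the `headroom_gb` parameter (for cache and runtime overhead) and both function names are assumptions made for illustration.

```python
def weight_gb(params_billion, bits=4):
    """Approximate weight memory in GB: parameters * bits / 8 bytes each."""
    return params_billion * bits / 8


def best_fit(models, vram_gb, bits=4, headroom_gb=2.0):
    """Return the highest-scoring model whose weights plus headroom fit in VRAM.

    headroom_gb is a hypothetical allowance for KV cache and runtime overhead.
    Returns None if nothing fits.
    """
    candidates = [
        m for m in models if weight_gb(m[1], bits) + headroom_gb <= vram_gb
    ]
    return max(candidates, key=lambda m: m[2], default=None)


# Subset of the leaderboard: (name, parameters in billions, GSM8K score in %).
models = [
    ("Qwen2.5 Coder 14B Instruct", 14.8, 94.2),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 93.0),
    ("Qwen2.5 Coder 7B Instruct", 7.6, 86.7),
    ("Phi 3.5 Mini Instruct", 3.8, 86.2),
]

for vram in (12, 24):
    name, params, score = best_fit(models, vram)
    print(f"{vram} GB: {name} ({weight_gb(params):.1f} GB of weights, {score}%)")
```

With these assumptions both the 12 GB and 24 GB cases select Qwen2.5 Coder 14B Instruct: the 32B model also fits in 24 GB, but it scores lower.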
Can open models match proprietary models on GSM8K?
Not quite: the strongest proprietary model (DeepSeek-Coder-V2-Instruct) scores 94.5%, 0.3 percentage points ahead of the best open model (Qwen2.5 Coder 14B Instruct) at 94.2%. The open model, however, is one you can run yourself.

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what it measures.