Math

GSM8K Leaderboard

GSM8K is a classic set of grade-school math word problems, each requiring several arithmetic reasoning steps. It was long a default sanity check for reasoning, but top models now score near the ceiling, so it is most useful for separating smaller models.

Source: epoch · 27 open models ranked + 66 proprietary · Data through Nov 2024

All models ranked on GSM8K

Proprietary / closed models are shown dimmed — you can't run them locally, but they show where the open field stands.

# · Model · Size · Score
1 · DeepSeek-Coder-V2-Instruct · proprietary · 94.5%
2 · Qwen2.5 Coder 14B Instruct · 14.8B · 94.2%
3 · Qwen2.5 Coder 32B Instruct · 32.8B · 93.0%
4 · gpt-4-0314 · proprietary · 92.0%
5 · gpt-4o-mini-2024-07-18 · proprietary · 91.3%
6 · Qwen2.5-Coder-32B · proprietary · 91.1%
7 · gpt-4-0613 · proprietary · 90.0%
8 · Phi-3.5-MoE-instruct · proprietary · 88.7%
9 · Qwen2.5 Coder 14B · 14.8B · 88.7%
10 · DeepSeek Coder v2 Lite Instruct · 15.7B · 87.6%
11 · claude-instant-1.2 · proprietary · 86.7%
12 · Qwen2.5 Coder 7B Instruct · 7.6B · 86.7%
13 · Phi 3.5 Mini Instruct · 3.8B · 86.2%
14 · DeepSeek-Coder-V2-Base · proprietary · 85.8%
15 · Gemma 2 9B · 9.2B · 84.9%
16 · Mistral-Nemo-Base-2407 · proprietary · 84.2%
17 · Qwen2.5 Coder 7B · 7.6B · 83.9%
18 · gemini-1.5-flash-001 · proprietary · 82.4%
19 · Llama 3.1 8B Instruct · 8.0B · 82.4%
20 · claude-instant-1.1 · proprietary · 80.9%
21 · Qwen2.5 Coder 3B Instruct · 3.1B · 80.7%
22 · Yi-34B-Chat · proprietary · 76.0%
23 · Qwen2.5-Coder-3B · proprietary · 75.7%
24 · Mixtral-8x7B-v0.1 · proprietary · 74.4%
25 · StableBeluga2 · proprietary · 69.6%
26 · Yi-34B · proprietary · 67.2%
27 · DeepSeek Coder v2 Lite Base · 15.7B · 67.1%
28 · Qwen2.5 Coder 1.5B · 1.5B · 65.8%
29 · Llama 2 70B HF · 69.0B · 63.3%
30 · Qwen-14B · proprietary · 61.3%
31 · Qwen-14B-Chat · proprietary · 61.2%
32 · Llama 2 70B Chat · 70B · 58.7%
33 · gpt-3.5-turbo-0613 · proprietary · 57.8%
34 · Starcoder2 15B · 16.0B · 57.7%
35 · text-davinci-003 · proprietary · 57.1%
36 · code-davinci-002 · proprietary · 56.8%
37 · PaLM 540B · proprietary · 56.5%
38 · Falcon 180B · 180B · 54.4%
39 · falcon-11b · proprietary · 53.8%
40 · Baichuan-2-13B-Base · proprietary · 52.8%
41 · Qwen-7B · proprietary · 51.7%
42 · LLaMA-65B · proprietary · 50.9%
43 · Mistral 7B v0.1 · 7B · 50.0%
44 · Gemma 7B · 8.5B · 46.4%
45 · Nemotron-4 15B · proprietary · 46.0%
46 · Yi-6B-Chat · proprietary · 44.9%
47 · internlm-20b · proprietary · 43.4%
48 · LLaMA-33B · proprietary · 42.3%
49 · Llama-2-34b · proprietary · 42.2%
50 · text-davinci-002 · proprietary · 41.5%
51 · INTELLECT-1-Instruct · proprietary · 38.6%
52 · CodeQwen1.5-7B · proprietary · 37.7%
53 · deepseek-coder-33b-base · proprietary · 35.4%
54 · Mistral 7B Instruct v0.2 · 7B · 35.4%
55 · Qwen2.5 Coder 0.5B · 494M · 34.5%
56 · mpt-30b-instruct · proprietary · 34.4%
57 · falcon-40b-instruct · proprietary · 33.8%
58 · PaLM 62B · proprietary · 33.0%
59 · Starcoder2 7B · 7.2B · 32.7%
60 · Yi-6B · proprietary · 32.5%
61 · chatglm2-6b · proprietary · 32.4%
62 · internlm-7b · proprietary · 31.2%
63 · Llama-2-13b · proprietary · 29.6%
64 · vicuna-13b-v1.1 · proprietary · 28.1%
65 · Baichuan-13B-Base · proprietary · 26.8%
66 · Baichuan-2-7B-Base · proprietary · 24.6%
67 · Baichuan2-13B-Chat · proprietary · 23.3%
68 · vicuna-13b-v1.3 · proprietary · 22.6%
69 · starcoder2-3b · proprietary · 21.6%
70 · falcon-40b · proprietary · 21.5%
71 · deepseek-coder-6.7b-base · proprietary · 21.3%
72 · Qwen-1_8B · proprietary · 21.2%
73 · LLaMA-13B · proprietary · 20.3%
74 · Gemma 2B · 2.5B · 17.7%
75 · internlm-chat-20b · proprietary · 15.7%
76 · mpt-30b · proprietary · 15.2%
77 · Llama 2 7B · 7B · 14.6%
78 · Llama 7B · 6.7B · 11.0%
79 · Bloom · 176.2B · 9.5%
80 · Baichuan-7B · proprietary · 9.2%
81 · davinci · proprietary · 9.0%
82 · mpt-7b · proprietary · 6.8%
83 · Falcon 7B · 7.2B · 4.6%
84 · Deepseek Coder 1.3B Base · 1.3B · 4.4%
85 · opt-175b · proprietary · 4.0%
86 · Llama-2-13b-chat · proprietary · 2.7%
87 · opt-66b · proprietary · 1.8%
88 · curie · proprietary · 1.6%
89 · babbage · proprietary · 0.7%
90 · ada · proprietary · 0.6%
91 · text-curie-001 · proprietary · 0.6%
92 · text-ada-001 · proprietary · 0.4%
93 · text-babbage-001 · proprietary · 0.0%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: GSM8K score (4.4% to 94.2%) against model size on a log scale (1B to 100B) for the 27 open models listed in the ranking.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • Qwen2.5 Coder 0.5B, 494M, score 34.5%: on the efficiency frontier (best score at its size or smaller).
  • Qwen2.5 Coder 1.5B, 1.5B, score 65.8%: on the efficiency frontier (best score at its size or smaller).
  • Qwen2.5 Coder 3B Instruct, 3.1B, score 80.7%: on the efficiency frontier (best score at its size or smaller).
  • Phi 3.5 Mini Instruct, 3.8B, score 86.2%: on the efficiency frontier (best score at its size or smaller).
  • Qwen2.5 Coder 7B Instruct, 7.6B, score 86.7%: on the efficiency frontier (best score at its size or smaller).
  • Qwen2.5 Coder 14B Instruct, 14.8B, score 94.2%: on the efficiency frontier (best score at its size or smaller).
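
The frontier rule described above (a model qualifies if no model of equal or smaller size scores higher) can be sketched in a few lines of Python. This is only an illustration, not the page's actual implementation: the `Model` type and `efficiency_frontier` function are hypothetical names, and the sample data is a subset of the open models, with sizes and scores copied from the ranking.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    size_b: float  # parameter count in billions (from the ranking)
    score: float   # GSM8K score in percent (from the ranking)


def efficiency_frontier(models):
    """Return the models that beat every model of equal or smaller size."""
    frontier = []
    best = float("-inf")
    # Walk from smallest to largest; on a size tie, visit the higher score first
    # so only the best model at that size can qualify.
    for m in sorted(models, key=lambda m: (m.size_b, -m.score)):
        if m.score > best:
            frontier.append(m)
            best = m.score
    return frontier


# Subset of the open models in the ranking (illustrative, not the full list).
MODELS = [
    Model("Qwen2.5 Coder 0.5B", 0.494, 34.5),
    Model("Deepseek Coder 1.3B Base", 1.3, 4.4),
    Model("Qwen2.5 Coder 1.5B", 1.5, 65.8),
    Model("Gemma 2B", 2.5, 17.7),
    Model("Qwen2.5 Coder 3B Instruct", 3.1, 80.7),
    Model("Phi 3.5 Mini Instruct", 3.8, 86.2),
    Model("Qwen2.5 Coder 7B Instruct", 7.6, 86.7),
    Model("Qwen2.5 Coder 7B", 7.6, 83.9),
    Model("Llama 3.1 8B Instruct", 8.0, 82.4),
    Model("Gemma 2 9B", 9.2, 84.9),
    Model("Qwen2.5 Coder 14B Instruct", 14.8, 94.2),
    Model("Qwen2.5 Coder 32B Instruct", 32.8, 93.0),
]

for m in efficiency_frontier(MODELS):
    print(f"{m.name}: {m.size_b}B, {m.score}%")
```

On this subset the sketch selects the same six models listed above. Qwen2.5 Coder 32B Instruct is left out because the smaller 14B Instruct model already scores higher.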

GSM8K: frequently asked questions

What is the best open LLM on GSM8K?
Qwen2.5 Coder 14B Instruct is the top open model on GSM8K, scoring 94.2%. Among all models tested — including proprietary ones — it ranks #2.
What's the best GSM8K model you can run on a 24 GB GPU?
Qwen2.5 Coder 14B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 94.2% on GSM8K.
What's the best GSM8K model you can run on a 12 GB GPU?
Qwen2.5 Coder 14B Instruct is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 94.2% on GSM8K.
Can open models match proprietary models on GSM8K?
Not quite on GSM8K: the strongest proprietary model (DeepSeek-Coder-V2-Instruct) scores 94.5%, ahead of the best open model (Qwen2.5 Coder 14B Instruct) at 94.2% — but you can run the open one yourself.
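
The "about 8 GB" figure in the GPU answers above is consistent with a simple back-of-the-envelope estimate: parameter count times bits per weight, divided by 8 bits per byte. The sketch below is an assumption about how such an estimate could be made, not the page's stated method. The function names and the `overhead_gb` allowance (for KV cache and runtime buffers) are hypothetical, and real quantized files vary by format.

```python
def quantized_weight_gb(params_billions, bits_per_weight=4.0):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_weight / 8.0


def fits_in_vram(params_billions, vram_gb, bits_per_weight=4.0, overhead_gb=1.0):
    """Rough fit check. overhead_gb is an assumed allowance, not a measured value."""
    return quantized_weight_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


# Sizes taken from the ranking: 14.8B and 32.8B parameters.
print(quantized_weight_gb(14.8))   # weights alone at 4-bit, in GB
print(fits_in_vram(14.8, 12))      # 14B Instruct on a 12 GB GPU
print(fits_in_vram(32.8, 24))      # 32B Instruct on a 24 GB GPU
print(fits_in_vram(32.8, 12))      # 32B Instruct on a 12 GB GPU
```

Under these assumptions the 14.8B model needs roughly 7.4 GB for weights, so it fits both GPU sizes. The 32.8B model would also fit in 24 GB, but it scores lower (93.0%), which is why the same model is the answer to both questions.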

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the About benchmarks page for what GSM8K measures.