What is the best open LLM on GSM8K?

DeepSeek Coder v2 Instruct is the top open model on GSM8K, scoring 94.5%. It also ranks #1 among all models tested, ahead of every proprietary model we track, including GPT 4 (Mar 14) (OpenAI) at 92.0%.

What's the best GSM8K model you can run on a 24 GB GPU?

Qwen2.5 Coder 14B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 94.2% on GSM8K.

What's the best GSM8K model you can run on a 12 GB GPU?

Qwen2.5 Coder 14B Instruct is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 94.2% on GSM8K.
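
The "about 8 GB" figure in the answers above can be sanity-checked with simple arithmetic: at 4 bits per parameter, each billion parameters takes roughly 0.5 GB. The sketch below is only an illustration, not the site's method. The names `weight_gb` and `fits` and the 1 GB `overhead_gb` allowance are hypothetical choices made for this example; real overhead depends on context length and runtime.

```python
def weight_gb(params_billion: float, bits: int = 4) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits / 8


def fits(params_billion: float, vram_gb: float,
         bits: int = 4, overhead_gb: float = 1.0) -> bool:
    """Rough check that weights plus an assumed overhead fit in VRAM.

    overhead_gb is a placeholder assumption for KV cache and runtime
    buffers, not a measured value.
    """
    return weight_gb(params_billion, bits) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Parameter counts (in billions) are taken from the leaderboard table.
    for name, size in [("Qwen2.5 Coder 14B Instruct", 14.8),
                       ("Qwen2.5 Coder 32B Instruct", 32.8),
                       ("DeepSeek Coder v2 Instruct", 235.7)]:
        print(name, round(weight_gb(size), 2), "GB,",
              "fits 12 GB:", fits(size, 12), "fits 24 GB:", fits(size, 24))
```

Under these assumptions, Qwen2.5 Coder 14B Instruct (14.8B) needs about 7.4 GB for its weights and fits on a 12 GB card, while DeepSeek Coder v2 Instruct (235.7B) is far too large for 24 GB.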

Can open models match proprietary models on GSM8K?

Yes — the best open model (DeepSeek Coder v2 Instruct, 94.5%) matches or beats every proprietary model we track on GSM8K.

GSM8K Leaderboard

Name: GSM8K — open LLM scores
Creator: epoch

GSM8K is a classic set of grade-school math word problems requiring several arithmetic reasoning steps. It was long a default sanity check for reasoning, but top models now score near the ceiling, so it is most useful for separating smaller models.

Source: epoch · 59 open models ranked + 34 proprietary · Data through Nov 2024

Open models ranked on GSM8K

The # column shows rank among open models / rank overall (including proprietary models).

#	Model · Parameters	Score
1 / 1	DeepSeek Coder v2 Instruct · 235.7B	94.5%
2 / 2	Qwen2.5 Coder 14B Instruct · 14.8B	94.2%
3 / 3	Qwen2.5 Coder 32B Instruct · 32.8B	93.0%
4 / 6	Qwen2.5 Coder 32B · 32.8B	91.1%
5 / 8	Phi 3.5 MoE Instruct · 41.9B	88.7%
6 / 9	Qwen2.5 Coder 14B · 14.8B	88.7%
7 / 10	DeepSeek Coder v2 Lite Instruct · 15.7B	87.6%
8 / 12	Qwen2.5 Coder 7B Instruct · 7.6B	86.7%
9 / 13	Phi 3.5 Mini Instruct · 3.8B	86.2%
10 / 15	Gemma 2 9B · 9.2B	84.9%
11 / 17	Qwen2.5 Coder 7B · 7.6B	83.9%
12 / 19	Llama 3.1 8B Instruct · 8.0B	82.4%
13 / 21	Qwen2.5 Coder 3B Instruct · 3.1B	80.7%
14 / 23	Yi 34B Chat · 34.4B	76.0%
15 / 24	Qwen2.5 Coder 3B · 3.1B	75.7%
16 / 25	Mixtral 8x7B v0.1 · 46.7B	74.4%
17 / 26	Llama 2 70B HF · 69.0B	69.6%
18 / 27	StableBeluga2 · 70B	69.6%
19 / 28	Yi 34B · 34.4B	67.2%
20 / 29	DeepSeek Coder v2 Lite Base · 15.7B	67.1%
21 / 30	Qwen2.5 Coder 1.5B · 1.5B	65.8%
22 / 31	Internlm 20B · 20B	62.9%
23 / 32	Qwen 14B · 14.2B	61.3%
24 / 33	Qwen 14B Chat · 14.2B	61.2%
25 / 34	Llama 2 70B Chat HF · 69.0B	58.7%
26 / 36	Starcoder2 15B · 16.0B	57.7%
27 / 39	Falcon 180B · 180B	54.4%
28 / 41	Mistral 7B v0.1 · 7B	54.4%
29 / 42	Falcon 11B · 11.1B	53.8%
30 / 43	Baichuan2 13B Base · 13B	52.8%
31 / 44	Qwen 7B · 7.7B	51.7%
32 / 45	Gemma 7B · 8.5B	46.4%
33 / 47	Baichuan2 13B Chat · 13B	45.7%
34 / 48	Yi 6B Chat · 6.1B	44.9%
35 / 52	INTELLECT 1 Instruct · 10.2B	38.6%
36 / 53	CodeQwen1.5 7B · 7.3B	37.7%
37 / 54	Llama 2 13B Chat HF · 13.0B	36.9%
38 / 56	Mistral 7B Instruct v0.2 · 7.2B	35.4%
39 / 57	Qwen2.5 Coder 0.5B · 494M	34.5%
40 / 59	Llama 2 13B HF · 13.0B	34.3%
41 / 60	Falcon 40B Instruct · 40B	33.8%
42 / 62	Starcoder2 7B · 7.2B	32.7%
43 / 63	Yi 6B · 6.1B	32.5%
44 / 64	Chatglm2 6B · 6B	32.4%
45 / 65	Internlm 7B · 7B	31.2%
46 / 67	Baichuan 13B Base · 13B	26.8%
47 / 68	Falcon 40B · 41.8B	25.0%
48 / 69	Baichuan2 7B Base · 7B	24.6%
49 / 70	Vicuna 13B V1.3 · 13B	22.6%
50 / 71	Starcoder2 3B · 3.0B	21.6%
51 / 73	Qwen 1.8B · 1.8B	21.2%
52 / 75	Gemma 2B · 2.5B	17.7%
53 / 76	Llama 2 7B HF · 6.7B	16.7%
54 / 78	Internlm Chat 20B · 20B	15.7%
55 / 79	Llama 7B · 6.7B	11.0%
56 / 80	Bloom · 176.2B	9.5%
57 / 81	Baichuan 7B · 7B	9.2%
58 / 84	Falcon 7B · 7.2B	6.8%
59 / 85	DeepSeek Coder 1.3B Base · 1.3B	4.4%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
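
The frontier described above can be sketched as a running maximum over models sorted by size. This is a minimal illustration, assuming a model joins the frontier only when it beats every model of the same size or smaller; the site may compute it differently. `Model`, `ROWS`, and `efficiency_frontier` are hypothetical names, and `ROWS` is a ten-model subset of the table above rather than the full leaderboard.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions
    score: float     # GSM8K score in percent


# A subset of rows copied from the leaderboard table.
ROWS = [
    Model("Qwen2.5 Coder 0.5B", 0.494, 34.5),
    Model("Deepseek Coder 1.3B Base", 1.3, 4.4),
    Model("Qwen2.5 Coder 1.5B", 1.5, 65.8),
    Model("Qwen2.5 Coder 3B Instruct", 3.1, 80.7),
    Model("Phi 3.5 Mini Instruct", 3.8, 86.2),
    Model("Qwen2.5 Coder 7B Instruct", 7.6, 86.7),
    Model("Llama 3.1 8B Instruct", 8.0, 82.4),
    Model("Qwen2.5 Coder 14B Instruct", 14.8, 94.2),
    Model("Qwen2.5 Coder 32B Instruct", 32.8, 93.0),
    Model("DeepSeek Coder v2 Instruct", 235.7, 94.5),
]


def efficiency_frontier(models: List[Model]) -> List[Model]:
    """Return models that set a new best score at their size or smaller."""
    frontier = []
    best = float("-inf")
    # Sort by size; at equal size, put the higher score first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best:
            frontier.append(m)
            best = m.score
    return frontier


if __name__ == "__main__":
    for m in efficiency_frontier(ROWS):
        print(f"{m.params_b:>7.1f}B  {m.score:5.1f}%  {m.name}")
```

On this subset, Llama 3.1 8B Instruct and Qwen2.5 Coder 32B Instruct fall below the frontier because a smaller model already scores higher.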

Scores are aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the "about benchmarks" page for what it measures.