What is the best open LLM on GSM8K?

DeepSeek Coder v2 Instruct is the top open model on GSM8K, scoring 94.5%. It also ranks #1 among all models tested, ahead of every proprietary model we track, including OpenAI's GPT 4 (Mar 14) at 92.0%.

What's the best GSM8K model you can run on a 24 GB GPU?

Qwen2.5 Coder 14B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 94.2% on GSM8K.

What's the best GSM8K model you can run on a 12 GB GPU?

Qwen2.5 Coder 14B Instruct is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 94.2% on GSM8K.
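
The "about 8 GB" figure is consistent with a simple back-of-the-envelope estimate: parameter count times bits per weight, plus some headroom. The sketch below is a rough illustration, not the formula this page uses (the page does not state one). The function names and the 10% overhead factor are assumptions chosen for illustration, and real memory use also depends on context length and the runtime.

```python
def estimate_weight_gb(params_billion, bits_per_weight=4.0, overhead=1.1):
    """Rough size of quantized weights in GB (1 GB = 1e9 bytes).

    overhead is an assumed fudge factor for quantization metadata;
    it is not a figure taken from the leaderboard.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9


def fits_in_vram(params_billion, vram_gb, bits_per_weight=4.0):
    """True if the estimated weights fit within the given VRAM budget."""
    return estimate_weight_gb(params_billion, bits_per_weight) <= vram_gb


# Parameter counts (in billions) as listed on the leaderboard.
for name, params in [
    ("Qwen2.5 Coder 14B Instruct", 14.8),
    ("Qwen2.5 Coder 32B Instruct", 32.8),
    ("DeepSeek Coder v2 Instruct", 235.7),
]:
    size = estimate_weight_gb(params)
    print(f"{name}: ~{size:.1f} GB, fits 12 GB: {fits_in_vram(params, 12)}, "
          f"fits 24 GB: {fits_in_vram(params, 24)}")
```

Under these assumptions the 14.8B model comes to roughly 8 GB and fits both budgets. The 32.8B model would fit only the 24 GB card, but it scores lower (93.0%), and the 235.7B model fits neither.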

Can open models match proprietary models on GSM8K?

Yes. The best open model (DeepSeek Coder v2 Instruct, 94.5%) beats every proprietary model we track on GSM8K; the highest proprietary score is 92.0%, from GPT 4 (Mar 14).

GSM8K Leaderboard

Name: GSM8K — open LLM scores
Creator: epoch

GSM8K is a classic set of grade-school math word problems, each requiring several steps of arithmetic reasoning. It was long a default sanity check for reasoning, but top models now score near the ceiling, so it is most useful for separating smaller models.

Source: epoch · 59 open models ranked + 34 proprietary · Data through Nov 2024

All models ranked on GSM8K

Proprietary / closed models are marked "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.

#	Model	Score
1	DeepSeek Coder v2 Instruct · 235.7B	94.5%
2	Qwen2.5 Coder 14B Instruct · 14.8B	94.2%
3	Qwen2.5 Coder 32B Instruct · 32.8B	93.0%
4	GPT 4 (Mar 14) · proprietary	92.0%
5	GPT 4o Mini (Jul 18, 2024) · proprietary	91.3%
6	Qwen2.5 Coder 32B · 32.8B	91.1%
7	GPT 4 (Jun 13) · proprietary	90.0%
8	Phi 3.5 MoE Instruct · 41.9B	88.7%
9	Qwen2.5 Coder 14B · 14.8B	88.7%
10	DeepSeek Coder v2 Lite Instruct · 15.7B	87.6%
11	Claude Instant 1.2 · proprietary	86.7%
12	Qwen2.5 Coder 7B Instruct · 7.6B	86.7%
13	Phi 3.5 Mini Instruct · 3.8B	86.2%
14	DeepSeek Coder V2 Base · proprietary	85.8%
15	Gemma 2 9B · 9.2B	84.9%
16	Mistral Nemo Base 2407 · proprietary	84.2%
17	Qwen2.5 Coder 7B · 7.6B	83.9%
18	Gemini 1.5 Flash 001 · proprietary	82.4%
19	Llama 3.1 8B Instruct · 8.0B	82.4%
20	Claude Instant 1.1 · proprietary	80.9%
21	Qwen2.5 Coder 3B Instruct · 3.1B	80.7%
22	Text Davinci 003 · proprietary	78.2%
23	Yi 34B Chat · 34.4B	76.0%
24	Qwen2.5 Coder 3B · 3.1B	75.7%
25	Mixtral 8x7B v0.1 · 46.7B	74.4%
26	Llama 2 70B HF · 69.0B	69.6%
27	StableBeluga2 · 70B	69.6%
28	Yi 34B · 34.4B	67.2%
29	DeepSeek Coder v2 Lite Base · 15.7B	67.1%
30	Qwen2.5 Coder 1.5B · 1.5B	65.8%
31	Internlm 20B · 20B	62.9%
32	Qwen 14B · 14.2B	61.3%
33	Qwen 14B Chat · 14.2B	61.2%
34	Llama 2 70B Chat HF · 69.0B	58.7%
35	GPT 3.5 Turbo (Jun 13) · proprietary	57.8%
36	Starcoder2 15B · 16.0B	57.7%
37	Code Davinci 002 · proprietary	56.8%
38	PaLM 540B · proprietary	56.5%
39	Falcon 180B · 180B	54.4%
40	Llama 65B · proprietary	54.4%
41	Mistral 7B v0.1 · 7B	54.4%
42	Falcon 11B · 11.1B	53.8%
43	Baichuan2 13B Base · 13B	52.8%
44	Qwen 7B · 7.7B	51.7%
45	Gemma 7B · 8.5B	46.4%
46	Nemotron 4 15B · proprietary	46.0%
47	Baichuan2 13B Chat · 13B	45.7%
48	Yi 6B Chat · 6.1B	44.9%
49	Llama 33B · proprietary	44.1%
50	Llama 2 34B · proprietary	42.2%
51	Text Davinci 002 · proprietary	41.5%
52	INTELLECT 1 Instruct · 10.2B	38.6%
53	CodeQwen1.5 7B · 7.3B	37.7%
54	Llama 2 13B Chat HF · 13.0B	36.9%
55	DeepSeek Coder 33B Base · proprietary	35.4%
56	Mistral 7B Instruct v0.2 · 7.2B	35.4%
57	Qwen2.5 Coder 0.5B · 494M	34.5%
58	Mpt 30B Instruct · proprietary	34.4%
59	Llama 2 13B HF · 13.0B	34.3%
60	Falcon 40B Instruct · 40B	33.8%
61	PaLM 62B · proprietary	33.0%
62	Starcoder2 7B · 7.2B	32.7%
63	Yi 6B · 6.1B	32.5%
64	Chatglm2 6B · 6B	32.4%
65	Internlm 7B · 7B	31.2%
66	Vicuna 13B v1.1 · proprietary	28.1%
67	Baichuan 13B Base · 13B	26.8%
68	Falcon 40B · 41.8B	25.0%
69	Baichuan2 7B Base · 7B	24.6%
70	Vicuna 13B V1.3 · 13B	22.6%
71	Starcoder2 3B · 3.0B	21.6%
72	DeepSeek Coder 6.7B Base · proprietary	21.3%
73	Qwen 1.8B · 1.8B	21.2%
74	Llama 13B · proprietary	20.5%
75	Gemma 2B · 2.5B	17.7%
76	Llama 2 7B HF · 6.7B	16.7%
77	Mpt 30B · proprietary	16.4%
78	Internlm Chat 20B · 20B	15.7%
79	Llama 7B · 6.7B	11.0%
80	Bloom · 176.2B	9.5%
81	Baichuan 7B · 7B	9.2%
82	Mpt 7B · proprietary	9.1%
83	Davinci · proprietary	9.0%
84	Falcon 7B · 7.2B	6.8%
85	DeepSeek Coder 1.3B Base · 1.3B	4.4%
86	Opt 175B · proprietary	4.0%
87	Opt 66B · proprietary	1.8%
88	Curie · proprietary	1.6%
89	Babbage · proprietary	0.7%
90	Ada · proprietary	0.6%
91	Text Curie 001 · proprietary	0.6%
92	Text Ada 001 · proprietary	0.4%
93	Text Babbage 001 · proprietary	0.0%
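
For readers who want to work with the rows above programmatically, here is a minimal parsing sketch. It assumes each row is tab-separated (rank, model, score) and that the model field ends with " · " followed by a parameter count such as "14.8B" or "494M", or the word "proprietary". The function name `parse_row` and the dictionary keys are hypothetical, chosen only for this example.

```python
def parse_row(line):
    """Parse one leaderboard row into a dict.

    params_billion is None for rows marked "proprietary".
    """
    rank, model, score = line.rstrip("\n").split("\t")
    # Split on the last " · " so names containing the dot stay intact.
    name, _, size = model.rpartition(" · ")
    if size == "proprietary":
        params = None
    elif size.endswith("M"):
        params = float(size[:-1]) / 1000  # millions -> billions
    else:
        params = float(size.rstrip("B"))
    return {
        "rank": int(rank),
        "name": name,
        "params_billion": params,
        "score": float(score.rstrip("%")),
    }


rows = [
    "1\tDeepSeek Coder v2 Instruct · 235.7B\t94.5%",
    "4\tGPT 4 (Mar 14) · proprietary\t92.0%",
    "57\tQwen2.5 Coder 0.5B · 494M\t34.5%",
]
for row in rows:
    print(parse_row(row))
```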

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
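
The efficiency frontier described above can be computed with a single pass over the models sorted by size. The sketch below is one way to do it, under the assumption that a model is on the frontier when it scores strictly higher than every model of equal or smaller size. The function name `efficiency_frontier` is hypothetical, and the sample is a hand-picked subset of the table, not the full dataset.

```python
def efficiency_frontier(models):
    """Return the models that beat every model of equal or smaller size.

    models: iterable of (name, params_billion, score) tuples.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by size ascending; at equal size, put the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, size, score))
            best_score = score
    return frontier


# Subset of open models from the table: (name, size in billions, GSM8K %).
sample = [
    ("Qwen2.5 Coder 0.5B", 0.494, 34.5),
    ("Qwen2.5 Coder 1.5B", 1.5, 65.8),
    ("Qwen2.5 Coder 3B Instruct", 3.1, 80.7),
    ("Phi 3.5 Mini Instruct", 3.8, 86.2),
    ("Qwen2.5 Coder 7B Instruct", 7.6, 86.7),
    ("Llama 3.1 8B Instruct", 8.0, 82.4),
    ("Qwen2.5 Coder 14B Instruct", 14.8, 94.2),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 93.0),
    ("DeepSeek Coder v2 Instruct", 235.7, 94.5),
]

for name, size, score in efficiency_frontier(sample):
    print(f"{size:>6.1f}B  {score:.1f}%  {name}")
```

In this sample, Llama 3.1 8B Instruct and Qwen2.5 Coder 32B Instruct fall below the frontier because a smaller model already scores higher.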

Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the about-benchmarks page for what it measures.