GSM8K Leaderboard
GSM8K is a classic set of grade-school math word problems, each requiring several steps of arithmetic reasoning. It was long a default sanity check for reasoning, but top models now score near the ceiling, so it is most useful for separating smaller models.
Source: epoch · 27 open models ranked + 66 proprietary · Data through Nov 2024
All models ranked on GSM8K
Proprietary / closed models are labeled "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | DeepSeek-Coder-V2-Instruct · proprietary | 94.5% |
| 2 | Qwen2.5 Coder 14B Instruct · 14.8B | 94.2% |
| 3 | Qwen2.5 Coder 32B Instruct · 32.8B | 93.0% |
| 4 | gpt-4-0314 · proprietary | 92.0% |
| 5 | gpt-4o-mini-2024-07-18 · proprietary | 91.3% |
| 6 | Qwen2.5-Coder-32B · proprietary | 91.1% |
| 7 | gpt-4-0613 · proprietary | 90.0% |
| 8 | Phi-3.5-MoE-instruct · proprietary | 88.7% |
| 9 | Qwen2.5 Coder 14B · 14.8B | 88.7% |
| 10 | DeepSeek Coder v2 Lite Instruct · 15.7B | 87.6% |
| 11 | claude-instant-1.2 · proprietary | 86.7% |
| 12 | Qwen2.5 Coder 7B Instruct · 7.6B | 86.7% |
| 13 | Phi 3.5 Mini Instruct · 3.8B | 86.2% |
| 14 | DeepSeek-Coder-V2-Base · proprietary | 85.8% |
| 15 | Gemma 2 9B · 9.2B | 84.9% |
| 16 | Mistral-Nemo-Base-2407 · proprietary | 84.2% |
| 17 | Qwen2.5 Coder 7B · 7.6B | 83.9% |
| 18 | gemini-1.5-flash-001 · proprietary | 82.4% |
| 19 | Llama 3.1 8B Instruct · 8.0B | 82.4% |
| 20 | claude-instant-1.1 · proprietary | 80.9% |
| 21 | Qwen2.5 Coder 3B Instruct · 3.1B | 80.7% |
| 22 | Yi-34B-Chat · proprietary | 76.0% |
| 23 | Qwen2.5-Coder-3B · proprietary | 75.7% |
| 24 | Mixtral-8x7B-v0.1 · proprietary | 74.4% |
| 25 | StableBeluga2 · proprietary | 69.6% |
| 26 | Yi-34B · proprietary | 67.2% |
| 27 | DeepSeek Coder v2 Lite Base · 15.7B | 67.1% |
| 28 | Qwen2.5 Coder 1.5B · 1.5B | 65.8% |
| 29 | Llama 2 70B HF · 69.0B | 63.3% |
| 30 | Qwen-14B · proprietary | 61.3% |
| 31 | Qwen-14B-Chat · proprietary | 61.2% |
| 32 | Llama 2 70B Chat · 70B | 58.7% |
| 33 | gpt-3.5-turbo-0613 · proprietary | 57.8% |
| 34 | Starcoder2 15B · 16.0B | 57.7% |
| 35 | text-davinci-003 · proprietary | 57.1% |
| 36 | code-davinci-002 · proprietary | 56.8% |
| 37 | PaLM 540B · proprietary | 56.5% |
| 38 | Falcon 180B · 180B | 54.4% |
| 39 | falcon-11b · proprietary | 53.8% |
| 40 | Baichuan-2-13B-Base · proprietary | 52.8% |
| 41 | Qwen-7B · proprietary | 51.7% |
| 42 | LLaMA-65B · proprietary | 50.9% |
| 43 | Mistral 7B v0.1 · 7B | 50.0% |
| 44 | Gemma 7B · 8.5B | 46.4% |
| 45 | Nemotron-4 15B · proprietary | 46.0% |
| 46 | Yi-6B-Chat · proprietary | 44.9% |
| 47 | internlm-20b · proprietary | 43.4% |
| 48 | LLaMA-33B · proprietary | 42.3% |
| 49 | Llama-2-34b · proprietary | 42.2% |
| 50 | text-davinci-002 · proprietary | 41.5% |
| 51 | INTELLECT-1-Instruct · proprietary | 38.6% |
| 52 | CodeQwen1.5-7B · proprietary | 37.7% |
| 53 | deepseek-coder-33b-base · proprietary | 35.4% |
| 54 | Mistral 7B Instruct v0.2 · 7B | 35.4% |
| 55 | Qwen2.5 Coder 0.5B · 494M | 34.5% |
| 56 | mpt-30b-instruct · proprietary | 34.4% |
| 57 | falcon-40b-instruct · proprietary | 33.8% |
| 58 | PaLM 62B · proprietary | 33.0% |
| 59 | Starcoder2 7B · 7.2B | 32.7% |
| 60 | Yi-6B · proprietary | 32.5% |
| 61 | chatglm2-6b · proprietary | 32.4% |
| 62 | internlm-7b · proprietary | 31.2% |
| 63 | Llama-2-13b · proprietary | 29.6% |
| 64 | vicuna-13b-v1.1 · proprietary | 28.1% |
| 65 | Baichuan-13B-Base · proprietary | 26.8% |
| 66 | Baichuan-2-7B-Base · proprietary | 24.6% |
| 67 | Baichuan2-13B-Chat · proprietary | 23.3% |
| 68 | vicuna-13b-v1.3 · proprietary | 22.6% |
| 69 | starcoder2-3b · proprietary | 21.6% |
| 70 | falcon-40b · proprietary | 21.5% |
| 71 | deepseek-coder-6.7b-base · proprietary | 21.3% |
| 72 | Qwen-1_8B · proprietary | 21.2% |
| 73 | LLaMA-13B · proprietary | 20.3% |
| 74 | Gemma 2B · 2.5B | 17.7% |
| 75 | internlm-chat-20b · proprietary | 15.7% |
| 76 | mpt-30b · proprietary | 15.2% |
| 77 | Llama 2 7B · 7B | 14.6% |
| 78 | Llama 7B · 6.7B | 11.0% |
| 79 | Bloom · 176.2B | 9.5% |
| 80 | Baichuan-7B · proprietary | 9.2% |
| 81 | davinci · proprietary | 9.0% |
| 82 | mpt-7b · proprietary | 6.8% |
| 83 | Falcon 7B · 7.2B | 4.6% |
| 84 | Deepseek Coder 1.3B Base · 1.3B | 4.4% |
| 85 | opt-175b · proprietary | 4.0% |
| 86 | Llama-2-13b-chat · proprietary | 2.7% |
| 87 | opt-66b · proprietary | 1.8% |
| 88 | curie · proprietary | 1.6% |
| 89 | babbage · proprietary | 0.7% |
| 90 | ada · proprietary | 0.6% |
| 91 | text-curie-001 · proprietary | 0.6% |
| 92 | text-ada-001 · proprietary | 0.4% |
| 93 | text-babbage-001 · proprietary | 0.0% |
Score vs model size
Which models give the most quality for their size — the ones worth running locally.
- Qwen2.5 Coder 0.5B, 494M, score 34.5%: on the efficiency frontier (best score at its size or smaller).
- Qwen2.5 Coder 1.5B, 1.5B, score 65.8%: on the efficiency frontier (best score at its size or smaller).
- Qwen2.5 Coder 3B Instruct, 3.1B, score 80.7%: on the efficiency frontier (best score at its size or smaller).
- Phi 3.5 Mini Instruct, 3.8B, score 86.2%: on the efficiency frontier (best score at its size or smaller).
- Qwen2.5 Coder 7B Instruct, 7.6B, score 86.7%: on the efficiency frontier (best score at its size or smaller).
- Qwen2.5 Coder 14B Instruct, 14.8B, score 94.2%: on the efficiency frontier (best score at its size or smaller).
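The "best score at its size or smaller" rule can be sketched in a few lines of Python. This is an illustration only, not the site's actual code: the function name `efficiency_frontier` is hypothetical, and the input is a hand-copied subset of the open models in the table above (size in billions of parameters, score in percent).

```python
# (name, size in billions of parameters, GSM8K score in %)
# Hand-copied subset of the open models in the table; not the full list.
models = [
    ("Qwen2.5 Coder 0.5B", 0.494, 34.5),
    ("Deepseek Coder 1.3B Base", 1.3, 4.4),
    ("Qwen2.5 Coder 1.5B", 1.5, 65.8),
    ("Gemma 2B", 2.5, 17.7),
    ("Qwen2.5 Coder 3B Instruct", 3.1, 80.7),
    ("Phi 3.5 Mini Instruct", 3.8, 86.2),
    ("Qwen2.5 Coder 7B", 7.6, 83.9),
    ("Qwen2.5 Coder 7B Instruct", 7.6, 86.7),
    ("Llama 3.1 8B Instruct", 8.0, 82.4),
    ("Gemma 2 9B", 9.2, 84.9),
    ("Qwen2.5 Coder 14B Instruct", 14.8, 94.2),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 93.0),
]


def efficiency_frontier(rows):
    """Return the models that beat every model of equal or smaller size.

    Hypothetical helper: walks the models from smallest to largest and
    keeps one only if its score is strictly higher than anything seen so far.
    """
    frontier = []
    best_so_far = float("-inf")
    # Sort by size ascending; at equal size, put the higher score first
    # so only the best model at that size can enter the frontier.
    for name, size, score in sorted(rows, key=lambda r: (r[1], -r[2])):
        if score > best_so_far:
            frontier.append((name, size, score))
            best_so_far = score
    return frontier


for name, size, score in efficiency_frontier(models):
    print(f"{name}: {size}B, {score}%")
```

On this subset the sketch keeps the same six models as the list above. Qwen2.5 Coder 32B Instruct drops out because the smaller 14B Instruct model already scores higher.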
GSM8K: frequently asked questions
- What is the best open LLM on GSM8K?
- Qwen2.5 Coder 14B Instruct is the top open model on GSM8K, scoring 94.2%. Among all models tested — including proprietary ones — it ranks #2.
- What's the best GSM8K model you can run on a 24 GB GPU?
- Qwen2.5 Coder 14B Instruct is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 8 GB), scoring 94.2% on GSM8K.
- What's the best GSM8K model you can run on a 12 GB GPU?
- Qwen2.5 Coder 14B Instruct is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 8 GB), scoring 94.2% on GSM8K.
- Can open models match proprietary models on GSM8K?
- Not quite on GSM8K: the strongest proprietary model (DeepSeek-Coder-V2-Instruct) scores 94.5%, ahead of the best open model (Qwen2.5 Coder 14B Instruct) at 94.2% — but you can run the open one yourself.
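The "about 8 GB" figure in the GPU answers above can be roughly reproduced from the parameter count. The sketch below is a back-of-the-envelope estimate, not llmrun's method: the function names and the 10% `overhead` factor are assumptions, and it counts weights only, ignoring KV cache and activation memory.

```python
def quantized_weight_gb(params_billion, bits=4, overhead=1.1):
    """Rough memory for quantized weights, in GB.

    params_billion: parameter count in billions (from the table).
    bits: bits per weight after quantization.
    overhead: assumed fudge factor for quantization scales and runtime
              buffers; 1.1 is an illustrative guess, not a measured value.
    """
    return params_billion * bits / 8 * overhead


def best_fit(models, vram_gb):
    """Highest-scoring model whose estimated weights fit in vram_gb, or None."""
    fitting = [m for m in models if quantized_weight_gb(m[1]) <= vram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None


# (name, size in billions of parameters, GSM8K score in %) from the table above.
candidates = [
    ("Qwen2.5 Coder 7B Instruct", 7.6, 86.7),
    ("Qwen2.5 Coder 14B Instruct", 14.8, 94.2),
    ("Qwen2.5 Coder 32B Instruct", 32.8, 93.0),
]

print(round(quantized_weight_gb(14.8), 1))  # estimated GB for the 14.8B model
print(best_fit(candidates, 24)[0])
print(best_fit(candidates, 12)[0])
```

Under these assumptions the 14.8B model needs a little over 8 GB, so it fits on both a 12 GB and a 24 GB card. The 32.8B model also fits in 24 GB by this estimate, but it scores lower, so the answer stays the same for both GPU sizes.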
Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.