What is the best open LLM on FrontierMath?

Kimi K2.6 is the top open model on FrontierMath, scoring 39.0%. Among all models tested, including proprietary ones, it is listed at #15, tied at 39.0% with the two proprietary models at #13 and #14. The top model overall is GPT 5.5 Pro Pre Release (high), from OpenAI, at 52.4%.

Can open models match proprietary models on FrontierMath?

Not yet. The strongest proprietary model, GPT 5.5 Pro Pre Release (high), scores 52.4%, which is 13.4 percentage points ahead of the best open model, Kimi K2.6, at 39.0%. The open model, however, is one you can run yourself.

FrontierMath Leaderboard

Name: FrontierMath — open LLM scores
Creator: epoch

FrontierMath is a benchmark of exceptionally hard, original, research-level mathematics problems created with professional mathematicians. Even the top-ranked model listed here solves only about half of the problems (52.4%), so the benchmark still has substantial headroom at the frontier.

Source: epoch · 12 open models ranked + 89 proprietary · Data through May 2026

All models ranked on FrontierMath

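Proprietary (closed) models are tagged "proprietary" after the model name; open models show their size in billions of parameters instead. You can't run the proprietary models locally, but they show where the open field stands.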

#	Model	Score
1	GPT 5.5 Pro Pre Release (high) · proprietary	52.4%
2	GPT 5.5 Pre Release (xhigh) · proprietary	51.7%
3	GPT 5.5 Pro Pre Release (xhigh) · proprietary	51.0%
4	GPT 5.4 Pro (Mar 05, 2026, xhigh) · proprietary	50.0%
5	GPT 5.4 (Mar 05, 2026, xhigh) · proprietary	47.6%
6	Claude Opus 4.8 Max · proprietary	47.2%
7	Claude Opus 4.7 (xhigh) · proprietary	43.8%
8	Claude Opus 4.6 Max · proprietary	40.7%
9	GPT 5.2 (Dec 11, 2025, xhigh) · proprietary	40.7%
10	GPT 5.2 (Dec 11, 2025, high) · proprietary	40.3%
11	Claude Opus 4.6 (32K) · proprietary	40.0%
12	Claude Opus 4.6 (64K) · proprietary	39.7%
13	Muse Spark · proprietary	39.0%
14	Gemini 3.5 Flash (high) · proprietary	39.0%
15	Kimi K2.6 · 1058.6B	39.0%
16	Claude Opus 4.6 · proprietary	38.3%
17	Gemini 3 Pro Preview · proprietary	37.6%
18	Gemini 3.1 Pro Preview · proprietary	36.9%
19	GPT 5.2 (Dec 11, 2025, medium) · proprietary	36.9%
20	Gemini 3 Flash Preview · proprietary	35.6%
21	GLM 5.1 · 753.9B	33.5%
22	GPT 5 (Aug 07, 2025, high) · proprietary	32.4%
23	Claude Sonnet 4.6 (16K) · proprietary	32.4%
24	GPT 5.1 (Nov 13, 2025, high) · proprietary	31.0%
25	Gemini 2.5 Deep Think Webapp (Aug 01, 2025) · proprietary	29.0%
26	GPT 5.4 Mini (Mar 17, 2026, high) · proprietary	28.3%
27	Kimi K2.5 · 1058.6B	27.9%
28	GPT 5 (Aug 07, 2025, medium) · proprietary	27.2%
29	GPT 5 Mini (Aug 07, 2025, high) · proprietary	27.2%
30	GPT 5.1 (Nov 13, 2025, medium) · proprietary	26.9%
31	GPT 5.2 (Dec 11, 2025, low) · proprietary	26.6%
32	Qwen3.6 Plus · proprietary	26.2%
33	GPT 5.4 Nano (Mar 17, 2026, high) · proprietary	25.9%
34	O4 Mini (Apr 16, 2025, high) · proprietary	24.8%
35	Qwen3.6 Max Preview · proprietary	23.1%
36	DeepSeek V3.2 · 685.4B	22.1%
37	Kimi K2 Thinking · 1058.1B	21.4%
38	Qwen3.5 Plus · proprietary	21.0%
39	Claude Opus 4.5 (Nov 01, 2025, 32K) · proprietary	20.7%
40	Claude Opus 4.5 (Nov 01, 2025) · proprietary	20.7%
41	Claude Opus 4.5 (Nov 01, 2025, 16K) · proprietary	20.3%
42	GPT 5 Mini (Aug 07, 2025, medium) · proprietary	20.3%
43	Grok 4 (Jul 09) · proprietary	19.7%
44	O4 Mini (Apr 16, 2025, medium) · proprietary	19.0%
45	O3 (Apr 16, 2025, high) · proprietary	18.7%
46	GPT 5.1 (Nov 13, 2025, low) · proprietary	17.3%
47	O3 (Apr 16, 2025, medium) · proprietary	16.9%
48	GLM 5 · 753.9B	16.4%
49	Claude Sonnet 4.5 (Sep 29, 2025, 32K) · proprietary	15.2%
50	Gemini 2.5 Pro · proprietary	14.1%
51	Claude Sonnet 4.5 (Sep 29, 2025, 59K) · proprietary	13.5%
52	O3 Mini (Jan 31, 2025, high) · proprietary	12.4%
53	O4 Mini (Apr 16, 2025, low) · proprietary	10.7%
54	Gemini 2.5 Pro Preview (Jun 05) · proprietary	10.3%
55	Qwen3.6 Flash · proprietary	10.3%
56	O3 (Apr 16, 2025, low) · proprietary	9.7%
57	Claude Sonnet 4.5 (Sep 29, 2025) · proprietary	9.3%
58	O1 (Dec 17, 2024, high) · proprietary	9.3%
59	Qwen3 235B A22B Thinking 2507 · 235.1B	8.5%
60	GPT 5 Nano (Aug 07, 2025, high) · proprietary	8.3%
61	O3 Mini (Jan 31, 2025, medium) · proprietary	8.1%
62	Claude Opus 4.1 (Aug 05, 2025, 27K) · proprietary	7.2%
63	GPT 5 Nano (Aug 07, 2025, medium) · proprietary	7.2%
64	Qwen3.5 Flash · proprietary	6.2%
65	Claude Haiku 4.5 (Oct 01, 2025, 32K) · proprietary	5.9%
66	Claude Opus 4.1 (Aug 05, 2025) · proprietary	5.9%
67	Grok 3 Mini Beta (high) · proprietary	5.9%
68	GPT 4.1 (Apr 14, 2025) · proprietary	5.5%
69	Gemini 2.5 Flash · proprietary	4.8%
70	Claude Opus 4 (May 14, 2025) · proprietary	4.5%
71	GPT 4.1 Mini (Apr 14, 2025) · proprietary	4.5%
72	Claude 3.7 Sonnet (Feb 19, 2025, 16K) · proprietary	4.1%
73	Claude Haiku 4.5 (Oct 01, 2025) · proprietary	4.1%
74	Claude Opus 4 (May 14, 2025, 27K) · proprietary	4.1%
75	Claude Sonnet 4 (May 14, 2025) · proprietary	4.1%
76	GLM 4.6 · 356.8B	3.8%
77	Grok 3 Beta · proprietary	3.8%
78	Claude 3.7 Sonnet (Feb 19, 2025, 32K) · proprietary	3.5%
79	Claude 3.7 Sonnet (Feb 19, 2025, 64K) · proprietary	3.1%
80	Claude 3.7 Sonnet (Feb 19, 2025) · proprietary	3.1%
81	Grok 3 Mini Beta (low) · proprietary	2.8%
82	GLM 4.7 · 358.3B	2.4%
83	GPT 5.1 (Nov 13, 2025, none) · proprietary	2.1%
84	Claude 3.5 Sonnet (Oct 22, 2024) · proprietary	2.1%
85	DeepSeek v3 · 684.5B	1.7%
86	Gemini 2.0 Flash 001 · proprietary	1.7%
87	O1 Mini (Sep 12, 2024, medium) · proprietary	1.7%
88	Qwen Plus (Apr 28, 2025) · proprietary	1.7%
89	O1 Mini (Sep 12, 2024, high) · proprietary	1.4%
90	Claude 3.5 Sonnet (Jun 20, 2024) · proprietary	1.0%
91	GPT 4.1 Nano (Apr 14, 2025) · proprietary	1.0%
92	Qwen Max (Jan 25, 2025) · proprietary	1.0%
93	Grok 2 (Dec 12) · proprietary	0.7%
94	Llama 4 Maverick 17B 128E Instruct · 401.6B	0.7%
95	Mistral Medium 2505 · proprietary	0.4%
96	Claude 3.5 Haiku (Oct 22, 2024) · proprietary	0.3%
97	GPT 4o (Aug 06, 2024) · proprietary	0.3%
98	GPT 4o (Nov 20, 2024) · proprietary	0.3%
99	Mistral Large 2411 · proprietary	0.3%
100	Gemini 1.5 Flash 002 · proprietary	0.0%
101	Llama 4 Scout 17B 16E Instruct · 108.6B	0.0%
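
The headline comparison can be reproduced from the table rows. The following is a minimal sketch, assuming each row is tab-separated as shown (rank, model name with a "· proprietary" or "· size" tag, score); the helper `parse_row` and the variable names are illustrative and not part of any llmrun or epoch tooling. It uses only a handful of rows copied from the table.

```python
# A handful of rows copied from the table above: rank, "name · tag", score.
ROWS = [
    "1\tGPT 5.5 Pro Pre Release (high) · proprietary\t52.4%",
    "13\tMuse Spark · proprietary\t39.0%",
    "14\tGemini 3.5 Flash (high) · proprietary\t39.0%",
    "15\tKimi K2.6 · 1058.6B\t39.0%",
    "16\tClaude Opus 4.6 · proprietary\t38.3%",
    "21\tGLM 5.1 · 753.9B\t33.5%",
]


def parse_row(line: str) -> dict:
    """Split one table row into rank, name, open/closed flag, and score."""
    rank, model, score = line.split("\t")
    # The tag after the last " · " is either "proprietary" or a size like "753.9B".
    name, tag = model.rsplit(" · ", 1)
    return {
        "rank": int(rank),
        "name": name,
        "open": tag != "proprietary",
        "score": float(score.rstrip("%")),
    }


rows = [parse_row(line) for line in ROWS]

best_open = max((r for r in rows if r["open"]), key=lambda r: r["score"])
best_closed = max((r for r in rows if not r["open"]), key=lambda r: r["score"])

# Listed positions that share the best open model's score (a tie group).
tied_positions = [r["rank"] for r in rows if r["score"] == best_open["score"]]

# Gap in percentage points, rounded to the table's one-decimal precision.
gap = round(best_closed["score"] - best_open["score"], 1)

print(best_open["name"], "is listed at", best_open["rank"], "tied with", tied_positions)
print("Gap to", best_closed["name"], "is", gap, "percentage points")
```

Because three models share 39.0%, the listed position of the best open model depends on how the site breaks ties, which is why the sketch reports the whole tie group rather than a single rank.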

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
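
The efficiency frontier described above can be sketched in a few lines. This is an illustration under stated assumptions, not the site's actual implementation: it assumes the number after each open model's name in the table is its size in billions of parameters, and it treats a model as on the frontier when no model of equal or smaller size scores at least as high. The `Model` type and `efficiency_frontier` function are hypothetical names introduced for this example.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    size_b: float  # assumed: size in billions of parameters, from the table
    score: float   # FrontierMath score in percent


# The 12 open models from the table above.
OPEN_MODELS = [
    Model("Kimi K2.6", 1058.6, 39.0),
    Model("GLM 5.1", 753.9, 33.5),
    Model("Kimi K2.5", 1058.6, 27.9),
    Model("DeepSeek V3.2", 685.4, 22.1),
    Model("Kimi K2 Thinking", 1058.1, 21.4),
    Model("GLM 5", 753.9, 16.4),
    Model("Qwen3 235B A22B Thinking 2507", 235.1, 8.5),
    Model("GLM 4.6", 356.8, 3.8),
    Model("GLM 4.7", 358.3, 2.4),
    Model("DeepSeek v3", 684.5, 1.7),
    Model("Llama 4 Maverick 17B 128E Instruct", 401.6, 0.7),
    Model("Llama 4 Scout 17B 16E Instruct", 108.6, 0.0),
]


def efficiency_frontier(models: list[Model]) -> list[Model]:
    """Return models that beat every model of equal or smaller size."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first.
    for m in sorted(models, key=lambda m: (m.size_b, -m.score)):
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier


frontier = efficiency_frontier(OPEN_MODELS)
for m in frontier:
    print(f"{m.size_b:>7.1f}B  {m.score:>5.1f}%  {m.name}")
```

Under these assumptions the frontier runs from the smallest model through Qwen3 235B A22B Thinking 2507, DeepSeek V3.2, and GLM 5.1 up to Kimi K2.6; every other open model is matched or beaten by something no larger than itself.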

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "About benchmarks" page for what it measures.