What is the best open LLM on Humanity's Last Exam?

Kimi K2.5 is the top open model on Humanity's Last Exam, scoring 24.4%. Among all 46 models tested, including proprietary ones, it ranks #13. The top model overall is Gemini 3.1 Pro Preview (Google DeepMind) at 46.4%.

Can open models match proprietary models on Humanity's Last Exam?

Not quite. The strongest proprietary model (Gemini 3.1 Pro Preview) scores 46.4%, well ahead of the best open model (Kimi K2.5) at 24.4%, a gap of 22.0 percentage points. The open model, however, is one you can run yourself.

Humanity's Last Exam Leaderboard

Name: Humanity's Last Exam — open LLM scores
Creator: epoch

Humanity's Last Exam (HLE) is a set of extremely difficult, expert-written questions across many fields, designed so that even frontier models score low. It is built to stay hard as models improve, so it keeps measuring the frontier of model knowledge rather than saturating.

Source: epoch · 4 open models ranked + 42 proprietary · Data through Apr 2026

All models ranked on Humanity's Last Exam

Proprietary (closed) models are labeled "proprietary"; open models show a size figure instead. You can't run the proprietary models locally, but they show where the open field stands.

#	Model	Score
1	Gemini 3.1 Pro Preview · proprietary	46.4%
2	GPT 5.4 Pro (Mar 05, 2026, unspecified) · proprietary	44.3%
3	Muse Spark · proprietary	40.6%
4	Gemini 3 Pro Preview · proprietary	37.5%
5	GPT 5.4 (Mar 05, 2026, xhigh) · proprietary	36.2%
6	Claude Opus 4.7 (unspecified) · proprietary	36.2%
7	Claude Opus 4.6 Max · proprietary	34.4%
8	GPT 5 Pro (Oct 06, 2025, unspecified) · proprietary	31.6%
9	GPT 5.2 (Dec 11, 2025, unspecified) · proprietary	27.8%
10	GPT 5 (Aug 07, 2025, high) · proprietary	25.3%
11	GPT 5 (Aug 07, 2025, unspecified) · proprietary	25.3%
12	Claude Opus 4.5 (Nov 01, 2025, unspecified) · proprietary	25.2%
13	Kimi K2.5 · 1058.6B	24.4%
14	GPT 5.1 (Nov 13, 2025, unspecified) · proprietary	23.7%
15	Gemini 2.5 Pro Preview (Jun 05) · proprietary	21.6%
16	O3 (Apr 16, 2025, high) · proprietary	20.3%
17	GPT 5 Mini (Aug 07, 2025, unspecified) · proprietary	19.4%
18	O3 (Apr 16, 2025, medium) · proprietary	19.2%
19	Claude Opus 4.6 · proprietary	19.0%
20	Gemini 2.5 Pro Exp (Mar 25) · proprietary	18.2%
21	O4 Mini (Apr 16, 2025, high) · proprietary	18.1%
22	Gemini 2.5 Pro Preview (May 06) · proprietary	17.8%
23	O4 Mini (Apr 16, 2025, medium) · proprietary	14.3%
24	Claude Sonnet 4.5 (Sep 29, 2025, unspecified) · proprietary	13.7%
25	Gemini 2.5 Flash Preview (Apr 17) · proprietary	12.1%
26	Claude Opus 4.1 (Aug 05, 2025, unspecified) · proprietary	11.5%
27	Gemini 2.5 Flash Preview (May 20) · proprietary	11.0%
28	Claude Opus 4 (May 14, 2025, unspecified) · proprietary	10.7%
29	Gemini 3.1 Flash Lite · proprietary	8.6%
30	GLM 4.5 · 358.3B	8.3%
31	GLM 4.5 Air · 110.5B	8.1%
32	O1 Pro (Mar 19, 2025) · proprietary	8.1%
33	Claude 3.7 Sonnet (Feb 19, 2025, unspecified) · proprietary	8.0%
34	O1 (Dec 17, 2024, unspecified) · proprietary	8.0%
35	Claude Sonnet 4 (May 14, 2025, unspecified) · proprietary	7.8%
36	GPT 5.1 (Nov 13, 2025, none) · proprietary	6.8%
37	Gemini 2.0 Flash Thinking Exp (Jan 21) · proprietary	6.6%
38	Llama 4 Maverick 17B 128E Instruct · 401.6B	5.7%
39	GPT 4.5 Preview (Feb 27, 2025) · proprietary	5.4%
40	GPT 4.1 (Apr 14, 2025) · proprietary	5.4%
41	Gemini 1.5 Pro 002 · proprietary	4.6%
42	Mistral Medium 2505 · proprietary	4.5%
43	Amazon.nova Pro v1:0 · proprietary	4.4%
44	Claude 3.5 Sonnet (Oct 22, 2024) · proprietary	4.1%
45	Amazon.nova Lite v1:0 · proprietary	3.6%
46	GPT 4o (Nov 20, 2024) · proprietary	2.7%
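
The open-versus-proprietary comparison can be recomputed from the table rows. The following is a minimal sketch, assuming the rows are available as tab-separated text in the form shown above (rank, "name · tag", score) and that any tag other than "proprietary" marks an open model. The `Row` type, `parse_row` helper, and `SAMPLE` subset are hypothetical names introduced for illustration only; they are not part of llmrun or the source.

```python
from typing import NamedTuple


class Row(NamedTuple):
    rank: int
    model: str
    is_open: bool
    score: float  # percentage, e.g. 24.4


def parse_row(line: str) -> Row:
    """Parse one 'rank<TAB>name · tag<TAB>score%' line from the table."""
    rank, label, score = line.split("\t")
    # The tag after the last ' · ' is either 'proprietary' or a size such as '1058.6B'.
    name, _, tag = label.rpartition(" · ")
    return Row(int(rank), name, tag != "proprietary", float(score.rstrip("%")))


# A subset of the table above: the top model plus all four open models.
SAMPLE = "\n".join([
    "1\tGemini 3.1 Pro Preview · proprietary\t46.4%",
    "13\tKimi K2.5 · 1058.6B\t24.4%",
    "30\tGLM 4.5 · 358.3B\t8.3%",
    "31\tGLM 4.5 Air · 110.5B\t8.1%",
    "38\tLlama 4 Maverick 17B 128E Instruct · 401.6B\t5.7%",
])

rows = [parse_row(line) for line in SAMPLE.splitlines()]
open_rows = [r for r in rows if r.is_open]
best_open = max(open_rows, key=lambda r: r.score)
best_closed = max((r for r in rows if not r.is_open), key=lambda r: r.score)

# Gap in percentage points, rounded to match the table's one-decimal precision.
gap = round(best_closed.score - best_open.score, 1)

print(f"{best_open.model} (#{best_open.rank}) trails {best_closed.model} by {gap} points")
```

Feeding all 46 rows through the same helper would reproduce the full ranking; the subset here is only enough to check the headline figures.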

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "About benchmarks" page for what it measures.