What is the best open LLM on LiveBench Reasoning?

Kimi K2.7 Code is the top open model on LiveBench Reasoning, scoring 82.8. Among all models tested — including proprietary ones — it ranks #22. The top model overall is GPT 5.6 Sol Max (OpenAI) at 91.7.

What's the best LiveBench Reasoning model you can run on a 24 GB GPU?

Qwen3.6 27B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 15 GB), scoring 70.3 on LiveBench Reasoning.
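
The "about 15 GB" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an estimate: it assumes roughly 4.3 effective bits per weight (4-bit weights plus quantization scales, an assumption rather than a figure from the leaderboard), counts weight storage only, and ignores KV cache and activation memory. The function name is hypothetical.

```python
def quantized_weight_gb(params_billions: float, bits_per_weight: float = 4.3) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes) for a quantized model.

    bits_per_weight=4.3 is an assumed effective rate for 4-bit quantization
    including per-block scales; real formats vary.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Qwen3.6 27B is listed with 27.8B parameters in the leaderboard table.
qwen_gb = quantized_weight_gb(27.8)
print(f"Qwen3.6 27B at ~4.3 bits/weight: {qwen_gb:.1f} GB")
print(f"Fits in 24 GB: {qwen_gb < 24}")
```

By the same estimate, the next-smallest open model on the list, DeepSeek V4 Flash (158.1B), would need roughly 85 GB for its weights alone, which is consistent with Qwen3.6 27B being the only ranked open model that fits on a 24 GB card.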

Can open models match proprietary models on LiveBench Reasoning?

Not quite. The strongest proprietary model (GPT 5.6 Sol Max) scores 91.7, which is 8.9 points ahead of the best open model (Kimi K2.7 Code) at 82.8. The open model, however, is one you can run yourself.

LiveBench Reasoning Leaderboard

Name: LiveBench Reasoning — open LLM scores
Creator: livebench

LiveBench Reasoning measures logical, multi-step reasoning using contamination-free questions that are refreshed regularly, so models cannot have trained on the test set.

Source: livebench · 8 open models ranked + 29 proprietary · Data through Jun 2026

All models ranked on LiveBench Reasoning

Proprietary (closed) models are marked "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.

#	Model	Score
1	GPT 5.6 Sol Max · proprietary	91.7
2	Kimi K3 · proprietary	90.7
3	GPT 5.6 Terra Max · proprietary	90.6
4	GPT 5.6 Sol · proprietary	90.2
5	Claude Opus 4.8 · proprietary	89.7
6	Claude Fable 5 Max · proprietary	89.7
7	GPT 5.5 · proprietary	89.7
8	Claude Sonnet 5 · proprietary	88.7
9	Claude Opus 4.6 · proprietary	88.7
10	GPT 5.4 · proprietary	88.1
11	Muse Spark 1.1 · proprietary	87.7
12	Claude Fable 5 · proprietary	87.7
13	Claude Opus 4.7 · proprietary	87.2
14	Grok 4.5 · proprietary	87.2
15	GPT 5.6 Luna Max · proprietary	85.6
16	GPT 5.6 Terra · proprietary	84.9
17	Claude Sonnet 4.6 · proprietary	84.8
18	GPT 5.6 Luna · proprietary	84.7
19	Gemini 3.1 Pro · proprietary	84.0
20	Qwen3.7 Max · proprietary	83.3
21	GPT 5.2 · proprietary	83.2
22	Kimi K2.7 Code · 1058.6B	82.8
23	DeepSeek V4 Pro · 861.6B	82.7
24	Gemini 3.5 Flash · proprietary	82.0
25	GPT 5.4 Nano · proprietary	81.1
26	Claude Opus 4.5 · proprietary	80.1
27	Kimi K2.6 · 1058.6B	79.4
28	GLM 5.2 · 753.3B	78.6
29	Inkling · 952.4B	78.3
30	GPT 5.2 Codex · proprietary	77.7
31	Grok Build 0.1 · proprietary	76.4
32	Qwen3.6 Plus · proprietary	75.8
33	MiniMax M3 · 427.0B	74.5
34	GPT 5.4 Mini · proprietary	71.3
35	Grok 4.3 · proprietary	70.8
36	DeepSeek V4 Flash · 158.1B	70.6
37	Qwen3.6 27B · 27.8B	70.3

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
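
The frontier described above can be recomputed from the table. This is a minimal sketch, assuming the frontier covers only the eight open models (the proprietary ones have no published size) and that a model is on the frontier when it scores strictly higher than every model of equal or smaller size. The site's exact rule is not stated, so treat this as one reasonable reading; the function and variable names are hypothetical.

```python
# (name, parameters in billions, LiveBench Reasoning score), copied from the table.
OPEN_MODELS = [
    ("Kimi K2.7 Code", 1058.6, 82.8),
    ("DeepSeek V4 Pro", 861.6, 82.7),
    ("Kimi K2.6", 1058.6, 79.4),
    ("GLM 5.2", 753.3, 78.6),
    ("Inkling", 952.4, 78.3),
    ("MiniMax M3", 427.0, 74.5),
    ("DeepSeek V4 Flash", 158.1, 70.6),
    ("Qwen3.6 27B", 27.8, 70.3),
]


def efficiency_frontier(models):
    """Return the models that outscore every model of equal or smaller size."""
    frontier = []
    best_score = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, size, score))
            best_score = score
    return frontier


frontier = efficiency_frontier(OPEN_MODELS)
for name, size, score in frontier:
    print(f"{size:>7.1f}B  {score:.1f}  {name}")
```

Under these assumptions, Inkling and Kimi K2.6 fall below the frontier because a model of equal or smaller size already scores higher (DeepSeek V4 Pro and Kimi K2.7 Code, respectively).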

Scores aggregated from livebench. llmrun does not run this benchmark; see the source for methodology, or the "about benchmarks" page for what it measures.