What is the best open LLM on Terminal-Bench?

GLM 5 is the top open model on Terminal-Bench, scoring 52.4%. Among all models tested, including proprietary ones, it ranks #22. The top model overall is Anthropic's Claude Opus 4.7 (unspecified) at 90.2%.

What's the best Terminal-Bench model you can run on a 24 GB GPU?

Qwen3.6 35B A3B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 20 GB), scoring 24.6% on Terminal-Bench.

What's the best Terminal-Bench model you can run on a 12 GB GPU?

Qwen3.5 9B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 5 GB), scoring 9.2% on Terminal-Bench.

Can open models match proprietary models on Terminal-Bench?

Not yet. The strongest proprietary model, Claude Opus 4.7 (unspecified), scores 90.2%, which is 37.8 points ahead of the best open model, GLM 5, at 52.4%. The open model, however, is one you can run yourself.
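
The 4-bit sizes quoted in the answers above (about 20 GB and about 5 GB) can be sanity-checked with a rough estimate. The sketch below is not how this page computes its figures, which is not stated. It assumes the weights take params × bits / 8 bytes and adds a hypothetical 10% overhead factor for runtime buffers. The function names and the overhead value are illustrative assumptions; the parameter counts (36.0B and 9.7B) come from the leaderboard table.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_vram(
    params_billion: float,
    vram_gb: float,
    bits_per_weight: float = 4.0,
    overhead: float = 1.1,  # assumed 10% headroom; not a figure from the source
) -> bool:
    """Return True if the estimated footprint (weights x overhead) fits in vram_gb."""
    return weight_footprint_gb(params_billion, bits_per_weight) * overhead <= vram_gb


if __name__ == "__main__":
    # Parameter counts as listed in the leaderboard table.
    for name, params_b in [("Qwen3.6 35B A3B", 36.0), ("Qwen3.5 9B", 9.7)]:
        size = weight_footprint_gb(params_b)
        print(f"{name}: ~{size:.2f} GB of weights at 4-bit; "
              f"fits 24 GB: {fits_in_vram(params_b, 24)}, "
              f"fits 12 GB: {fits_in_vram(params_b, 12)}")
```

Under these assumptions the weights alone come to 18 GB for Qwen3.6 35B A3B and about 4.85 GB for Qwen3.5 9B, in line with the rounded figures above. Real memory use also depends on context length and the quantization format.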

Terminal-Bench Leaderboard

Name: Terminal-Bench — open LLM scores
Creator: epoch

Terminal-Bench measures whether a model can complete real, end-to-end tasks in a command-line environment — running commands, editing files, and chaining steps — making it an agentic test of practical software skill.

Source: epoch · 16 open models ranked + 41 proprietary · Data through Apr 2026

All models ranked on Terminal-Bench

Proprietary (closed) models are labeled "proprietary" in place of a parameter count. You can't run them locally, but they show where the open field stands.

#	Model · size (parameters)	Score
1	Claude Opus 4.7 (unspecified) · proprietary	90.2%
2	GPT 5.5 (unspecified) · proprietary	84.7%
3	GPT 5.4 (Mar 05, 2026, unspecified) · proprietary	81.8%
4	Gemini 3.1 Pro Preview · proprietary	80.2%
5	Claude Opus 4.6 (unspecified) · proprietary	79.8%
6	GPT 5.3 Codex · proprietary	78.4%
7	Claude Opus 4.6 · proprietary	69.9%
8	Gemini 3 Pro Preview · proprietary	69.4%
9	GPT 5.2 Codex · proprietary	66.5%
10	GPT 5.2 (Dec 11, 2025, medium) · proprietary	64.9%
11	GPT 5.2 (Dec 11, 2025, unspecified) · proprietary	64.9%
12	Gemini 3 Flash Preview · proprietary	64.3%
13	Claude Opus 4.5 (Nov 01, 2025, unspecified) · proprietary	63.1%
14	Claude Opus 4.5 (Nov 01, 2025) · proprietary	63.1%
15	GPT 5.1 Codex Mini · proprietary	61.6%
16	GPT 5.1 Codex Max · proprietary	60.4%
17	Claude Opus 4.5 (Nov 01, 2025, 128K) · proprietary	59.1%
18	GPT 5.1 Codex · proprietary	57.8%
19	Grok 4.20 · proprietary	57.3%
20	Claude Sonnet 4.6 · proprietary	53.4%
21	Claude Sonnet 4.6 (unspecified) · proprietary	53.4%
22	GLM 5 · 753.9B	52.4%
23	GPT 5 (Aug 07, 2025, medium) · proprietary	49.6%
24	GPT 5 (Aug 07, 2025, unspecified) · proprietary	49.6%
25	GPT 5.1 (Nov 13, 2025, medium) · proprietary	47.6%
26	GPT 5.1 (Nov 13, 2025, unspecified) · proprietary	47.6%
27	Claude Sonnet 4.5 (Sep 29, 2025, unspecified) · proprietary	46.5%
28	MiniMax M2.7 · 228.7B	45.1%
29	GPT 5 Codex · proprietary	44.3%
30	Kimi K2.5 · 1058.6B	43.2%
31	Claude Sonnet 4.5 (Sep 29, 2025) · proprietary	42.8%
32	MiniMax M2.5 · 228.7B	42.7%
33	DeepSeek V3.2 · 685.4B	39.6%
34	Claude Opus 4.1 (Aug 05, 2025, unspecified) · proprietary	38.0%
35	Claude Opus 4.1 (Aug 05, 2025) · proprietary	38.0%
36	MiniMax M2.1 · 228.7B	36.6%
37	Kimi K2 Thinking · 1058.1B	35.7%
38	Claude Haiku 4.5 (Oct 01, 2025, unspecified) · proprietary	35.5%
39	GPT 5 Mini (Aug 07, 2025, unspecified) · proprietary	34.8%
40	GLM 4.7 · 358.3B	33.4%
41	Gemini 2.5 Pro · proprietary	32.6%
42	GPT 5 Mini (Aug 07, 2025, medium) · proprietary	31.9%
43	MiniMax M2 · 228.7B	30.0%
44	Claude Haiku 4.5 (Oct 01, 2025) · proprietary	29.8%
45	Kimi K2 Instruct · 1026.5B	27.8%
46	Grok 4 (Jul 09) · proprietary	27.2%
47	Qwen3 Coder 480B A35B Instruct · 480.2B	27.2%
48	Grok Code Fast 1 · proprietary	25.8%
49	Qwen3.6 35B A3B · 36.0B	24.6%
50	GLM 4.6 · 356.8B	24.5%
51	GPT 5 Nano (Aug 07, 2025, unspecified) · proprietary	21.8%
52	GPT OSS 120B · 120.4B	18.7%
53	Gemini 2.5 Flash · proprietary	17.1%
54	Gemini 2.5 Flash Preview (Sep 2025) · proprietary	17.1%
55	GPT 5 Nano (Aug 07, 2025, medium) · proprietary	11.5%
56	Qwen3.5 9B · 9.7B	9.2%
57	GPT OSS 20B · 21.5B	3.4%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
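
The frontier described above can be reproduced from the open-model rows of the leaderboard table. This is a minimal sketch, assuming "size" means the parameter count listed next to each model and that a model is on the frontier when it scores higher than every model of the same size or smaller. The page's own chart logic is not shown here, and the names `Model` and `efficiency_frontier` are illustrative.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    size_b: float  # parameter count in billions, as listed in the table
    score: float   # Terminal-Bench score in percent


# Open models copied from the leaderboard table.
OPEN_MODELS = [
    Model("GLM 5", 753.9, 52.4),
    Model("MiniMax M2.7", 228.7, 45.1),
    Model("Kimi K2.5", 1058.6, 43.2),
    Model("MiniMax M2.5", 228.7, 42.7),
    Model("DeepSeek V3.2", 685.4, 39.6),
    Model("MiniMax M2.1", 228.7, 36.6),
    Model("Kimi K2 Thinking", 1058.1, 35.7),
    Model("GLM 4.7", 358.3, 33.4),
    Model("MiniMax M2", 228.7, 30.0),
    Model("Kimi K2 Instruct", 1026.5, 27.8),
    Model("Qwen3 Coder 480B A35B Instruct", 480.2, 27.2),
    Model("Qwen3.6 35B A3B", 36.0, 24.6),
    Model("GLM 4.6", 356.8, 24.5),
    Model("GPT OSS 120B", 120.4, 18.7),
    Model("Qwen3.5 9B", 9.7, 9.2),
    Model("GPT OSS 20B", 21.5, 3.4),
]


def efficiency_frontier(models: List[Model]) -> List[Model]:
    """Return the models that beat every model of equal or smaller size."""
    frontier = []
    best_score = float("-inf")
    # Walk from smallest to largest; at equal size, visit the best score first.
    for model in sorted(models, key=lambda m: (m.size_b, -m.score)):
        if model.score > best_score:
            frontier.append(model)
            best_score = model.score
    return frontier


if __name__ == "__main__":
    for m in efficiency_frontier(OPEN_MODELS):
        print(f"{m.name}: {m.size_b}B, {m.score}%")
```

With the table's data this yields Qwen3.5 9B, Qwen3.6 35B A3B, MiniMax M2.7, and GLM 5, smallest to largest.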

Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or "about benchmarks" for what it measures.