Question 1

What is the best open LLM on Terminal-Bench?

Accepted Answer

GLM 5 is the top open model on Terminal-Bench, scoring 52.4%. Among all models tested, including proprietary ones, it ranks #22. The top model overall is Claude Opus 4.7 (unspecified) from Anthropic, at 90.2%.

Question 2

What's the best Terminal-Bench model you can run on a 24 GB GPU?

Accepted Answer

Qwen3.6 35B A3B is the highest-scoring open model that fits in 24 GB: at 4-bit quantization it needs about 20 GB, and it scores 24.6% on Terminal-Bench.

Question 3

What's the best Terminal-Bench model you can run on a 12 GB GPU?

Accepted Answer

Qwen3.5 9B is the highest-scoring open model that fits in 12 GB: at 4-bit quantization it needs about 5 GB, and it scores 9.2% on Terminal-Bench.
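
The "about 20 GB" and "about 5 GB" figures in the 24 GB and 12 GB answers can be roughly reproduced from the parameter counts. The sketch below is a back-of-the-envelope estimate, not the method the leaderboard uses: the function names and the 10% overhead factor (for quantization scales and runtime buffers) are assumptions made for illustration, and real memory use also depends on context length and the inference runtime.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Size of the quantized weights alone, in GB (1 GB = 1e9 bytes).

    params_billions * 1e9 weights * bits_per_weight / 8 bytes each,
    divided by 1e9 bytes per GB, simplifies to the expression below.
    """
    return params_billions * bits_per_weight / 8.0


def fits(params_billions: float, vram_gb: float,
         bits_per_weight: float = 4.0, overhead: float = 1.1) -> bool:
    """True if the weights plus an ASSUMED 10% overhead fit in vram_gb."""
    return weight_size_gb(params_billions, bits_per_weight) * overhead <= vram_gb


if __name__ == "__main__":
    # Parameter counts are taken from the leaderboard table.
    for name, params_b, vram in [("Qwen3.6 35B A3B", 36.0, 24),
                                 ("Qwen3.5 9B", 9.7, 12)]:
        size = weight_size_gb(params_b)
        print(f"{name}: ~{size:.2f} GB of 4-bit weights, "
              f"fits in {vram} GB: {fits(params_b, vram)}")
```

Under these assumptions the raw 4-bit weights come to 18.0 GB for Qwen3.6 35B A3B and 4.85 GB for Qwen3.5 9B, which is consistent with the rounded figures quoted above once some overhead is added.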

Question 4

Can open models match proprietary models on Terminal-Bench?

Accepted Answer

Not yet. The strongest proprietary model, Claude Opus 4.7 (unspecified), scores 90.2%, while the best open model, GLM 5, scores 52.4%, a gap of 37.8 percentage points. The open model's advantage is that you can run it yourself.

Rank (open / overall)	Model · Parameters	Score
1 / 22	GLM 5 · 753.9B	52.4%
2 / 28	MiniMax M2.7 · 228.7B	45.1%
3 / 30	Kimi K2.5 · 1058.6B	43.2%
4 / 32	MiniMax M2.5 · 228.7B	42.7%
5 / 33	DeepSeek V3.2 · 685.4B	39.6%
6 / 36	MiniMax M2.1 · 228.7B	36.6%
7 / 37	Kimi K2 Thinking · 1058.1B	35.7%
8 / 40	GLM 4.7 · 358.3B	33.4%
9 / 43	MiniMax M2 · 228.7B	30.0%
10 / 45	Kimi K2 Instruct · 1026.5B	27.8%
11 / 47	Qwen3 Coder 480B A35B Instruct · 480.2B	27.2%
12 / 49	Qwen3.6 35B A3B · 36.0B	24.6%
13 / 50	GLM 4.6 · 356.8B	24.5%
14 / 52	GPT OSS 120B · 120.4B	18.7%
15 / 56	Qwen3.5 9B · 9.7B	9.2%
16 / 57	GPT OSS 20B · 21.5B	3.4%
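
The GPU-specific answers above can be re-derived from this table by keeping only the models whose 4-bit weights fit a given memory budget and taking the highest score. The following is a minimal sketch under a stated assumption: it counts only the raw weights (parameters × 4 bits) and ignores runtime overhead, so it is an optimistic filter. The `MODELS` list and `best_fit` helper are illustrative names, not part of any leaderboard API; the numbers are copied from the table.

```python
# (model name, parameters in billions, Terminal-Bench score in %)
MODELS = [
    ("GLM 5", 753.9, 52.4),
    ("MiniMax M2.7", 228.7, 45.1),
    ("Kimi K2.5", 1058.6, 43.2),
    ("MiniMax M2.5", 228.7, 42.7),
    ("DeepSeek V3.2", 685.4, 39.6),
    ("MiniMax M2.1", 228.7, 36.6),
    ("Kimi K2 Thinking", 1058.1, 35.7),
    ("GLM 4.7", 358.3, 33.4),
    ("MiniMax M2", 228.7, 30.0),
    ("Kimi K2 Instruct", 1026.5, 27.8),
    ("Qwen3 Coder 480B A35B Instruct", 480.2, 27.2),
    ("Qwen3.6 35B A3B", 36.0, 24.6),
    ("GLM 4.6", 356.8, 24.5),
    ("GPT OSS 120B", 120.4, 18.7),
    ("Qwen3.5 9B", 9.7, 9.2),
    ("GPT OSS 20B", 21.5, 3.4),
]


def best_fit(vram_gb: float, bits_per_weight: float = 4.0):
    """Return the highest-scoring model whose quantized weights fit in vram_gb.

    ASSUMPTION: only raw weight size is counted (params * bits / 8 bytes);
    KV cache and runtime overhead are ignored. Returns None if nothing fits.
    """
    candidates = [
        m for m in MODELS
        if m[1] * bits_per_weight / 8.0 <= vram_gb
    ]
    return max(candidates, key=lambda m: m[2], default=None)


if __name__ == "__main__":
    for budget in (24, 12):
        print(budget, "GB:", best_fit(budget))
```

With these inputs the filter selects Qwen3.6 35B A3B for a 24 GB budget and Qwen3.5 9B for a 12 GB budget, matching the answers given earlier. GPT OSS 20B also fits in 12 GB by this estimate but scores lower.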

Terminal-Bench Leaderboard

Open models ranked on Terminal-Bench

[Figure: Terminal-Bench score vs model size]

Terminal-Bench: frequently asked questions