Coding

Terminal-Bench Leaderboard

Terminal-Bench measures whether a model can complete real, end-to-end tasks in a command-line environment — running commands, editing files, and chaining steps — making it an agentic test of practical software skill.

Source: epoch · 11 open models ranked + 46 proprietary · Data through Apr 2026

All models ranked on Terminal-Bench

Proprietary / closed models are labeled "proprietary". You can't run them locally, but they show where the open field stands.

# · Model · Size · Score
1 · gpt-5.4-2026-03-05_unknown · proprietary · 81.8%
2 · claude-opus-4-7_unknown · proprietary · 80.2%
3 · claude-opus-4-6 · proprietary · 69.9%
4 · gpt-5.2-codex · proprietary · 66.5%
5 · gpt-5.5_unknown · proprietary · 66.1%
6 · gpt-5.3-codex · proprietary · 64.7%
7 · claude-opus-4-5-20251101 · proprietary · 63.1%
8 · gpt-5.2-2025-12-11_medium · proprietary · 62.9%
9 · gemini-3.1-pro-preview · proprietary · 61.4%
10 · gpt-5.1-codex-max · proprietary · 60.4%
11 · claude-opus-4-5-20251101_128K · proprietary · 59.1%
12 · claude-opus-4-6_unknown · proprietary · 58.0%
13 · grok-4-20 · proprietary · 57.3%
14 · gemini-3-pro-preview · proprietary · 54.2%
15 · gpt-5.2-2025-12-11_unknown · proprietary · 54.0%
16 · claude-sonnet-4-6_unknown · proprietary · 53.4%
17 · GLM 5 · 753.9B · 52.4%
18 · claude-opus-4-5-20251101_unknown · proprietary · 51.7%
19 · gemini-3-flash-preview · proprietary · 51.0%
20 · gpt-5.1-2025-11-13_medium · proprietary · 47.6%
21 · gpt-5.1-2025-11-13_unknown · proprietary · 47.6%
22 · kimi-k2.5 · proprietary · 43.2%
23 · gpt-5.1-codex-mini · proprietary · 43.1%
24 · MiniMax M2.7 · 228.7B · 42.9%
25 · MiniMax M2.5 · 228.7B · 42.2%
26 · gpt-5-codex · proprietary · 41.3%
27 · claude-sonnet-4-5-20250929 · proprietary · 40.1%
28 · claude-sonnet-4-5-20250929_unknown · proprietary · 40.1%
29 · deepseek/deepseek-v3.2 · proprietary · 39.6%
30 · gpt-5.1-codex · proprietary · 36.9%
31 · Kimi K2 Thinking · 1058.1B · 35.7%
32 · claude-opus-4-1-20250805 · proprietary · 34.8%
33 · claude-opus-4-1-20250805_unknown · proprietary · 34.8%
34 · gpt-5-2025-08-07_medium · proprietary · 33.9%
35 · gpt-5-2025-08-07_unknown · proprietary · 33.9%
36 · GLM 4.7 · 358.3B · 33.3%
37 · MiniMax-M2 · proprietary · 30.0%
38 · MiniMax M2.1 · 228.7B · 29.2%
39 · Kimi K2 Instruct · 1026.5B · 26.7%
40 · qwen3.6-35b-a3b · proprietary · 24.6%
41 · GLM 4.6 · 356.8B · 24.5%
42 · Qwen3 Coder 480B A35B Instruct · 480.2B · 23.9%
43 · grok-4-0709 · proprietary · 23.1%
44 · gpt-5-mini-2025-08-07_medium · proprietary · 22.2%
45 · gpt-5-mini-2025-08-07_unknown · proprietary · 22.2%
46 · gemini-2.5-pro · proprietary · 16.4%
47 · gemini-2.5-flash · proprietary · 15.4%
48 · gemini-2.5-flash-preview-09-2025 · proprietary · 15.4%
49 · GPT OSS 120B · 120.4B · 14.2%
50 · gpt-oss-120b_unknown · proprietary · 14.2%
51 · grok-code-fast-1 · proprietary · 14.2%
52 · claude-haiku-4-5-20251001 · proprietary · 13.9%
53 · claude-haiku-4-5-20251001_unknown · proprietary · 13.9%
54 · gpt-5-nano-2025-08-07_medium · proprietary · 7.0%
55 · gpt-5-nano-2025-08-07_unknown · proprietary · 7.0%
56 · GPT OSS 20B · 21.5B · 3.1%
57 · gpt-oss-20b_unknown · proprietary · 3.1%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: Terminal-Bench score (3.1% to 52.4%) against model size (log scale, ticks at 100B and 1T) for the 11 open models, from GPT OSS 20B (22B, 3.1%) to GLM 5 (754B, 52.4%).]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • GPT OSS 20B, 22B, score 3.1% — on the efficiency frontier (best score at its size or smaller).
  • GPT OSS 120B, 120B, score 14.2% — on the efficiency frontier (best score at its size or smaller).
  • MiniMax M2.7, 229B, score 42.9% — on the efficiency frontier (best score at its size or smaller).
  • GLM 5, 754B, score 52.4% — on the efficiency frontier (best score at its size or smaller).
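The frontier rule described above ("best score at its size or smaller") can be sketched in a few lines. This is a minimal illustration, not llmrun's actual code: the function name `efficiency_frontier` and the tuple layout are assumptions, and the sizes and scores are copied from the open-model rows of the leaderboard.

```python
# Open models from the leaderboard: (name, parameters in billions, score in %).
OPEN_MODELS = [
    ("GLM 5", 753.9, 52.4),
    ("MiniMax M2.7", 228.7, 42.9),
    ("MiniMax M2.5", 228.7, 42.2),
    ("Kimi K2 Thinking", 1058.1, 35.7),
    ("GLM 4.7", 358.3, 33.3),
    ("MiniMax M2.1", 228.7, 29.2),
    ("Kimi K2 Instruct", 1026.5, 26.7),
    ("GLM 4.6", 356.8, 24.5),
    ("Qwen3 Coder 480B A35B Instruct", 480.2, 23.9),
    ("GPT OSS 120B", 120.4, 14.2),
    ("GPT OSS 20B", 21.5, 3.1),
]


def efficiency_frontier(models):
    """Return names of models that beat every model of equal or smaller size.

    Hypothetical helper: a model is on the frontier when no model at its
    size or smaller has a score at least as high.
    """
    # Sort by size ascending; for equal sizes, put the higher score first so
    # only the best model at that size can join the frontier.
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier = []
    best_score = float("-inf")
    for name, _size, score in ordered:
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


if __name__ == "__main__":
    print(efficiency_frontier(OPEN_MODELS))
```

Run on the rows above, the sketch returns the same four models listed as frontier points: GPT OSS 20B, GPT OSS 120B, MiniMax M2.7, and GLM 5.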

Terminal-Bench: frequently asked questions

What is the best open LLM on Terminal-Bench?
GLM 5 is the top open model on Terminal-Bench, scoring 52.4%. Among all models tested — including proprietary ones — it ranks #17.
What's the best Terminal-Bench model you can run on a 24 GB GPU?
GPT OSS 20B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 12 GB), scoring 3.1% on Terminal-Bench.
What's the best Terminal-Bench model you can run on a 12 GB GPU?
GPT OSS 20B is also the highest-scoring open model that fits in 12 GB, though only just: at 4-bit quantization it needs about 12 GB. It scores 3.1% on Terminal-Bench.
Can open models match proprietary models on Terminal-Bench?
Not yet. The strongest proprietary model (gpt-5.4-2026-03-05_unknown) scores 81.8%, 29.4 points ahead of the best open model (GLM 5) at 52.4%, but you can run the open one yourself.
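The GPU-fit answers above depend on how much memory a quantized model needs. A rough sketch of that arithmetic follows, under stated assumptions: weights take `bits / 8` bytes per parameter, and the `overhead` multiplier for runtime memory is a hypothetical knob. The page does not say how its "about 12 GB" figure is computed, so the weights-only number here (10.75 GB for GPT OSS 20B) is a lower bound and not a reproduction of it.

```python
# Open models from the leaderboard: (name, parameters in billions, score in %).
OPEN_MODELS = [
    ("GLM 5", 753.9, 52.4),
    ("MiniMax M2.7", 228.7, 42.9),
    ("MiniMax M2.5", 228.7, 42.2),
    ("Kimi K2 Thinking", 1058.1, 35.7),
    ("GLM 4.7", 358.3, 33.3),
    ("MiniMax M2.1", 228.7, 29.2),
    ("Kimi K2 Instruct", 1026.5, 26.7),
    ("GLM 4.6", 356.8, 24.5),
    ("Qwen3 Coder 480B A35B Instruct", 480.2, 23.9),
    ("GPT OSS 120B", 120.4, 14.2),
    ("GPT OSS 20B", 21.5, 3.1),
]


def quantized_weight_gb(params_billion, bits=4):
    """Weights-only memory estimate in GB (1 GB taken as 1e9 bytes)."""
    # Each parameter takes bits / 8 bytes; billions of parameters map
    # directly to gigabytes.
    return params_billion * bits / 8


def best_fit(models, vram_gb, bits=4, overhead=1.0):
    """Highest-scoring model whose estimated footprint fits in vram_gb.

    `overhead` is an assumed multiplier for KV cache and runtime buffers;
    1.0 means weights only. Returns None when nothing fits.
    """
    fitting = [
        m for m in models
        if quantized_weight_gb(m[1], bits) * overhead <= vram_gb
    ]
    return max(fitting, key=lambda m: m[2], default=None)


if __name__ == "__main__":
    for vram in (12, 24):
        print(vram, "GB:", best_fit(OPEN_MODELS, vram))
```

Under these assumptions GPT OSS 20B is the only open model on the list that fits in either 12 GB or 24 GB. The next smallest, GPT OSS 120B, needs roughly 60 GB for weights alone at 4-bit.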

Scores are aggregated from epoch. llmrun does not run this benchmark itself. See the source for methodology, or the "about benchmarks" page for what the benchmark measures.