Coding

Terminal-Bench Leaderboard

Terminal-Bench measures whether a model can complete real, end-to-end tasks in a command-line environment — running commands, editing files, and chaining steps — making it an agentic test of practical software skill.

Source: epoch · 11 open models ranked · +46 proprietary · Data through Apr 2026

Open models ranked on Terminal-Bench

# shows rank among open models / rank overall (including proprietary).

#         Model                                       Score
1 / 17    GLM 5 · 753.9B                              52.4%
2 / 24    MiniMax M2.7 · 228.7B                       42.9%
3 / 25    MiniMax M2.5 · 228.7B                       42.2%
4 / 31    Kimi K2 Thinking · 1058.1B                  35.7%
5 / 36    GLM 4.7 · 358.3B                            33.3%
6 / 38    MiniMax M2.1 · 228.7B                       29.2%
7 / 39    Kimi K2 Instruct · 1026.5B                  26.7%
8 / 41    GLM 4.6 · 356.8B                            24.5%
9 / 42    Qwen3 Coder 480B A35B Instruct · 480.2B     23.9%
10 / 49   GPT OSS 120B · 120.4B                       14.2%
11 / 56   GPT OSS 20B · 21.5B                         3.1%

Score vs model size

Which models give the most quality for their size — the ones worth running locally.

[Scatter plot: Terminal-Bench score (3.1% to 52.4%) against model size (log scale, axis ticks at 100B and 1T). All 11 open models are plotted; GPT OSS 20B, GPT OSS 120B, MiniMax M2.7, and GLM 5 are labeled.]
Each dot is a model. Up = higher score, left = smaller (easier to run locally). The dashed line marks the efficiency frontier — the best score you can get at each size or smaller.
  • GPT OSS 20B, 22B, score 3.1% — on the efficiency frontier (best score at its size or smaller).
  • GPT OSS 120B, 120B, score 14.2% — on the efficiency frontier (best score at its size or smaller).
  • MiniMax M2.7, 229B, score 42.9% — on the efficiency frontier (best score at its size or smaller).
  • GLM 5, 754B, score 52.4% — on the efficiency frontier (best score at its size or smaller).
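
The frontier described above can be reproduced from the leaderboard figures. The following is a minimal sketch, not the page's own implementation: the function name `efficiency_frontier` and the tie-breaking rule (at equal size, only the higher score counts) are assumptions made for illustration. Sizes and scores are copied from the open-model table.

```python
# (name, size in billions of parameters, Terminal-Bench score in %)
# Values copied from the open-model leaderboard table.
MODELS = [
    ("GLM 5", 753.9, 52.4),
    ("MiniMax M2.7", 228.7, 42.9),
    ("MiniMax M2.5", 228.7, 42.2),
    ("Kimi K2 Thinking", 1058.1, 35.7),
    ("GLM 4.7", 358.3, 33.3),
    ("MiniMax M2.1", 228.7, 29.2),
    ("Kimi K2 Instruct", 1026.5, 26.7),
    ("GLM 4.6", 356.8, 24.5),
    ("Qwen3 Coder 480B A35B Instruct", 480.2, 23.9),
    ("GPT OSS 120B", 120.4, 14.2),
    ("GPT OSS 20B", 21.5, 3.1),
]


def efficiency_frontier(models):
    """Return the models that beat every model of equal or smaller size.

    Hypothetical helper: assumes a strict improvement in score is required
    to join the frontier, so ties at the same size keep only the best model.
    """
    frontier = []
    best_score = float("-inf")
    # Walk from smallest to largest; at equal size, visit the higher score first.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, size, score))
            best_score = score
    return frontier


for name, size, score in efficiency_frontier(MODELS):
    print(f"{name}: {size}B, {score}%")
```

Under these assumptions the sketch selects the same four models listed above: GPT OSS 20B, GPT OSS 120B, MiniMax M2.7, and GLM 5.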

Terminal-Bench: frequently asked questions

What is the best open LLM on Terminal-Bench?
GLM 5 is the top open model on Terminal-Bench, scoring 52.4%. Among all models tested — including proprietary ones — it ranks #17.
What's the best Terminal-Bench model you can run on a 24 GB GPU?
GPT OSS 20B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 12 GB), scoring 3.1% on Terminal-Bench.
What's the best Terminal-Bench model you can run on a 12 GB GPU?
GPT OSS 20B is the highest-scoring open model that fits in 12 GB at 4-bit quantization (about 12 GB), scoring 3.1% on Terminal-Bench.
Can open models match proprietary models on Terminal-Bench?
Not yet. The strongest proprietary model (gpt-5.4-2026-03-05_unknown) scores 81.8%, which is 29.4 percentage points ahead of the best open model (GLM 5) at 52.4%. The open model, however, is one you can run yourself.
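
The "fits in N GB at 4-bit quantization" answers above rest on simple arithmetic: at 4 bits each parameter takes half a byte. The sketch below shows one way such an estimate could be made. It is an illustration, not the method this page uses: the function names and the `overhead` factor of 1.1 are assumptions, and real memory use also depends on context length, KV cache, and the runtime.

```python
def quantized_weight_gb(params_billion, bits_per_weight=4):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


def fits(params_billion, vram_gb, bits_per_weight=4, overhead=1.1):
    """Hypothetical fit check.

    `overhead` is an assumed multiplier for everything beyond the raw
    weights; 1.1 is an illustrative value, not a measured one.
    """
    needed = quantized_weight_gb(params_billion, bits_per_weight) * overhead
    return needed <= vram_gb


# GPT OSS 20B is listed at 21.5B parameters, GPT OSS 120B at 120.4B.
print(quantized_weight_gb(21.5))  # weights alone at 4-bit
print(fits(21.5, 12))             # 12 GB GPU
print(fits(120.4, 24))            # 24 GB GPU
```

With these assumptions, the 21.5B model needs 10.75 GB for weights alone, which is consistent with the "about 12 GB" quoted above once some overhead is added, while the 120.4B model needs about 60 GB for weights and does not fit in 24 GB.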

Scores aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.