Terminal-Bench Leaderboard
Terminal-Bench measures whether a model can complete real, end-to-end tasks in a command-line environment — running commands, editing files, and chaining steps — making it an agentic test of practical software skill.
Source: epoch · 11 open models ranked + 46 proprietary · Data through Apr 2026
Open models ranked on Terminal-Bench
The # column shows rank among open models / rank overall (including proprietary models).
| # | Model | Score |
|---|---|---|
| 1 / 17 | GLM 5 · 753.9B | 52.4% |
| 2 / 24 | MiniMax M2.7 · 228.7B | 42.9% |
| 3 / 25 | MiniMax M2.5 · 228.7B | 42.2% |
| 4 / 31 | Kimi K2 Thinking · 1058.1B | 35.7% |
| 5 / 36 | GLM 4.7 · 358.3B | 33.3% |
| 6 / 38 | MiniMax M2.1 · 228.7B | 29.2% |
| 7 / 39 | Kimi K2 Instruct · 1026.5B | 26.7% |
| 8 / 41 | GLM 4.6 · 356.8B | 24.5% |
| 9 / 42 | Qwen3 Coder 480B A35B Instruct · 480.2B | 23.9% |
| 10 / 49 | GPT OSS 120B · 120.4B | 14.2% |
| 11 / 56 | GPT OSS 20B · 21.5B | 3.1% |
Score vs model size
Which models give the most quality for their size — the ones worth running locally.
- GPT OSS 20B, 22B, score 3.1% — on the efficiency frontier (best score at its size or smaller).
- GPT OSS 120B, 120B, score 14.2% — on the efficiency frontier (best score at its size or smaller).
- MiniMax M2.7, 229B, score 42.9% — on the efficiency frontier (best score at its size or smaller).
- GLM 5, 754B, score 52.4% — on the efficiency frontier (best score at its size or smaller).
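The frontier rule stated above ("best score at its size or smaller") can be sketched in a few lines. This is only an illustration, not the page's actual method: the `MODELS` list is copied by hand from the leaderboard table, the function name `efficiency_frontier` is hypothetical, and breaking size ties by taking the higher score first is an assumption.

```python
# (name, parameters in billions, Terminal-Bench score in %), copied from the table.
MODELS = [
    ("GLM 5", 753.9, 52.4),
    ("MiniMax M2.7", 228.7, 42.9),
    ("MiniMax M2.5", 228.7, 42.2),
    ("Kimi K2 Thinking", 1058.1, 35.7),
    ("GLM 4.7", 358.3, 33.3),
    ("MiniMax M2.1", 228.7, 29.2),
    ("Kimi K2 Instruct", 1026.5, 26.7),
    ("GLM 4.6", 356.8, 24.5),
    ("Qwen3 Coder 480B A35B Instruct", 480.2, 23.9),
    ("GPT OSS 120B", 120.4, 14.2),
    ("GPT OSS 20B", 21.5, 3.1),
]


def efficiency_frontier(models):
    """Return models that outscore every model of the same size or smaller."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from smallest to largest; at equal size, try the higher score first
    # (assumed tie-break) so only one model per size can reach the frontier.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append((name, size, score))
            best_so_far = score
    return frontier


for name, size, score in efficiency_frontier(MODELS):
    print(f"{name}: {size}B, {score}%")
```

Run on the table's data, this selects the same four models listed above: GPT OSS 20B, GPT OSS 120B, MiniMax M2.7, and GLM 5.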
Terminal-Bench: frequently asked questions
- What is the best open LLM on Terminal-Bench?
- GLM 5 is the top open model on Terminal-Bench, scoring 52.4%. Among all models tested — including proprietary ones — it ranks #17.
- What's the best Terminal-Bench model you can run on a 24 GB GPU?
- GPT OSS 20B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 12 GB), scoring 3.1% on Terminal-Bench.
- What's the best Terminal-Bench model you can run on a 12 GB GPU?
- GPT OSS 20B is the highest-scoring open model that fits in 12 GB at 4-bit quantization, though only just (about 12 GB), scoring 3.1% on Terminal-Bench.
- Can open models match proprietary models on Terminal-Bench?
- Not yet. The strongest proprietary model (gpt-5.4-2026-03-05_unknown) scores 81.8%, which is 29.4 points ahead of the best open model (GLM 5) at 52.4%, but you can run the open one yourself.
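The GPU-fit answers above depend on how much memory a model needs at 4-bit quantization. A rough lower bound can be sketched as parameters times bits per weight, divided by eight. This is an assumption-laden estimate rather than the page's method: it counts weights only and ignores the KV cache, activations, and runtime overhead, which is presumably why the page quotes about 12 GB for GPT OSS 20B instead of the weights-only figure. The function names `weights_gb` and `weights_fit` are hypothetical.

```python
def weights_gb(params_billion, bits=4):
    """Weights-only size in decimal GB: (params * bits) / 8 bits per byte."""
    return params_billion * bits / 8


def weights_fit(params_billion, vram_gb, bits=4):
    """True if the quantized weights alone fit in vram_gb (no overhead modeled)."""
    return weights_gb(params_billion, bits) <= vram_gb


# Parameter counts (in billions) taken from the leaderboard table.
SIZES_B = {"GPT OSS 20B": 21.5, "GPT OSS 120B": 120.4, "MiniMax M2.7": 228.7}

for name, size in SIZES_B.items():
    print(name, round(weights_gb(size), 2), "GB, fits in 24 GB:", weights_fit(size, 24))
```

Under this estimate, GPT OSS 20B is the only open model on the leaderboard whose 4-bit weights fit in 24 GB; the next smallest, GPT OSS 120B, needs roughly 60 GB for weights alone.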
Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the about benchmarks page for what it measures.