Coding
Terminal-Bench Leaderboard
Terminal-Bench measures whether a model can complete real, end-to-end tasks in a command-line environment — running commands, editing files, and chaining steps — making it an agentic test of practical software skill.
Source: epoch · 11 open models ranked + 46 proprietary · Data through Apr 2026
All models ranked on Terminal-Bench
Proprietary / closed models are shown dimmed — you can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.4-2026-03-05_unknown · proprietary | 81.8% |
| 2 | claude-opus-4-7_unknown · proprietary | 80.2% |
| 3 | claude-opus-4-6 · proprietary | 69.9% |
| 4 | gpt-5.2-codex · proprietary | 66.5% |
| 5 | gpt-5.5_unknown · proprietary | 66.1% |
| 6 | gpt-5.3-codex · proprietary | 64.7% |
| 7 | claude-opus-4-5-20251101 · proprietary | 63.1% |
| 8 | gpt-5.2-2025-12-11_medium · proprietary | 62.9% |
| 9 | gemini-3.1-pro-preview · proprietary | 61.4% |
| 10 | gpt-5.1-codex-max · proprietary | 60.4% |
| 11 | claude-opus-4-5-20251101_128K · proprietary | 59.1% |
| 12 | claude-opus-4-6_unknown · proprietary | 58.0% |
| 13 | grok-4-20 · proprietary | 57.3% |
| 14 | gemini-3-pro-preview · proprietary | 54.2% |
| 15 | gpt-5.2-2025-12-11_unknown · proprietary | 54.0% |
| 16 | claude-sonnet-4-6_unknown · proprietary | 53.4% |
| 17 | GLM 5 · 753.9B | 52.4% |
| 18 | claude-opus-4-5-20251101_unknown · proprietary | 51.7% |
| 19 | gemini-3-flash-preview · proprietary | 51.0% |
| 20 | gpt-5.1-2025-11-13_medium · proprietary | 47.6% |
| 21 | gpt-5.1-2025-11-13_unknown · proprietary | 47.6% |
| 22 | kimi-k2.5 · proprietary | 43.2% |
| 23 | gpt-5.1-codex-mini · proprietary | 43.1% |
| 24 | MiniMax M2.7 · 228.7B | 42.9% |
| 25 | MiniMax M2.5 · 228.7B | 42.2% |
| 26 | gpt-5-codex · proprietary | 41.3% |
| 27 | claude-sonnet-4-5-20250929 · proprietary | 40.1% |
| 28 | claude-sonnet-4-5-20250929_unknown · proprietary | 40.1% |
| 29 | deepseek/deepseek-v3.2 · proprietary | 39.6% |
| 30 | gpt-5.1-codex · proprietary | 36.9% |
| 31 | Kimi K2 Thinking · 1058.1B | 35.7% |
| 32 | claude-opus-4-1-20250805 · proprietary | 34.8% |
| 33 | claude-opus-4-1-20250805_unknown · proprietary | 34.8% |
| 34 | gpt-5-2025-08-07_medium · proprietary | 33.9% |
| 35 | gpt-5-2025-08-07_unknown · proprietary | 33.9% |
| 36 | GLM 4.7 · 358.3B | 33.3% |
| 37 | MiniMax-M2 · proprietary | 30.0% |
| 38 | MiniMax M2.1 · 228.7B | 29.2% |
| 39 | Kimi K2 Instruct · 1026.5B | 26.7% |
| 40 | qwen3.6-35b-a3b · proprietary | 24.6% |
| 41 | GLM 4.6 · 356.8B | 24.5% |
| 42 | Qwen3 Coder 480B A35B Instruct · 480.2B | 23.9% |
| 43 | grok-4-0709 · proprietary | 23.1% |
| 44 | gpt-5-mini-2025-08-07_medium · proprietary | 22.2% |
| 45 | gpt-5-mini-2025-08-07_unknown · proprietary | 22.2% |
| 46 | gemini-2.5-pro · proprietary | 16.4% |
| 47 | gemini-2.5-flash · proprietary | 15.4% |
| 48 | gemini-2.5-flash-preview-09-2025 · proprietary | 15.4% |
| 49 | GPT OSS 120B · 120.4B | 14.2% |
| 50 | gpt-oss-120b_unknown · proprietary | 14.2% |
| 51 | grok-code-fast-1 · proprietary | 14.2% |
| 52 | claude-haiku-4-5-20251001 · proprietary | 13.9% |
| 53 | claude-haiku-4-5-20251001_unknown · proprietary | 13.9% |
| 54 | gpt-5-nano-2025-08-07_medium · proprietary | 7.0% |
| 55 | gpt-5-nano-2025-08-07_unknown · proprietary | 7.0% |
| 56 | GPT OSS 20B · 21.5B | 3.1% |
| 57 | gpt-oss-20b_unknown · proprietary | 3.1% |
Score vs model size
Which models give the most quality for their size — the ones worth running locally.
- GPT OSS 20B, 22B, score 3.1% — on the efficiency frontier (best score at its size or smaller).
- GPT OSS 120B, 120B, score 14.2% — on the efficiency frontier (best score at its size or smaller).
- MiniMax M2.7, 229B, score 42.9% — on the efficiency frontier (best score at its size or smaller).
- GLM 5, 754B, score 52.4% — on the efficiency frontier (best score at its size or smaller).
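The frontier rule quoted above ("best score at its size or smaller") can be sketched in a few lines. This is an illustrative reconstruction, not the site's actual code: the `Model` type and `efficiency_frontier` function are hypothetical names, and the data is copied from the open-model rows of the table.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions (from the table)
    score: float     # Terminal-Bench score in percent


def efficiency_frontier(models):
    """Keep each model that outscores every model of equal or smaller size."""
    frontier = []
    best = float("-inf")
    # Walk from smallest to largest; at equal size, visit the best score first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.score)):
        if m.score > best:
            frontier.append(m)
            best = m.score
    return frontier


# The 11 open models listed in the leaderboard table.
OPEN_MODELS = [
    Model("GLM 5", 753.9, 52.4),
    Model("MiniMax M2.7", 228.7, 42.9),
    Model("MiniMax M2.5", 228.7, 42.2),
    Model("Kimi K2 Thinking", 1058.1, 35.7),
    Model("GLM 4.7", 358.3, 33.3),
    Model("MiniMax M2.1", 228.7, 29.2),
    Model("Kimi K2 Instruct", 1026.5, 26.7),
    Model("GLM 4.6", 356.8, 24.5),
    Model("Qwen3 Coder 480B A35B Instruct", 480.2, 23.9),
    Model("GPT OSS 120B", 120.4, 14.2),
    Model("GPT OSS 20B", 21.5, 3.1),
]

frontier_names = [m.name for m in efficiency_frontier(OPEN_MODELS)]
print(frontier_names)
```

Run on the table's data, this yields the same four models as the list above. GLM 4.7, for example, drops out because the smaller MiniMax M2.7 already scores higher.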
Terminal-Bench: frequently asked questions
- What is the best open LLM on Terminal-Bench?
- GLM 5 is the top open model on Terminal-Bench, scoring 52.4%. Among all models tested — including proprietary ones — it ranks #17.
- What's the best Terminal-Bench model you can run on a 24 GB GPU?
- GPT OSS 20B is the highest-scoring open model that fits in 24 GB at 4-bit quantization (about 12 GB), scoring 3.1% on Terminal-Bench.
- What's the best Terminal-Bench model you can run on a 12 GB GPU?
- GPT OSS 20B is the highest-scoring open model that fits in 12 GB at 4-bit quantization, though only barely (about 12 GB, leaving little headroom), scoring 3.1% on Terminal-Bench.
- Can open models match proprietary models on Terminal-Bench?
- Not yet: the strongest proprietary model (gpt-5.4-2026-03-05_unknown) scores 81.8%, 29.4 points ahead of the best open model (GLM 5) at 52.4%. The open model, however, is one you can run yourself.
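The GPU answers above rest on a memory estimate at 4-bit quantization. A rough sketch of that arithmetic follows, under loud assumptions: the 1.1 overhead factor is a guess chosen for illustration and is not the page's stated method, and the function names `estimate_vram_gb` and `best_fit` are hypothetical. Weights alone take `params × bits / 8` bytes.

```python
def estimate_vram_gb(params_b, bits=4, overhead=1.1):
    """Approximate memory in GB for a model with params_b billion parameters.

    Weights need params * bits / 8 bytes; `overhead` is an ASSUMED fudge
    factor for runtime buffers, not a figure taken from the leaderboard.
    """
    return params_b * bits / 8 * overhead


def best_fit(models, vram_gb, bits=4):
    """Return the highest-scoring (name, params_b, score) tuple that fits, or None."""
    fitting = [m for m in models if estimate_vram_gb(m[1], bits) <= vram_gb]
    return max(fitting, key=lambda m: m[2], default=None)


# (name, parameters in billions, Terminal-Bench score in percent), from the table.
OPEN_MODELS = [
    ("GLM 5", 753.9, 52.4),
    ("MiniMax M2.7", 228.7, 42.9),
    ("GPT OSS 120B", 120.4, 14.2),
    ("GPT OSS 20B", 21.5, 3.1),
]

for budget in (12, 24):
    print(budget, "GB:", best_fit(OPEN_MODELS, budget))
```

Under these assumptions GPT OSS 20B comes out near 12 GB and is the only listed open model that fits either budget. Real usage also depends on context length and the inference runtime, so treat the estimate as a lower bound.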
Scores aggregated from epoch. llmrun does not run this benchmark; see the source for methodology, or the About benchmarks page for what it measures.