LLM Benchmarks

Public benchmark scores for open models you can actually run — every leaderboard mapped to the hardware it takes. Find the strongest model that fits your GPU.

New here? Read about benchmarks to learn what each one measures.

Leaderboards: 22
Skill areas: 4
Ranked model scores: 588

Reasoning

Multi-step logic and problem-solving — can the model think through a hard problem rather than recall an answer.

GPQA Diamond

Industry standard

GPQA Diamond is a set of extremely hard, graduate-level science questions (physics, chemistry, biology) written by domain experts and filtered so that skilled non-experts with web access still fail. It measures genuine reasoning rather than memorization.

GPT 5.4 Pro (Mar 05, 2026, xhigh) · OpenAI · 94.6%

#17

Kimi K2.6 · open · 90.8%

182 models · 46 open · via epoch

ARC-AGI

Industry standard

ARC-AGI tests fluid, abstract reasoning on small visual grid puzzles where each task follows a novel rule the model must infer from a few examples. It deliberately resists memorization and is one of the most-watched measures of general reasoning progress.

Gemini 3.1 Pro Preview · Google DeepMind · 98.0%

#55

Kimi K2.5 · open · 65.3%

158 models · 10 open · via epoch
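To make the ARC-AGI task format concrete, here is a minimal toy sketch: a few demonstration grid pairs, a candidate rule that must reproduce every demonstration exactly, and a prediction for an unseen test grid. The rule names (`transpose`, `flip_horizontal`) and the tiny task are hypothetical illustrations, not real ARC-AGI data. Real tasks are far harder, and a model is not handed a list of candidate rules.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]  # each cell holds a colour index
Rule = Callable[[Grid], Grid]


def transpose(grid: Grid) -> Grid:
    """Hypothetical candidate rule: swap rows and columns."""
    return [list(col) for col in zip(*grid)]


def flip_horizontal(grid: Grid) -> Grid:
    """Hypothetical candidate rule: mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def rule_fits(rule: Rule, examples: List[Tuple[Grid, Grid]]) -> bool:
    """Keep a rule only if it reproduces every demonstration output exactly."""
    return all(rule(inp) == out for inp, out in examples)


def solve(rules: List[Rule], examples: List[Tuple[Grid, Grid]],
          test_input: Grid) -> Optional[Grid]:
    """Apply the first rule consistent with all demonstrations, if any."""
    for rule in rules:
        if rule_fits(rule, examples):
            return rule(test_input)
    return None


# Toy task (made up for illustration): the hidden rule is a horizontal flip.
train = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]
test_input = [[0, 5], [6, 0]]

prediction = solve([transpose, flip_horizontal], train, test_input)
print(prediction)  # [[5, 0], [0, 6]]
```

A prediction counts only when the whole output grid matches, which is why partial pattern recall earns nothing on this kind of puzzle.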

LiveBench Reasoning

Industry standard

LiveBench Reasoning measures logical, multi-step reasoning using questions that are refreshed regularly to limit contamination, so models are unlikely to have trained on the test set.

GPT 5.6 Sol Max · OpenAI · 91.7

#22

Kimi K2.7 Code · open · 82.8

37 models · 8 open · via livebench

BIG-Bench Hard

BIG-Bench Hard (BBH) is the 23-task subset of BIG-Bench that earlier models found hardest — multi-step reasoning, logic, and word problems. It's a long-standing standard for tracking reasoning ability across model generations.

Gemini 1.5 Pro 001 · Google DeepMind · 89.2%

DeepSeek v3 · open · 87.5%

50 models · 37 open · via epoch

SimpleBench

SimpleBench is a set of everyday, common-sense and trick questions that humans answer easily but language models often get wrong. It probes basic reasoning and robustness rather than specialist knowledge.

Claude Fable 5 Max · Anthropic · 81.9%

#22

DeepSeek V4 Pro · open · 61.2%

90 models · 19 open · via epoch

Coding

Writing, editing, and fixing real code — measured by whether the resulting program actually runs and passes tests.

LiveBench Coding

Industry standard

LiveBench Coding evaluates code generation and completion on fresh, contamination-free programming tasks that are updated regularly.

Claude Fable 5 Max · Anthropic · 86.0

#11

GLM 5.2 · open · 79.7

37 models · 8 open · via livebench

SWE-bench Verified

Industry standard

SWE-bench Verified tests whether a model can resolve real GitHub issues from popular open-source Python projects, scored on the official swebench.com leaderboard as the percentage of human-validated issues actually fixed. It is the headline measure of practical, agentic software-engineering ability — where open-weight models like Qwen3-Coder, GLM-4.6, Kimi K2 and DeepSWE are now competitive with the frontier.

live-SWE-agent + Claude 4.5 Opus medium (20251101) · 79.2%

#32

Kimi K2 Instruct 0905 · open · 71.2%

163 models · 13 open · via swebench
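The SWE-bench Verified score is a simple ratio, which a short sketch can make explicit: an issue counts as resolved only when its validating tests pass after the model's patch is applied, and the headline number is resolved issues over all issues. The `IssueResult` type, its field names, and the sample results below are hypothetical. This is not the official harness.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IssueResult:
    issue_id: str        # hypothetical identifier for one benchmark issue
    tests_passed: bool   # True if the validating tests pass after the patch


def resolved_rate(results: List[IssueResult]) -> float:
    """Percentage of issues whose validating tests pass after patching."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if r.tests_passed)
    return 100.0 * resolved / len(results)


# Made-up results for illustration only.
results = [
    IssueResult("issue-1", True),
    IssueResult("issue-2", False),
    IssueResult("issue-3", True),
    IssueResult("issue-4", True),
]
print(f"{resolved_rate(results):.1f}%")  # 75.0%
```

Because every issue is pass or fail, a patch that almost works scores the same as no patch at all.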

SWE-bench Bash Only

SWE-bench Bash Only runs the SWE-bench Verified issues through a minimal, single-tool bash agent with no specialised scaffolding, so the score reflects the model's own agentic coding ability rather than the harness around it. That makes it a cleaner, harder read on raw software-engineering skill.

Claude 4.5 Opus (high reasoning) · Anthropic · 76.8%

#24

Kimi K2 Thinking · open · 63.4%

48 models · 9 open · via swebench

Terminal-Bench

Terminal-Bench measures whether a model can complete real, end-to-end tasks in a command-line environment — running commands, editing files, and chaining steps — making it an agentic test of practical software skill.

Claude Opus 4.7 (unspecified) · Anthropic · 90.2%

#22

GLM 5 · open · 52.4%

57 models · 16 open · via epoch

Aider Polyglot

The Aider Polyglot benchmark measures real-world coding across several programming languages: the model edits code to solve Exercism exercises, and is scored on whether the final solution actually runs and passes the tests.

GPT 5 (Aug 07, 2025, high) · OpenAI · 88.0%

#13

DeepSeek V3.2 Exp · open · 74.2%

69 models · 18 open · via epoch

SWE-bench Multilingual

SWE-bench Multilingual extends SWE-bench beyond Python to real GitHub issues across many programming languages, measuring whether a model can fix bugs in codebases written in Java, Go, Rust, TypeScript and more.

Gemini 3 Flash · Google · 72.7%

GLM 5 · open · 69.7%

14 models · 4 open · via swebench

SWE-bench Lite

SWE-bench Lite is a smaller, lower-cost subset of SWE-bench focused on self-contained bug fixes. It is the quickest of the SWE-bench boards to run and a common entry point for comparing coding agents.

ExpeRepair-v1.0 + Claude 4 Sonnet · 60.3%

Qwen3 Coder 30B A3B Instruct · open · 49.7%

80 models · 3 open · via swebench

Math

Symbolic and quantitative problem-solving, from competition math to multi-step calculation.

AIME 2024/2025

Industry standard

AIME (American Invitational Mathematics Examination) is a prestigious high-school competition of hard, integer-answer problems. It's a widely cited yardstick for multi-step mathematical reasoning, scored here on the 2024–2025 papers.

GPT 5.6 Sol Max · OpenAI · 100.0%

#11

DeepSeek V4 Pro · open · 96.7%

155 models · 34 open · via epoch
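Because AIME answers are integers, grading can be reduced to an exact match on the final number. The sketch below shows one simple way to do that. The last-integer extraction heuristic, the function names, and the sample responses are all assumptions made for illustration. Real evaluation harnesses differ in how they parse a model's answer.

```python
import re
from typing import List, Optional


def extract_final_integer(response: str) -> Optional[int]:
    """Assumed heuristic: treat the last integer in the text as the answer."""
    matches = re.findall(r"\d+", response)
    return int(matches[-1]) if matches else None


def is_correct(response: str, expected: int) -> bool:
    """Exact match only: no partial credit for a near miss."""
    return extract_final_integer(response) == expected


def accuracy(responses: List[str], answers: List[int]) -> float:
    """Percentage of problems answered with exactly the expected integer."""
    correct = sum(is_correct(r, a) for r, a in zip(responses, answers))
    return 100.0 * correct / len(answers)


# Made-up responses and answers for illustration only.
responses = [
    "Summing the cases gives 204.",
    "First I get 17, so the total is 113.",
    "I could not find a solution.",
]
answers = [204, 113, 25]

print(f"{accuracy(responses, answers):.1f}%")  # 66.7%
```

Exact-match grading is what makes a long chain of reasoning fragile: one arithmetic slip anywhere in the chain turns the whole problem into a miss.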

MATH Level 5

Industry standard

MATH Level 5 covers the hardest tier of competition-mathematics problems, testing multi-step symbolic and quantitative reasoning.

GPT 5 (Aug 07, 2025, high) · OpenAI · 98.1%

DeepSeek R1 0528 · open · 96.6%

108 models · 32 open · via epoch

LiveBench Math

Industry standard

LiveBench Math measures mathematical problem-solving on contamination-free, regularly refreshed questions, including competition-style problems.

GPT 5.6 Sol Max · OpenAI · 96.2

#15

DeepSeek V4 Pro · open · 90.7

37 models · 8 open · via livebench

FrontierMath

Industry standard

FrontierMath is a benchmark of exceptionally hard, original research-level mathematics problems created with professional mathematicians. Even the strongest models listed here solve only about half of the problems, making it a frontier measure of genuine mathematical ability.

GPT 5.5 Pro Pre Release (high) · OpenAI · 52.4%

#14

Kimi K2.6 · open · 39.0%

101 models · 12 open · via epoch

GSM8K

GSM8K is a classic set of grade-school math word problems requiring several arithmetic reasoning steps. It was long a default sanity check for reasoning, but top models now score near the ceiling, so it is most useful for separating smaller models.

DeepSeek Coder v2 Instruct · open · 94.5%

GPT 4 (Mar 14) · OpenAI · 92.0%

93 models · 59 open · via epoch

Knowledge

Breadth and depth of factual knowledge across many subjects, paired with the reasoning to apply it.

MMLU

Industry standard

MMLU (Massive Multitask Language Understanding) spans 57 subjects from history to law to medicine as multiple-choice questions. It is the long-standing default for broad knowledge and remains the most widely reported general benchmark.

GPT 4o (Nov 20, 2024) · OpenAI · 88.1%

DeepSeek v3 · open · 87.2%

136 models · 76 open · via epoch

Humanity's Last Exam

Industry standard

Humanity's Last Exam (HLE) is a set of extremely difficult, expert-written questions across many fields, designed so that even frontier models score low. It is built to stay hard as models improve, measuring the true knowledge frontier.

Gemini 3.1 Pro Preview · Google DeepMind · 46.4%

#13

Kimi K2.5 · open · 24.4%

46 models · 4 open · via epoch

SimpleQA

SimpleQA measures factual accuracy on short, fact-seeking questions with a single correct answer — directly probing how often a model is right versus confidently wrong (hallucination) on simple facts.

Gemini 3.1 Pro Preview · Google DeepMind · 77.3%

#12

DeepSeek V4 Pro · open · 57.0%

65 models · 11 open · via epoch
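The distinction SimpleQA draws between being right and being confidently wrong can be sketched as a tally over per-question grades. The sketch assumes each answer has already been graded as `correct`, `incorrect`, or `not_attempted`. The function name, the output keys, and the sample grades are hypothetical, and the sketch does not claim to reproduce how the percentage on this leaderboard is computed.

```python
from collections import Counter
from typing import Dict, List


def summarize(grades: List[str]) -> Dict[str, float]:
    """Tally graded answers into percentages of all questions asked."""
    if not grades:
        raise ValueError("no grades to summarize")
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Right answers as a share of every question.
        "correct_pct": 100.0 * counts["correct"] / total,
        # Wrong answers given anyway: the hallucination signal.
        "confidently_wrong_pct": 100.0 * counts["incorrect"] / total,
        # Questions the model declined to answer.
        "not_attempted_pct": 100.0 * counts["not_attempted"] / total,
        # Precision on the questions the model chose to answer.
        "correct_given_attempted_pct": (
            100.0 * counts["correct"] / attempted if attempted else 0.0
        ),
    }


# Made-up grades for illustration only.
grades = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
summary = summarize(grades)
print(summary["correct_pct"], summary["confidently_wrong_pct"])  # 60.0 20.0
```

Separating declined questions from wrong ones matters: two models with the same share of correct answers can differ sharply in how often they assert something false.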

HellaSwag

HellaSwag is a commonsense benchmark that asks a model to pick the most plausible continuation of an everyday situation. The wrong options are adversarially chosen to fool models, testing grounded commonsense understanding.

GPT 4 32K (Mar 14) · OpenAI · 95.3%

Llama 3.1 405B · open · 89.2%

76 models · 42 open · via epoch

MMLU-Pro

MMLU-Pro is a harder, cleaned-up successor to MMLU with ten answer choices and more reasoning-heavy questions across 14 subjects, measuring broad knowledge and reasoning together.

Gemini-3.1-Pro · Google · 91.2%

MiniMax M2.1 · open · 88.0%

259 models · 119 open · via tigerlab
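One practical effect of MMLU-Pro's ten answer choices is a lower floor for random guessing, which the arithmetic below makes explicit. It is a rough sketch that assumes every question has exactly ten options and that a guesser picks uniformly at random. The function names are hypothetical, and the four-choice comparison reflects the original MMLU format rather than anything stated on this page.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected percentage score from uniform random guessing."""
    if num_choices < 2:
        raise ValueError("a multiple-choice question needs at least 2 options")
    return 100.0 / num_choices


def chance_adjusted(score_pct: float, num_choices: int) -> float:
    """Rescale a score so random guessing maps to 0 and perfect to 100."""
    chance = chance_accuracy(num_choices)
    return 100.0 * (score_pct - chance) / (100.0 - chance)


print(chance_accuracy(10))  # 10.0 (ten options, as in MMLU-Pro)
print(chance_accuracy(4))   # 25.0 (four options, as in the original MMLU)

# The top score listed above, rescaled under the ten-option assumption.
print(round(chance_adjusted(91.2, 10), 1))  # 90.2
```

With ten options, guessing contributes far less to a raw score, so percentages on this board sit closer to what the model actually worked out.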

Scores are aggregated from public benchmark sources and attributed on each leaderboard. llmrun does not run these benchmarks.