About LLM Benchmarks

Benchmarks are standardized tests that measure what a language model can actually do: reason, write code, solve math problems, follow instructions. This page explains the benchmarks llmrun aggregates, what each one measures, and how to read the scores so you can pick the best open model you can run on your hardware.

How llmrun aggregates scores

llmrun does not run any benchmark itself. We collect published results from public, permissively licensed sources, attribute each score to its origin, and line them up next to the VRAM and GPU requirements for every open model. There is no proprietary composite and no re-scoring, just the field's own numbers, in one place, mapped to what you can run.
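As a rough illustration of what aggregation with attribution could look like, here is a minimal Python sketch. The `ScoreRecord` name, its fields, and the sample values are hypothetical placeholders; this is not llmrun's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    """One published score kept together with its origin (hypothetical schema)."""
    model: str
    benchmark: str
    score: float   # stored as published by the source, never re-computed
    source: str    # where the number came from


def group_by_benchmark(records):
    """Build one leaderboard per benchmark, highest score first."""
    boards = defaultdict(list)
    for rec in records:
        boards[rec.benchmark].append(rec)
    for recs in boards.values():
        recs.sort(key=lambda r: r.score, reverse=True)
    return dict(boards)


# Placeholder values for illustration only, not real results.
records = [
    ScoreRecord("model-b", "GPQA Diamond", 48.5, "Epoch AI"),
    ScoreRecord("model-a", "GPQA Diamond", 61.0, "Epoch AI"),
    ScoreRecord("model-a", "Aider Polyglot", 55.0, "Aider"),
]

boards = group_by_benchmark(records)
for name, recs in boards.items():
    print(name, [(r.model, r.score, r.source) for r in recs])
```

In this sketch, model-b has no Aider Polyglot entry, so it simply does not appear on that board. No score is estimated or filled in for it.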

The categories

We group benchmarks by the skill they probe. Pick a leaderboard to see the ranked open models and which ones fit your hardware.

Reasoning

Multi-step logic and problem-solving: can the model think through a hard problem rather than recall an answer?

GPQA Diamond — GPQA Diamond is a set of extremely hard, graduate-level science questions (physics, chemistry, biology) written by domain experts and filtered so that skilled non-experts with web access still usually get them wrong. It measures genuine reasoning rather than memorization.
ARC-AGI — ARC-AGI tests fluid, abstract reasoning on small visual grid puzzles where each task follows a novel rule the model must infer from a few examples. It deliberately resists memorization and is one of the most-watched measures of general reasoning progress.
LiveBench Reasoning — LiveBench Reasoning measures logical, multi-step reasoning using contamination-free questions that are refreshed regularly, so models cannot have trained on the test set.
BIG-Bench Hard — BIG-Bench Hard (BBH) is the 23-task subset of BIG-Bench that earlier models found hardest — multi-step reasoning, logic, and word problems. It's a long-standing standard for tracking reasoning ability across model generations.
SimpleBench — SimpleBench is a set of everyday, common-sense and trick questions that humans answer easily but language models often get wrong. It probes basic reasoning and robustness rather than specialist knowledge.

Coding

Writing, editing, and fixing real code — measured by whether the resulting program actually runs and passes tests.

LiveBench Coding — LiveBench Coding evaluates code generation and completion on fresh, contamination-free programming tasks that are updated regularly.
SWE-bench Verified — SWE-bench Verified tests whether a model can resolve real GitHub issues from popular open-source Python projects, scored on the official swebench.com leaderboard as the percentage of human-validated issues actually fixed. It is the headline measure of practical, agentic software-engineering ability, and one where open-weight models such as Qwen3-Coder, GLM-4.6, Kimi K2 and DeepSWE are now competitive with the frontier.
SWE-bench Bash Only — SWE-bench Bash Only runs the SWE-bench Verified issues through a minimal, single-tool bash agent with no specialized scaffolding, so the score reflects the model's own agentic coding ability rather than the harness around it. It gives a cleaner, harder read on raw software-engineering skill.
Terminal-Bench — Terminal-Bench measures whether a model can complete real, end-to-end tasks in a command-line environment — running commands, editing files, and chaining steps — making it an agentic test of practical software skill.
Aider Polyglot — The Aider Polyglot benchmark measures real-world coding across several programming languages: the model edits code to solve Exercism exercises, and is scored on whether the final solution actually runs and passes the tests.
SWE-bench Multilingual — SWE-bench Multilingual extends SWE-bench beyond Python to real GitHub issues across many programming languages, measuring whether a model can fix bugs in codebases written in Java, Go, Rust, TypeScript and more.
SWE-bench Lite — SWE-bench Lite is a smaller, lower-cost subset of SWE-bench focused on self-contained bug fixes. It is the quickest of the SWE-bench boards to run and a common entry point for comparing coding agents.

Math

Symbolic and quantitative problem-solving, from competition math to multi-step calculation.

AIME 2024/2025 — AIME (American Invitational Mathematics Examination) is a prestigious high-school competition of hard, integer-answer problems. It's a widely cited yardstick for multi-step mathematical reasoning; the scores here use the 2024 and 2025 papers.
MATH Level 5 — MATH Level 5 covers the hardest tier of competition-mathematics problems, testing multi-step symbolic and quantitative reasoning.
LiveBench Math — LiveBench Math measures mathematical problem-solving on contamination-free, regularly refreshed questions, including competition-style problems.
FrontierMath — FrontierMath is a benchmark of exceptionally hard, original research-level mathematics problems created with professional mathematicians. Even the strongest models solve only a small fraction, making it a frontier measure of genuine mathematical ability.
GSM8K — GSM8K is a classic set of grade-school math word problems requiring several arithmetic reasoning steps. It was long a default sanity check for reasoning, but top models now score near the ceiling, so it is most useful for separating smaller models.

Knowledge

Breadth and depth of factual knowledge across many subjects, paired with the reasoning to apply it.

MMLU — MMLU (Massive Multitask Language Understanding) spans 57 subjects from history to law to medicine as multiple-choice questions. It is the long-standing default for broad knowledge and remains the most widely reported general benchmark.
Humanity's Last Exam — Humanity's Last Exam (HLE) is a set of extremely difficult, expert-written questions across many fields, designed so that even frontier models score low. It is built to stay hard as models improve, measuring the true knowledge frontier.
SimpleQA — SimpleQA measures factual accuracy on short, fact-seeking questions with a single correct answer — directly probing how often a model is right versus confidently wrong (hallucination) on simple facts.
HellaSwag — HellaSwag is a commonsense benchmark that asks a model to pick the most plausible continuation of an everyday situation. The wrong options are adversarially chosen to fool models, testing grounded commonsense understanding.
MMLU-Pro — MMLU-Pro is a harder, cleaned-up successor to MMLU with ten answer choices and more reasoning-heavy questions across 14 subjects, measuring broad knowledge and reasoning together.
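One reason scores do not transfer between benchmarks is that the random-guess baseline differs. MMLU questions have four answer choices, while MMLU-Pro is described above as having ten. The sketch below is illustrative arithmetic only, assuming every question offers the full number of options. The function names are hypothetical, and llmrun does not rescale scores this way.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy (in percent) from guessing uniformly at random."""
    if num_choices < 2:
        raise ValueError("a multiple-choice question needs at least 2 options")
    return 100.0 / num_choices


def above_chance(score: float, num_choices: int) -> float:
    """Rescale a percentage so 0 means random guessing and 100 means perfect."""
    base = chance_accuracy(num_choices)
    return 100.0 * (score - base) / (100.0 - base)


# The same raw 70% on a four-choice test and on a ten-choice test.
four_choice = above_chance(70.0, num_choices=4)
ten_choice = above_chance(70.0, num_choices=10)
print(chance_accuracy(4), chance_accuracy(10))
print(round(four_choice, 1), round(ten_choice, 1))
```

Under these assumptions, the same raw 70% sits further above chance on a ten-choice test than on a four-choice one, which is one more reason to compare models within a single benchmark.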

Reading the scores

Most scores are an accuracy or pass-rate (a percentage of items answered correctly); suites like LiveBench use a normalized 0–100 score. Higher is always better here. Each benchmark has its own scale and difficulty, so compare models within a single leaderboard rather than across different ones. On every leaderboard, the score-vs-size chart highlights models that punch above their weight — high quality at a small size is what makes a model worth running locally.
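To make the two ideas above concrete, here is a minimal Python sketch: a pass-rate calculation, and one possible way to pick out models that "punch above their weight" on a score-vs-size view. The Pareto-style rule, the function names, and the sample numbers are assumptions for illustration. The page does not specify how llmrun's chart chooses what to highlight.

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of test items answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def pareto_front(models):
    """Names of models that no other model beats on both size and score.

    `models` is a list of (name, size_in_billions, score) tuples from ONE
    benchmark. A model is dropped if another is no larger and scores at
    least as well, and is strictly better on at least one of the two.
    """
    front = []
    for name, size, score in models:
        dominated = any(
            o_size <= size and o_score >= score
            and (o_size < size or o_score > score)
            for _, o_size, o_score in models
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder numbers for a single benchmark, not real results.
sample = [
    ("small-7b", 7, 60.0),
    ("medium-32b", 32, 58.0),
    ("large-70b", 70, 72.0),
]

print(pass_rate(150, 200))
print(pareto_front(sample))
```

Here medium-32b is dropped because small-7b is both smaller and higher-scoring, while large-70b stays because nothing smaller matches its score. The comparison only makes sense when all scores come from the same leaderboard.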

Open vs proprietary models

Open models have downloadable weights. They are llmrun's focus, and each links to its VRAM and GPU requirements. Proprietary models (GPT, Gemini, Claude and friends) are shown dimmed behind the 'All models' toggle on each leaderboard, so you can see where the open field stands against the closed frontier without pretending you can run them at home.

FAQ

Does llmrun run these benchmarks itself?

No. llmrun is an aggregator: it collects published scores from public sources (Epoch AI, LiveBench, Aider and others), attributes each to its origin, and presents them in one place alongside the hardware you'd need to run each open model. We never re-run a benchmark or invent a composite score.

What does a benchmark score actually mean?

Most scores are an accuracy or pass-rate: the fraction of test items the model got right (shown as a percentage), or a normalized 0–100 score for suites like LiveBench. Higher is better. Because each benchmark uses its own scale and methodology, compare models within a single benchmark rather than across different ones.

Why do some leaderboards show fewer models than others?

We only list a model on a benchmark when a trustworthy public score exists for it. Newer or niche open models may not have been evaluated on every benchmark yet, so coverage varies by leaderboard.

What's the difference between open and proprietary models here?

Open models have downloadable weights you can run on your own hardware. Those are llmrun's focus, and each links to a page with VRAM and GPU requirements. Proprietary (closed) models are shown dimmed behind the 'All models' toggle for context: you can see where the open field stands relative to the frontier, even though you can't run them locally.

How fresh are the scores?

Each leaderboard shows when its data was last refreshed. Several sources (LiveBench, Aider) update on a rolling basis with contamination-free questions, and we re-import periodically.
