About LLM Benchmarks
Benchmarks are standardized tests that measure what a language model can actually do: reason, write code, solve math problems, follow instructions. This page explains the benchmarks llmrun aggregates, what each one measures, and how to read the scores so you can pick the best open model you can run on your hardware.
How llmrun aggregates scores
llmrun does not run any benchmark itself. We collect published results from public, permissively licensed sources, attribute each score to its origin, and line them up next to the VRAM and GPU requirements for every open model. No proprietary composite, no re-scoring: just the field's own numbers, in one place, mapped to what you can run.
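The collect, attribute, and line-up flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `ScoreRecord`, `HardwareReq`, and `line_up` are hypothetical, and the model names, sources, and numbers are placeholders rather than real results or llmrun's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    """One published score, kept exactly as the source reported it."""
    model: str
    benchmark: str
    score: float   # never re-scored or blended into a composite
    source: str    # attribution: where the number was published


@dataclass(frozen=True)
class HardwareReq:
    """Hypothetical hardware entry for an open model."""
    model: str
    min_vram_gb: float


def line_up(scores, hardware):
    """Pair each published score with the model's hardware requirement."""
    vram_by_model = {h.model: h.min_vram_gb for h in hardware}
    rows = []
    for s in scores:
        rows.append({
            "model": s.model,
            "benchmark": s.benchmark,
            "score": s.score,
            "source": s.source,
            # None when no hardware entry exists for this model
            "min_vram_gb": vram_by_model.get(s.model),
        })
    return rows


# Placeholder values for illustration only, not real benchmark results.
scores = [
    ScoreRecord("model-a", "GSM8K", 91.0, "example-source-1"),
    ScoreRecord("model-b", "GSM8K", 84.5, "example-source-2"),
]
hardware = [HardwareReq("model-a", 24.0)]

rows = line_up(scores, hardware)
for row in rows:
    print(row)
```

Each row keeps the score and its source untouched; the only thing added is the hardware column, which stays empty when no requirement is known.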
The categories
We group benchmarks by the skill they probe. Pick a leaderboard to see the ranked open models and which ones fit your hardware.
Reasoning
Multi-step logic and problem-solving — can the model think through a hard problem rather than recall an answer.
- GPQA Diamond — GPQA Diamond is a set of extremely hard, graduate-level science questions (physics, chemistry, biology) written by domain experts and filtered so that skilled non-experts with web access still fail. It measures genuine reasoning rather than memorization.
- ARC-AGI — ARC-AGI tests fluid, abstract reasoning on small visual grid puzzles where each task follows a novel rule the model must infer from a few examples. It deliberately resists memorization and is one of the most-watched measures of general reasoning progress.
- BIG-Bench Hard — BIG-Bench Hard (BBH) is the 23-task subset of BIG-Bench that earlier models found hardest — multi-step reasoning, logic, and word problems. It's a long-standing standard for tracking reasoning ability across model generations.
- LiveBench Reasoning — LiveBench Reasoning measures logical, multi-step reasoning using questions that are refreshed regularly to limit contamination, so models are unlikely to have trained on the test set.
- SimpleBench — SimpleBench is a set of everyday, common-sense and trick questions that humans answer easily but language models often get wrong. It probes basic reasoning and robustness rather than specialist knowledge.
Coding
Writing, editing, and fixing real code — measured by whether the resulting program actually runs and passes tests.
- SWE-bench Verified — SWE-bench Verified tests whether a model can resolve real GitHub issues from popular open-source Python projects. It is a human-validated subset focused on realistic software-engineering tasks.
- Terminal-Bench — Terminal-Bench measures whether a model can complete real, end-to-end tasks in a command-line environment — running commands, editing files, and chaining steps — making it an agentic test of practical software skill.
- Aider Polyglot — The Aider Polyglot benchmark measures real-world coding across several programming languages: the model edits code to solve Exercism exercises, and is scored on whether the final solution actually runs and passes the tests.
- LiveBench Coding — LiveBench Coding evaluates code generation and completion on fresh, contamination-free programming tasks that are updated regularly.
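As a rough illustration of "runs and passes tests" scoring, the sketch below counts a task as solved only when the candidate solution passes every test case, then reports the share of solved tasks as a percentage. It is not the harness of any benchmark listed above: `passes_all`, `pass_rate`, and the toy `add_*` functions are hypothetical names invented for this example.

```python
from typing import Callable, Sequence


def passes_all(candidate: Callable, cases: Sequence[tuple]) -> bool:
    """True only if the candidate runs cleanly and matches every expected output."""
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failed task
    return True


def pass_rate(results: Sequence[bool]) -> float:
    """Percentage of tasks whose solution passed all of its tests."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(results) / len(results)


# Three toy "model outputs" for the same task: correct, wrong, and crashing.
def add_ok(a, b):
    return a + b


def add_buggy(a, b):
    return a - b


def add_crash(a, b):
    raise RuntimeError("boom")


cases = [((1, 2), 3), ((0, 0), 0)]
results = [passes_all(f, cases) for f in (add_ok, add_buggy, add_crash)]
score = pass_rate(results)
print(results, round(score, 1))
```

Real harnesses add sandboxing, timeouts, and retries, but the headline number is still a fraction of tasks solved.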
Math
Symbolic and quantitative problem-solving, from competition math to multi-step calculation.
- AIME 2024/2025 — AIME (American Invitational Mathematics Examination) is a prestigious high-school competition of hard, integer-answer problems. It's a widely cited yardstick for multi-step mathematical reasoning, here on the 2024–2025 papers.
- MATH Level 5 — MATH Level 5 covers the hardest tier of competition-mathematics problems, testing multi-step symbolic and quantitative reasoning.
- GSM8K — GSM8K is a classic set of grade-school math word problems requiring several arithmetic reasoning steps. It was long a default sanity check for reasoning, but top models now score near the ceiling, so it is most useful for separating smaller models.
- FrontierMath — FrontierMath is a benchmark of exceptionally hard, original research-level mathematics problems created with professional mathematicians. Even the strongest models solve only a small fraction, making it a frontier measure of genuine mathematical ability.
- LiveBench Math — LiveBench Math measures mathematical problem-solving on contamination-free, regularly refreshed questions, including competition-style problems.
Knowledge
Breadth and depth of factual knowledge across many subjects, paired with the reasoning to apply it.
- MMLU — MMLU (Massive Multitask Language Understanding) spans 57 subjects from history to law to medicine as multiple-choice questions. It is a long-standing default for broad knowledge and remains one of the most widely reported general benchmarks.
- Humanity's Last Exam — Humanity's Last Exam (HLE) is a set of extremely difficult, expert-written questions across many fields, designed so that even frontier models score low. It is built to stay hard as models improve, measuring the true knowledge frontier.
- SimpleQA — SimpleQA measures factual accuracy on short, fact-seeking questions with a single correct answer — directly probing how often a model is right versus confidently wrong (hallucination) on simple facts.
- HellaSwag — HellaSwag is a commonsense benchmark that asks a model to pick the most plausible continuation of an everyday situation. The wrong options are adversarially chosen to fool models, testing grounded commonsense understanding.
Reading the scores
Most scores are an accuracy or pass rate (the percentage of items answered correctly); suites like LiveBench use a normalized 0–100 score. Higher is always better here. Each benchmark has its own scale and difficulty, so compare models within a single leaderboard rather than across different ones. On every leaderboard, the score-vs-size chart highlights models that punch above their weight: high quality at a small size is what makes a model worth running locally.
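One way to make "punches above its weight" concrete is a Pareto check on size and score within a single leaderboard: a model is highlighted when no other model is at least as small and at least as good, and strictly better on one of the two. This is an assumed criterion for illustration, not necessarily how llmrun's chart picks its highlights; the function name and the sizes and scores below are placeholders.

```python
def punches_above_weight(models):
    """Return names of models that are not dominated on (size, score).

    models: list of (name, size_in_billions_of_params, score) tuples,
    all taken from ONE leaderboard so the scores share a scale.
    """
    highlighted = []
    for name, size, score in models:
        dominated = any(
            other_size <= size
            and other_score >= score
            and (other_size < size or other_score > score)
            for _, other_size, other_score in models
        )
        if not dominated:
            highlighted.append(name)
    return highlighted


# Placeholder entries for illustration only, not real results.
leaderboard = [
    ("model-a", 7, 60.0),
    ("model-b", 13, 58.0),  # larger than model-a and scores lower
    ("model-c", 70, 75.0),
    ("model-d", 3, 40.0),
]

frontier = punches_above_weight(leaderboard)
print(frontier)
```

Here `model-b` drops out because `model-a` is both smaller and higher scoring, while the other three each offer the best score available at their size or below. Mixing entries from different benchmarks would make the comparison meaningless, which is why the input is restricted to one leaderboard.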
Open vs proprietary models
Open models have downloadable weights — those are llmrun's focus, and each links to its VRAM and GPU requirements. Proprietary models (GPT, Gemini, Claude and friends) are shown dimmed behind the All models toggle on each leaderboard, so you can see where the open field stands against the closed frontier without pretending you can run them at home.
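The toggle behavior described above amounts to a simple filter. The sketch below is a guess at the logic rather than llmrun's actual code: the `visible_models` function, the dictionary keys, and the sample entries are all hypothetical.

```python
def visible_models(models, vram_budget_gb, show_all=False):
    """Open models are always listed; proprietary ones only when show_all is on."""
    rows = []
    for m in models:
        if m["open"]:
            rows.append({
                "name": m["name"],
                "dimmed": False,
                # an open model fits when its requirement is within the budget
                "fits": m["min_vram_gb"] <= vram_budget_gb,
            })
        elif show_all:
            # proprietary models appear for context only and cannot be run locally
            rows.append({"name": m["name"], "dimmed": True, "fits": False})
    return rows


# Placeholder entries for illustration only.
models = [
    {"name": "open-small", "open": True, "min_vram_gb": 8.0},
    {"name": "open-large", "open": True, "min_vram_gb": 48.0},
    {"name": "closed-x", "open": False, "min_vram_gb": None},
]

default_view = visible_models(models, vram_budget_gb=24.0)
all_view = visible_models(models, vram_budget_gb=24.0, show_all=True)
print([r["name"] for r in default_view])
print([r["name"] for r in all_view])
```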
FAQ
- Does llmrun run these benchmarks itself?
- No. llmrun is an aggregator — it collects published scores from public sources (Epoch AI, LiveBench, Aider and others), attributes each to its origin, and presents them in one place alongside the hardware you'd need to run each open model. We never re-run a benchmark or invent a composite score.
- What does a benchmark score actually mean?
- Most scores are an accuracy or pass rate: the fraction of test items the model got right (shown as a percentage), or a normalized 0–100 score for suites like LiveBench. Higher is better. Because each benchmark uses its own scale and methodology, compare models within a single benchmark rather than across different ones.
- Why do some leaderboards show fewer models than others?
- We only list a model on a benchmark when a trustworthy public score exists for it. Newer or niche open models may not have been evaluated on every benchmark yet, so coverage varies by leaderboard.
- What's the difference between open and proprietary models here?
- Open models have downloadable weights you can run on your own hardware — those are llmrun's focus, and each links to a page with VRAM and GPU requirements. Proprietary (closed) models are shown dimmed behind the 'All models' toggle for context: you can see where the open field stands relative to the frontier, even though you can't run them locally.
- How fresh are the scores?
- Each leaderboard shows when its data was last refreshed. Several sources (LiveBench, Aider) publish new results on a rolling basis, and LiveBench also refreshes its questions regularly to limit contamination. We re-import periodically.