LLM Benchmarks
Public benchmark scores for open models you can actually run — every leaderboard mapped to the hardware it takes. Find the strongest model that fits your GPU.
New here? Read about benchmarks to learn what each one measures.
- 18 leaderboards
- 4 skill areas
- 258 ranked model scores
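As a rough illustration of "find the strongest model that fits your GPU", the sketch below filters a list of scored models by a memory budget and returns the top scorer. The model names, scores, and memory figures are invented placeholders, not leaderboard results, and the `vram_gb` field is only an assumed way of representing a hardware requirement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    model: str      # hypothetical model name
    score: float    # benchmark score, higher is better
    vram_gb: float  # assumed memory needed to run the model


def strongest_that_fits(entries: list[Entry], budget_gb: float) -> Optional[Entry]:
    """Return the highest-scoring entry whose memory need is within budget."""
    fitting = [e for e in entries if e.vram_gb <= budget_gb]
    # default=None covers the case where no model fits the budget.
    return max(fitting, key=lambda e: e.score, default=None)


# Placeholder data for illustration only.
entries = [
    Entry("model-a-70b", 61.0, 42.0),
    Entry("model-b-32b", 55.5, 20.0),
    Entry("model-c-8b", 41.2, 6.0),
]

best = strongest_that_fits(entries, budget_gb=24.0)
print(best.model if best else "nothing fits")
```

With a 24 GB budget the largest placeholder model is filtered out, so the choice falls to the best of the remaining two.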
Reasoning
Multi-step logic and problem-solving — can the model think through a hard problem rather than recall an answer.
GPQA Diamond
Industry standard. GPQA Diamond is a set of extremely hard, graduate-level science questions (physics, chemistry, biology) written by domain experts and filtered so that skilled non-experts with web access still fail. It measures genuine reasoning rather than memorization.
ARC-AGI
Industry standard. ARC-AGI tests fluid, abstract reasoning on small visual grid puzzles where each task follows a novel rule the model must infer from a few examples. It deliberately resists memorization and is one of the most-watched measures of general reasoning progress.
BIG-Bench Hard
BIG-Bench Hard (BBH) is the 23-task subset of BIG-Bench that earlier models found hardest — multi-step reasoning, logic, and word problems. It's a long-standing standard for tracking reasoning ability across model generations.
LiveBench Reasoning
LiveBench Reasoning measures logical, multi-step reasoning using contamination-free questions that are refreshed regularly, so models should not have seen the test questions during training.
SimpleBench
SimpleBench is a set of everyday, common-sense and trick questions that humans answer easily but language models often get wrong. It probes basic reasoning and robustness rather than specialist knowledge.
Coding
Writing, editing, and fixing real code — measured by whether the resulting program actually runs and passes tests.
SWE-bench Verified
Industry standard. SWE-bench Verified tests whether a model can resolve real GitHub issues from popular open-source Python projects. It is a human-validated subset focused on realistic software-engineering tasks.
Terminal-Bench
Terminal-Bench measures whether a model can complete real, end-to-end tasks in a command-line environment — running commands, editing files, and chaining steps — making it an agentic test of practical software skill.
Aider Polyglot
The Aider Polyglot benchmark measures real-world coding across several programming languages: the model edits code to solve Exercism exercises, and is scored on whether the final solution actually runs and passes the tests.
LiveBench Coding
LiveBench Coding evaluates code generation and completion on fresh, contamination-free programming tasks that are updated regularly.
Math
Symbolic and quantitative problem-solving, from competition math to multi-step calculation.
AIME 2024/2025
Industry standard. AIME (American Invitational Mathematics Examination) is a prestigious high-school competition of hard, integer-answer problems. It's a widely cited yardstick for multi-step mathematical reasoning, scored here on the 2024 and 2025 papers.
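Because every AIME answer is an integer from 0 to 999, scoring can be reduced to an exact match. The sketch below shows one way that could look; the function name and the parsing rule are assumptions for illustration, and the leaderboard sources may extract and normalize answers differently.

```python
def aime_accuracy(predicted: list[str], reference: list[int]) -> float:
    """Fraction of problems where the parsed answer equals the reference integer."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    correct = 0
    for raw, ref in zip(predicted, reference):
        text = raw.strip()
        # Accept only plain ASCII digit strings; leading zeros such as "073" are fine.
        if text.isascii() and text.isdigit():
            value = int(text)
            # AIME answers lie in 0..999; anything outside counts as wrong.
            if 0 <= value <= 999 and value == ref:
                correct += 1
    return correct / len(reference) if reference else 0.0


# Placeholder answers, not real model outputs.
print(aime_accuracy(["073", "12", "n/a"], [73, 15, 204]))
```

Here only the first answer matches its reference, so the placeholder run scores one out of three.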
MATH Level 5
Industry standard. MATH Level 5 covers the hardest tier of competition-mathematics problems, testing multi-step symbolic and quantitative reasoning.
GSM8K
Industry standard. GSM8K is a classic set of grade-school math word problems requiring several arithmetic reasoning steps. It was long a default sanity check for reasoning, but top models now score near the ceiling, so it is most useful for separating smaller models.
FrontierMath
Industry standard. FrontierMath is a benchmark of exceptionally hard, original research-level mathematics problems created with professional mathematicians. Even the strongest models solve only a small fraction, making it a frontier measure of genuine mathematical ability.
LiveBench Math
LiveBench Math measures mathematical problem-solving on contamination-free, regularly refreshed questions, including competition-style problems.
Knowledge
Breadth and depth of factual knowledge across many subjects, paired with the reasoning to apply it.
MMLU
Industry standard. MMLU (Massive Multitask Language Understanding) spans 57 subjects from history to law to medicine as multiple-choice questions. It is the long-standing default for broad knowledge and remains the most widely reported general benchmark.
Humanity's Last Exam
Industry standard. Humanity's Last Exam (HLE) is a set of extremely difficult, expert-written questions across many fields, designed so that even frontier models score low. It is built to stay hard as models improve, measuring the true knowledge frontier.
SimpleQA
SimpleQA measures factual accuracy on short, fact-seeking questions with a single correct answer — directly probing how often a model is right versus confidently wrong (hallucination) on simple facts.
HellaSwag
HellaSwag is a commonsense benchmark that asks a model to pick the most plausible continuation of an everyday situation. The wrong options are adversarially chosen to fool models, testing grounded commonsense understanding.
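Picking the most plausible continuation can be sketched as an argmax over per-option scores. In the example below the scores stand in for whatever plausibility measure a harness assigns to each candidate ending (for instance a log-likelihood); the function names and the numbers are hypothetical and do not reflect any particular evaluation tool.

```python
def pick_continuation(option_scores: list[float]) -> int:
    """Return the index of the highest-scoring candidate continuation."""
    if not option_scores:
        raise ValueError("at least one option is required")
    return max(range(len(option_scores)), key=option_scores.__getitem__)


def multiple_choice_accuracy(all_scores: list[list[float]], gold: list[int]) -> float:
    """Fraction of items where the top-scoring option is the labeled answer."""
    if len(all_scores) != len(gold):
        raise ValueError("all_scores and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(pick_continuation(s) == g for s, g in zip(all_scores, gold))
    return hits / len(gold)


# Placeholder scores for two items with four candidate endings each.
scores = [
    [-12.4, -9.1, -15.0, -10.3],
    [-8.2, -8.9, -7.5, -11.0],
]
print(multiple_choice_accuracy(scores, gold=[1, 0]))
```

In this placeholder run the first item is answered correctly and the second is not, since the model's top-scoring ending differs from the labeled one.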
Scores are aggregated from public benchmark sources and attributed on each leaderboard. llmrun does not run these benchmarks.