LLM Benchmarks

Public benchmark scores for open models you can actually run — every leaderboard mapped to the hardware it takes. Find the strongest model that fits your GPU.

New here? Read about benchmarks to learn what each one measures.

18 leaderboards · 4 skill areas · 258 ranked model scores
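The page does not say how a leaderboard is mapped to hardware. As a rough illustration only, the sketch below assumes a common rule of thumb: weight memory is about parameters × bits per weight / 8, plus a margin for KV cache and activations. The function names and the 20% overhead are assumptions, not llmrun's method; the parameter counts are read off model names that appear on this page.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: int, vram_gb: float,
         overhead_fraction: float = 0.2) -> bool:
    """True if the weights plus an assumed overhead margin fit in vram_gb."""
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= vram_gb


# Hypothetical check against a 24 GB GPU with 4-bit quantized weights.
for name, size_b in [("Qwen2.5 Coder 14B Instruct", 14),
                     ("QwQ 32B", 32),
                     ("Llama 3.1 405B", 405)]:
    print(name, weight_memory_gb(size_b, 4), fits(size_b, 4, vram_gb=24))
```

Real requirements also depend on context length, batch size, and the inference runtime, so treat the result as a first filter rather than a guarantee.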

Reasoning

Multi-step logic and problem-solving — can the model think through a hard problem rather than recall an answer.

GPQA Diamond

Industry standard

GPQA Diamond is a set of extremely hard, graduate-level science questions (physics, chemistry, biology) written by domain experts and filtered so that skilled non-experts with web access still fail. It measures genuine reasoning rather than memorization.

#1 gpt-5.4-pro-2026-03-05_xhigh · OpenAI · 94.6%
#18 GLM 5 · open · 87.8%
169 models · 28 open · via epoch

ARC-AGI

Industry standard

ARC-AGI tests fluid, abstract reasoning on small visual grid puzzles where each task follows a novel rule the model must infer from a few examples. It deliberately resists memorization and is one of the most-watched measures of general reasoning progress.

#1 gemini-3.1-pro-preview · Google DeepMind · 98.0%
#38 MiniMax M2.5 · open · 63.7%
137 models · 7 open · via epoch

BIG-Bench Hard

BIG-Bench Hard (BBH) is the 23-task subset of BIG-Bench that earlier models found hardest — multi-step reasoning, logic, and word problems. It's a long-standing standard for tracking reasoning ability across model generations.

#1 gemini-1.5-pro-001 · Google DeepMind · 89.2%
#4 Llama 3.1 405B · open · 82.9%
50 models · 11 open · via epoch

LiveBench Reasoning

LiveBench Reasoning measures logical, multi-step reasoning using contamination-free questions that are refreshed regularly, so models are unlikely to have trained on the test set.

#1 gpt-5.1-2025-11-13_high · OpenAI · 95.8
#6 QwQ 32B · open · 83.5
52 models · 13 open · via livebench

SimpleBench

SimpleBench is a set of everyday, common-sense and trick questions that humans answer easily but language models often get wrong. It probes basic reasoning and robustness rather than specialist knowledge.

#1 gemini-3.1-pro-preview · Google DeepMind · 79.6%
#20 GLM 5.1 · open · 58.7%
76 models · 10 open · via epoch

Coding

Writing, editing, and fixing real code — measured by whether the resulting program actually runs and passes tests.

Math

Symbolic and quantitative problem-solving, from competition math to multi-step calculation.

AIME 2024/2025

Industry standard

AIME (American Invitational Mathematics Examination) is a prestigious high-school competition of hard, integer-answer problems. It's a widely cited yardstick for multi-step mathematical reasoning, scored here on the 2024 and 2025 papers.

#1 gpt-5.5-pro-pre-release_xhigh · OpenAI · 100.0%
#14 GLM 5.1 · open · 92.2%
141 models · 22 open · via epoch

MATH Level 5

Industry standard

MATH Level 5 covers the hardest tier of competition-mathematics problems, testing multi-step symbolic and quantitative reasoning.

#1 gpt-5-2025-08-07_high · OpenAI · 98.1%
#9 DeepSeek R1 0528 · open · 96.6%
108 models · 23 open · via epoch

GSM8K

Industry standard

GSM8K is a classic set of grade-school math word problems requiring several arithmetic reasoning steps. It was long a default sanity check for reasoning, but top models now score near the ceiling, so it is most useful for separating smaller models.

#1 DeepSeek-Coder-V2-Instruct · DeepSeek · 94.5%
#2 Qwen2.5 Coder 14B Instruct · open · 94.2%
93 models · 27 open · via epoch

FrontierMath

Industry standard

FrontierMath is a benchmark of exceptionally hard, original research-level mathematics problems created with professional mathematicians. Even the top-ranked model here solves only about half of them, making it a frontier measure of genuine mathematical ability.

#1 gpt-5.5-pro-pre-release_high · OpenAI · 52.4%
#20 GLM 5.1 · open · 33.5%
100 models · 3 open · via epoch

LiveBench Math

LiveBench Math measures mathematical problem-solving on contamination-free, regularly refreshed questions, including competition-style problems.

#1 gpt-5.1-2025-11-13_high · OpenAI · 94.5
#3 DeepSeek R1 · open · 80.7
52 models · 13 open · via livebench

Knowledge

Breadth and depth of factual knowledge across many subjects, paired with the reasoning to apply it.

Scores are aggregated from public benchmark sources and attributed on each leaderboard. llmrun does not run these benchmarks.
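To make the aggregation and attribution concrete, here is a minimal sketch of how rows like the ones on this page could be combined while keeping their source. The `Entry` record and `open_gap` function are hypothetical names, not llmrun's data model; the scores are copied from three leaderboards above, and `is_open` mirrors the "open" tag shown on each row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    benchmark: str
    rank: int
    model: str
    score: float
    is_open: bool
    source: str  # where the score was published, kept for attribution


ENTRIES = [
    Entry("GPQA Diamond", 1, "gpt-5.4-pro-2026-03-05_xhigh", 94.6, False, "epoch"),
    Entry("GPQA Diamond", 18, "GLM 5", 87.8, True, "epoch"),
    Entry("FrontierMath", 1, "gpt-5.5-pro-pre-release_high", 52.4, False, "epoch"),
    Entry("FrontierMath", 20, "GLM 5.1", 33.5, True, "epoch"),
    Entry("LiveBench Math", 1, "gpt-5.1-2025-11-13_high", 94.5, False, "livebench"),
    Entry("LiveBench Math", 3, "DeepSeek R1", 80.7, True, "livebench"),
]


def open_gap(entries):
    """Per benchmark: (best open model, points behind the top score, source)."""
    top, best_open = {}, {}
    for e in entries:
        if e.benchmark not in top or e.score > top[e.benchmark].score:
            top[e.benchmark] = e
        if e.is_open and (e.benchmark not in best_open
                          or e.score > best_open[e.benchmark].score):
            best_open[e.benchmark] = e
    return {b: (o.model, round(top[b].score - o.score, 1), o.source)
            for b, o in best_open.items()}


for benchmark, (model, gap, source) in open_gap(ENTRIES).items():
    print(f"{benchmark}: {model} trails #1 by {gap} points (via {source})")
```

Carrying `source` on every record means each derived figure can still be traced back to the leaderboard it came from.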