FrontierMath Leaderboard
FrontierMath is a benchmark of exceptionally hard, original, research-level mathematics problems created with professional mathematicians. Even the strongest model listed here solves only about half of them (52.4%), and most models solve far fewer, so the benchmark still separates models at the frontier of mathematical ability.
Source: epoch · 3 open models ranked + 97 proprietary · Data through May 2026
All models ranked on FrontierMath
Proprietary (closed) models are marked "proprietary" in the table. You can't run them locally, but they show where the open field stands.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-pro-pre-release_high · proprietary | 52.4% |
| 2 | gpt-5.5-pre-release_xhigh · proprietary | 51.7% |
| 3 | gpt-5.5-pro-pre-release_xhigh · proprietary | 51.0% |
| 4 | gpt-5.4-pro-2026-03-05_xhigh · proprietary | 50.0% |
| 5 | gpt-5.4-2026-03-05_xhigh · proprietary | 47.6% |
| 6 | claude-opus-4-7_xhigh · proprietary | 43.8% |
| 7 | claude-opus-4-6_max · proprietary | 40.7% |
| 8 | gpt-5.2-2025-12-11_xhigh · proprietary | 40.7% |
| 9 | gpt-5.2-2025-12-11_high · proprietary | 40.3% |
| 10 | claude-opus-4-6_32K · proprietary | 40.0% |
| 11 | claude-opus-4-6_64K · proprietary | 39.7% |
| 12 | muse-spark · proprietary | 39.0% |
| 13 | gemini-3.5-flash_high · proprietary | 39.0% |
| 14 | kimi-k2.6 · proprietary | 39.0% |
| 15 | claude-opus-4-6 · proprietary | 38.3% |
| 16 | gemini-3-pro-preview · proprietary | 37.6% |
| 17 | gemini-3.1-pro-preview · proprietary | 36.9% |
| 18 | gpt-5.2-2025-12-11_medium · proprietary | 36.9% |
| 19 | gemini-3-flash-preview · proprietary | 35.6% |
| 20 | GLM 5.1 · 753.9B | 33.5% |
| 21 | gpt-5-2025-08-07_high · proprietary | 32.4% |
| 22 | claude-sonnet-4-6_16K · proprietary | 32.4% |
| 23 | gpt-5.1-2025-11-13_high · proprietary | 31.0% |
| 24 | gemini-2.5-deep-think-2025-08-01-webapp · proprietary | 29.0% |
| 25 | gpt-5.4-mini-2026-03-17_high · proprietary | 28.3% |
| 26 | fireworks/kimi-k2p5 · proprietary | 27.9% |
| 27 | gpt-5-2025-08-07_medium · proprietary | 27.2% |
| 28 | gpt-5-mini-2025-08-07_high · proprietary | 27.2% |
| 29 | gpt-5.1-2025-11-13_medium · proprietary | 26.9% |
| 30 | gpt-5.2-2025-12-11_low · proprietary | 26.6% |
| 31 | qwen3.6-plus · proprietary | 26.2% |
| 32 | gpt-5.4-nano-2026-03-17_high · proprietary | 25.9% |
| 33 | o4-mini-2025-04-16_high · proprietary | 24.8% |
| 34 | qwen3.6-max-preview · proprietary | 23.1% |
| 35 | fireworks/deepseek-v3p2 · proprietary | 22.1% |
| 36 | moonshotai/Kimi-K2-Thinking · proprietary | 21.4% |
| 37 | qwen3.5-plus · proprietary | 21.0% |
| 38 | claude-opus-4-5-20251101 · proprietary | 20.7% |
| 39 | claude-opus-4-5-20251101_32K · proprietary | 20.7% |
| 40 | claude-opus-4-5-20251101_16K · proprietary | 20.3% |
| 41 | gpt-5-mini-2025-08-07_medium · proprietary | 20.3% |
| 42 | grok-4-0709 · proprietary | 19.7% |
| 43 | o4-mini-2025-04-16_medium · proprietary | 19.0% |
| 44 | o3-2025-04-16_high · proprietary | 18.7% |
| 45 | gpt-5.1-2025-11-13_low · proprietary | 17.3% |
| 46 | o3-2025-04-16_medium · proprietary | 16.9% |
| 47 | GLM 5 · 753.9B | 16.4% |
| 48 | claude-sonnet-4-5-20250929_32K · proprietary | 15.2% |
| 49 | gemini-2.5-pro · proprietary | 14.1% |
| 50 | claude-sonnet-4-5-20250929_59K · proprietary | 13.5% |
| 51 | o3-mini-2025-01-31_high · proprietary | 12.4% |
| 52 | o4-mini-2025-04-16_low · proprietary | 10.7% |
| 53 | gemini-2.5-pro-preview-06-05 · proprietary | 10.3% |
| 54 | qwen3.6-flash · proprietary | 10.3% |
| 55 | o3-2025-04-16_low · proprietary | 9.7% |
| 56 | claude-sonnet-4-5-20250929 · proprietary | 9.3% |
| 57 | o1-2024-12-17_high · proprietary | 9.3% |
| 58 | Qwen/Qwen3-235B-A22B-Thinking-2507 · proprietary | 8.5% |
| 59 | gpt-5-nano-2025-08-07_high · proprietary | 8.3% |
| 60 | o3-mini-2025-01-31_medium · proprietary | 8.1% |
| 61 | claude-opus-4-1-20250805_27K · proprietary | 7.2% |
| 62 | gpt-5-nano-2025-08-07_medium · proprietary | 7.2% |
| 63 | qwen3.5-flash · proprietary | 6.2% |
| 64 | claude-haiku-4-5-20251001_32K · proprietary | 5.9% |
| 65 | claude-opus-4-1-20250805 · proprietary | 5.9% |
| 66 | grok-3-mini-beta_high · proprietary | 5.9% |
| 67 | gpt-4.1-2025-04-14 · proprietary | 5.5% |
| 68 | gemini-2.5-flash · proprietary | 4.8% |
| 69 | claude-opus-4-20250514 · proprietary | 4.5% |
| 70 | gpt-4.1-mini-2025-04-14 · proprietary | 4.5% |
| 71 | claude-3-7-sonnet-20250219_16K · proprietary | 4.1% |
| 72 | claude-haiku-4-5-20251001 · proprietary | 4.1% |
| 73 | claude-opus-4-20250514_27K · proprietary | 4.1% |
| 74 | claude-sonnet-4-20250514 · proprietary | 4.1% |
| 75 | zai-org/GLM-4.6 · proprietary | 3.8% |
| 76 | grok-3-beta · proprietary | 3.8% |
| 77 | claude-3-7-sonnet-20250219_32K · proprietary | 3.5% |
| 78 | claude-3-7-sonnet-20250219 · proprietary | 3.1% |
| 79 | claude-3-7-sonnet-20250219_64K · proprietary | 3.1% |
| 80 | grok-3-mini-beta_low · proprietary | 2.8% |
| 81 | zai-org/GLM-4.7 · proprietary | 2.4% |
| 82 | gpt-5.1-2025-11-13_none · proprietary | 2.1% |
| 83 | claude-3-5-sonnet-20241022 · proprietary | 2.1% |
| 84 | DeepSeek-V3 · proprietary | 1.7% |
| 85 | gemini-2.0-flash-001 · proprietary | 1.7% |
| 86 | o1-mini-2024-09-12_medium · proprietary | 1.7% |
| 87 | qwen-plus-2025-04-28 · proprietary | 1.7% |
| 88 | o1-mini-2024-09-12_high · proprietary | 1.4% |
| 89 | claude-3-5-sonnet-20240620 · proprietary | 1.0% |
| 90 | gpt-4.1-nano-2025-04-14 · proprietary | 1.0% |
| 91 | qwen-max-2025-01-25 · proprietary | 1.0% |
| 92 | grok-2-1212 · proprietary | 0.7% |
| 93 | Llama-4-Maverick-17B-128E-Instruct-FP8 · proprietary | 0.7% |
| 94 | mistral-medium-2505 · proprietary | 0.4% |
| 95 | claude-3-5-haiku-20241022 · proprietary | 0.3% |
| 96 | gpt-4o-2024-08-06 · proprietary | 0.3% |
| 97 | gpt-4o-2024-11-20 · proprietary | 0.3% |
| 98 | mistral-large-2411 · proprietary | 0.3% |
| 99 | gemini-1.5-flash-002 · proprietary | 0.0% |
| 100 | Llama 4 Scout 17B 16E Instruct · 108.6B | 0.0% |
FrontierMath: frequently asked questions
- What is the best open LLM on FrontierMath?
- GLM 5.1 is the top open model on FrontierMath, scoring 33.5%. Among all models tested, including proprietary ones, it ranks #20.
- Can open models match proprietary models on FrontierMath?
- Not yet. The strongest proprietary model (gpt-5.5-pro-pre-release_high) scores 52.4%, which is 18.9 percentage points ahead of the best open model (GLM 5.1) at 33.5%. The open model, however, is one you can run yourself.
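
The two answers above come down to a small piece of arithmetic over the table. A minimal Python sketch of it follows, assuming each row has the form `| rank | name · tag | score% |` and that a model counts as open whenever its tag is not `proprietary`. The `parse_row` helper and the four sample rows (copied from the table) are illustrative only, not part of any llmrun or epoch tooling.

```python
# Sample rows copied from the leaderboard table; the parsing rules are assumptions.
ROWS = """\
| 1 | gpt-5.5-pro-pre-release_high · proprietary | 52.4% |
| 20 | GLM 5.1 · 753.9B | 33.5% |
| 47 | GLM 5 · 753.9B | 16.4% |
| 100 | Llama 4 Scout 17B 16E Instruct · 108.6B | 0.0% |
"""


def parse_row(line):
    """Split one table row into (rank, name, tag, score_percent)."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    rank, model, score = cells
    # The tag is whatever follows the last " · " separator in the model cell.
    name, _, tag = model.rpartition(" · ")
    return int(rank), name, tag, float(score.rstrip("%"))


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Assumption: any tag other than "proprietary" marks an open model.
open_rows = [r for r in rows if r[2] != "proprietary"]
closed_rows = [r for r in rows if r[2] == "proprietary"]

best_open = max(open_rows, key=lambda r: r[3])
best_closed = max(closed_rows, key=lambda r: r[3])

# Gap in percentage points, rounded to the table's one-decimal precision.
gap = round(best_closed[3] - best_open[3], 1)

print(f"Best open model: {best_open[1]} (rank #{best_open[0]}, {best_open[3]}%)")
print(f"Best proprietary model: {best_closed[1]} ({best_closed[3]}%)")
print(f"Gap: {gap} percentage points")
```

Feeding all 100 rows through the same helper would give the same two leaders, since the sample already contains the top proprietary row and every open row.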
Scores are aggregated from epoch. llmrun does not run this benchmark itself; see the source for methodology, or the "about benchmarks" page for what it measures.