AIME 2025
American Invitational Mathematics Examination 2025. Integer-answer math problems from a high school olympiad.
Category: math · Metric: accuracy · Source: artofproblemsolving.com
Leaderboard
| Rank | Model | Provider | Score | Measured | Source |
|---|---|---|---|---|---|
| 1 | GPT-5 | OpenAI | 94.6 | 2025-08-07 | ↗ |
| 2 | Grok 3 | xAI | 93.3 | 2025-02-17 | ↗ |
| 3 | o3 | OpenAI | 88.9 | 2025-04-16 | ↗ |
| 4 | Gemini 2.5 Pro | Google | 86.7 | 2025-03-25 | ↗ |
| 5 | DeepSeek-R1 | DeepSeek | 79.8 | 2025-01-20 | ↗ |
What this benchmark measures, in detail
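AIME 2025 draws its problems from the 2025 American Invitational Mathematics Examination, a competition for high school students. Every answer is an integer from 000 to 999, so a response can be graded by exact match, and accuracy is the percentage of problems answered correctly.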
Different benchmarks measure different things. A model that excels on AIME 2025 may underperform on real-world workloads if the benchmark's distribution doesn't match your data. Use benchmark scores as a triage signal — narrow to a shortlist — then evaluate on your actual workload before committing.
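Because answers are integers, the accuracy metric named above can be sketched in a few lines. This is a minimal illustration, not the grader any provider actually used: the function names `normalize_answer` and `accuracy` are hypothetical, the sample answers are made up, and the sketch assumes the model's final answer has already been extracted as a string.

```python
from typing import List, Optional


def normalize_answer(raw: str) -> Optional[int]:
    """Parse an extracted answer as an integer in [0, 999]; return None if invalid."""
    text = raw.strip()
    # Accept only plain ASCII digits, so "070" parses but "7.0" or "seven" does not.
    if not (text.isascii() and text.isdigit()):
        return None
    value = int(text)
    return value if 0 <= value <= 999 else None


def accuracy(predictions: List[str], references: List[int]) -> float:
    """Return exact-match accuracy as a percentage."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    correct = sum(normalize_answer(p) == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Made-up answers for illustration only (not real AIME problems).
sample_predictions = ["012", "345", "abc", "7"]
sample_references = [12, 345, 7, 7]
sample_score = accuracy(sample_predictions, sample_references)
```

An unparseable answer simply counts as wrong here. A real harness would also need an answer-extraction step, which is where implementations tend to differ.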
Methodology notes
Scores in the leaderboard are taken from the model's release announcement or model card, cited via the "Source" link. Where two sources disagree (which happens often for SWE-bench and IFEval), the linked primary source wins. Reproducibility varies by run for some benchmarks (notably any graded by an LLM), so treat a score as accurate to within ±2-3 points unless the source is a peer-reviewed result.
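One way to apply the ±2-3 point caveat when building a shortlist is to treat scores that differ by no more than a noise margin as tied. The sketch below is a simple heuristic, not part of this page's methodology. The `NOISE_MARGIN` constant and the `clearly_ahead` helper are hypothetical names, and the 3.0 default is an assumption taken from the upper end of the stated range.

```python
# Assumption: use the upper end of the +/-2-3 point range as the margin.
NOISE_MARGIN = 3.0


def clearly_ahead(score_a: float, score_b: float, margin: float = NOISE_MARGIN) -> bool:
    """Return True only if score_a exceeds score_b by more than the noise margin."""
    return score_a - score_b > margin


# Scores copied from the leaderboard above.
gpt5, grok3, o3 = 94.6, 93.3, 88.9

gpt5_vs_grok3 = clearly_ahead(gpt5, grok3)  # gap of about 1.3 points, inside the margin
gpt5_vs_o3 = clearly_ahead(gpt5, o3)        # gap of about 5.7 points, outside the margin
```

Under this heuristic the top two entries are effectively tied, so both would stay on a shortlist for evaluation on your own workload.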