AIME 2025: Problems from the American Invitational Mathematics Examination
A benchmark for evaluating an AI model’s ability to solve challenging mathematics problems from the 2025 AIME, a prestigious high school mathematics competition.
Overview
The AIME 2025 benchmark is based on problems from the American Invitational Mathematics Examination, a prestigious high school mathematics competition.
It tests a model’s ability to solve challenging mathematics problems by generating a step-by-step solution and reporting the correct final answer.
The benchmark uses the math-ai/AIME_2025 dataset from Hugging Face.
Dataset
There are 30 samples in the dataset. Each sample includes:
- A unique ID for the problem
- The problem statement
- The correct answer (an integer from 0 to 999, per AIME convention)
Each problem record has the following structure:
```json
{
  "id": "0",
  "problem": "Problem statement...",
  "answer": "Numerical answer"
}
```
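As a quick sanity check, the dataset can be loaded with the Hugging Face `datasets` library. This is a minimal sketch; the split name `test` is an assumption, so consult the dataset card if it differs.

```python
# Minimal sketch: load the AIME 2025 problems from Hugging Face.
# The split name "test" is an assumption; check the dataset card if it differs.
from datasets import load_dataset

dataset = load_dataset("math-ai/AIME_2025", split="test")
print(len(dataset))  # expected: 30
sample = dataset[0]
print(sample["id"], sample["answer"])
print(sample["problem"][:100])  # first 100 characters of the statement
```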
The model is instructed to solve the problem step by step and provide the final answer in the format “ANSWER: $ANSWER”.
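One common way to recover the final answer from a completion in this format is a regular expression that keeps the last `ANSWER:` marker. The helper below is a hypothetical sketch, not the benchmark’s actual extraction code.

```python
import re

def extract_answer(completion: str) -> str | None:
    """Hypothetical helper: return the value after the last 'ANSWER:' marker."""
    matches = re.findall(r"ANSWER:\s*(.+)", completion)
    return matches[-1].strip() if matches else None
```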
Scoring
The benchmark uses exact-match scoring: the final answer extracted from the model’s completion must match the expected answer exactly.
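In its simplest form, exact match is a string comparison after trimming whitespace, as in the sketch below; any additional normalization (e.g. stripping leading zeros) is an assumption here, not documented behavior.

```python
def exact_match(predicted: str | None, expected: str) -> bool:
    """Sketch of exact-match scoring: compare trimmed answer strings."""
    return predicted is not None and predicted.strip() == expected.strip()
```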
Results
| Model | Accuracy | Std Error |
|---|---|---|
| openai/gpt-4.1-nano | 0.2667 | 0.0821 |
| openai/gpt-4o-mini | 0.1000 | 0.0557 |
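For reference, the reported standard errors are consistent with the sample standard error of a 0/1 score over n = 30 problems, sqrt(p(1 - p) / (n - 1)); the snippet below reproduces them under that assumption (the benchmark’s exact estimator is not documented here).

```python
import math

def accuracy_stderr(p: float, n: int = 30) -> float:
    # Sample standard error of a Bernoulli mean (ddof=1 variance estimate).
    return math.sqrt(p * (1 - p) / (n - 1))

print(round(accuracy_stderr(0.2667), 4))  # 0.0821
print(round(accuracy_stderr(0.1000), 4))  # 0.0557
```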