AIME 2025: Problems from the American Invitational Mathematics Examination

Mathematics

A benchmark for evaluating a model’s ability to solve challenging mathematics problems from the 2025 AIME, a prestigious high school mathematics competition.

Overview

The AIME 2025 benchmark is based on the 2025 edition of the American Invitational Mathematics Examination. Every AIME answer is an integer between 0 and 999, which makes automated scoring straightforward.

This benchmark tests a model’s ability to solve challenging mathematics problems by generating step-by-step solutions and providing the correct final answer.

The benchmark uses the math-ai/AIME_2025 dataset from Hugging Face.
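As a minimal sketch, the problems can be loaded with the Hugging Face datasets library; the split name used here is an assumption, so check the dataset card for the actual splits.

# Minimal sketch: load the benchmark problems with the Hugging Face
# `datasets` library.
from datasets import load_dataset

# The split name "train" is an assumption; check the dataset card.
dataset = load_dataset("math-ai/AIME_2025", split="train")

print(len(dataset))           # 30 problems
print(dataset[0]["problem"])  # problem statement
print(dataset[0]["answer"])   # expected answer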

Dataset

The dataset contains 30 problems, the 15 from each of the AIME 2025 I and AIME 2025 II exams. Each sample includes:

  • A unique ID for the problem
  • The problem statement
  • The correct answer

Each record is structured as follows:

{
  "id": "0",
  "problem": "Problem statement...",
  "answer": "Numerical answer"
}

The model is instructed to solve the problem step by step and provide the final answer in the format “ANSWER: $ANSWER”.
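As an illustration only (the benchmark’s exact wording is not reproduced here), a prompt template implementing this instruction might look like:

# Illustrative prompt template; the actual wording used by the
# benchmark may differ.
PROMPT_TEMPLATE = (
    "Solve the following problem step by step. "
    "On the last line of your response, write your final answer "
    "in the format: ANSWER: $ANSWER\n\n"
    "{problem}"
)

def build_prompt(problem: str) -> str:
    # Substitute the problem statement into the template.
    return PROMPT_TEMPLATE.format(problem=problem)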

Scoring

The benchmark uses exact-match scoring: the final answer extracted from the model’s output must exactly equal the expected answer.
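A rough sketch of such a scorer is shown below; the extraction regex and the normalization (stripping whitespace and leading zeros, valid because every AIME answer is an integer from 0 to 999) are assumptions rather than the benchmark’s actual implementation.

import re

def exact_match(completion: str, target: str) -> bool:
    # Take the answer after the last "ANSWER:" marker in the output.
    matches = re.findall(r"ANSWER:\s*(\S+)", completion)
    if not matches:
        return False
    # Assumed normalization: strip whitespace and leading zeros,
    # since AIME answers are integers from 0 to 999.
    def norm(s: str) -> str:
        return s.strip().lstrip("0") or "0"
    return norm(matches[-1]) == norm(target)

# Example: exact_match("... ANSWER: 070", "70") returns True.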

Results

Model                Accuracy  Std Error
openai/gpt-4.1-nano  0.2667    0.0821
openai/gpt-4o-mini   0.1000    0.0557