AIME 2025: Problems from the American Invitational Mathematics Examination
A benchmark for evaluating an AI model’s ability to solve challenging mathematics problems from the 2025 AIME, a prestigious high school mathematics competition.
Overview
The AIME 2025 benchmark is based on problems from the American Invitational Mathematics Examination, a prestigious high school mathematics competition.
It tests a model’s ability to solve challenging mathematics problems by generating a step-by-step solution and reporting the correct final answer.
The benchmark uses the math-ai/AIME_2025 dataset from Hugging Face.
Dataset
There are 30 samples in the dataset. Each sample includes:
- A unique ID for the problem
- The problem statement
- The correct answer (an integer from 0 to 999, per AIME convention)
Each problem record has the following structure:
```json
{
  "id": "0",
  "problem": "Problem statement...",
  "answer": "Numerical answer"
}
```
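As a quick sanity check, the dataset can be loaded with the Hugging Face `datasets` library. This is a minimal sketch; the split name `test` is an assumption, so consult the dataset card if it differs.

```python
# Minimal sketch: load the AIME 2025 problems from Hugging Face.
# The split name "test" is an assumption; check the dataset card if it differs.
from datasets import load_dataset

dataset = load_dataset("math-ai/AIME_2025", split="test")
print(len(dataset))  # expected: 30
sample = dataset[0]
print(sample["id"], sample["answer"])
print(sample["problem"][:100])  # first 100 characters of the statement
```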
The model is instructed to solve the problem step by step and provide the final answer in the format “ANSWER: $ANSWER”.
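One common way to recover the final answer from a completion in this format is a regular expression that keeps the last `ANSWER:` marker. The helper below is a hypothetical sketch, not the benchmark’s actual extraction code.

```python
import re

def extract_answer(completion: str) -> str | None:
    """Hypothetical helper: return the value after the last 'ANSWER:' marker."""
    matches = re.findall(r"ANSWER:\s*(.+)", completion)
    return matches[-1].strip() if matches else None
```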
Scoring
The benchmark uses exact-match scoring: the final answer extracted from the model’s completion must match the expected answer exactly.
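In its simplest form, exact match is a string comparison after trimming whitespace, as in the sketch below; any additional normalization (e.g. stripping leading zeros) is an assumption here, not documented behavior.

```python
def exact_match(predicted: str | None, expected: str) -> bool:
    """Sketch of exact-match scoring: compare trimmed answer strings."""
    return predicted is not None and predicted.strip() == expected.strip()
```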
Results
| Model | Accuracy | Std Error |
|---|---|---|
| openai/gpt-4.1-nano | 0.2667 | 0.0821 |
| openai/gpt-4o-mini | 0.1000 | 0.0557 |
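For reference, the reported standard errors are consistent with the sample standard error of a 0/1 score over n = 30 problems, sqrt(p(1 - p) / (n - 1)); the snippet below reproduces them under that assumption (the benchmark’s exact estimator is not documented here).

```python
import math

def accuracy_stderr(p: float, n: int = 30) -> float:
    # Sample standard error of a Bernoulli mean (ddof=1 variance estimate).
    return math.sqrt(p * (1 - p) / (n - 1))

print(round(accuracy_stderr(0.2667), 4))  # 0.0821
print(round(accuracy_stderr(0.1000), 4))  # 0.0557
```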