AIME 2026: Problems from the American Invitational Mathematics Examination

Mathematics

inspect-evals

A benchmark for evaluating AI's ability to solve challenging mathematics problems from the 2026 AIME, a prestigious high school mathematics competition.

Contributed By

@joeda

Code

src/inspect_evals/aime2026

Paper

https://huggingface.co/datasets/math-ai/aime26

Overview

Evaluation of AI performance on the American Invitational Mathematics Examination (AIME) 2026. AIME is a prestigious 15-problem, 3-hour competition for high school students who qualify through the AMC 10/12. Problems require ingenuity and mathematical insight across algebra, combinatorics, geometry, and number theory, with integer answers in the range 0–999.

Dataset

The dataset is sourced from math-ai/aime26 on HuggingFace and contains 30 problems (15 from AIME I and 15 from AIME II). Each problem has a corresponding integer answer in the range 0–999.

Scoring

Each problem is scored as correct/incorrect using numeric matching. The scorer strips LaTeX \boxed{} formatting before comparison. Results are reported as accuracy with standard error.
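The scoring rule above can be sketched as follows. This is a minimal illustration, not the scorer shipped in inspect_evals: the helper names `extract_answer` and `is_correct` are hypothetical, and falling back to the last integer in the completion when no \boxed{} is present is an assumption. Empty or unparseable completions are treated as incorrect.

```python
import re
from typing import Optional


def extract_answer(completion: str) -> Optional[str]:
    """Return the last candidate answer in a completion, or None if there is none.

    Hypothetical helper: prefers the last \\boxed{...} group, and otherwise
    falls back to the last integer in the text (an assumed fallback).
    """
    text = completion.strip()
    if not text:
        # Empty completion, e.g. the model ran out of tokens before answering.
        return None
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+", text)
    return numbers[-1] if numbers else None


def is_correct(completion: str, target: str) -> bool:
    """Compare the extracted answer with the target as integers."""
    candidate = extract_answer(completion)
    if candidate is None:
        return False
    try:
        # Integer comparison, so "042" and "42" count as the same answer.
        return int(candidate) == int(target)
    except ValueError:
        # Non-numeric content inside \boxed{} is scored as incorrect.
        return False
```

Comparing as integers rather than as strings means a leading-zero answer such as 042 matches a target of 42, which suits AIME answers in the range 0–999.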

Evaluation Report

The following results were obtained using uv run inspect eval inspect_evals/aime2026 --model <model> with eval version 1-A. As of 2026-03-18, there are no publicly available results to compare against.

Model	Accuracy	Std Error
openai/gpt-5-nano	0.7667	0.079
openai/gpt-5.4-nano	0.4000	0.091
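
The reported standard errors are consistent with the standard error of the mean over 30 binary scores, computed with the sample variance (n - 1 denominator). The sketch below assumes that formula. The function name `accuracy_and_stderr` is hypothetical, and the counts of 23 and 12 correct answers are inferred from the accuracies in the table (0.7667 × 30 and 0.4000 × 30), not taken from the eval logs.

```python
import math
from typing import List, Tuple


def accuracy_and_stderr(scores: List[int]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for a list of 0/1 scores.

    Assumes the sample variance (n - 1 denominator); this is an assumption
    about how the reported figures were computed.
    """
    n = len(scores)
    mean = sum(scores) / n
    if n < 2:
        return mean, 0.0
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


# Inferred counts: 23 of 30 and 12 of 30 problems correct.
for correct in (23, 12):
    scores = [1] * correct + [0] * (30 - correct)
    acc, se = accuracy_and_stderr(scores)
    print(f"{acc:.4f}\t{se:.3f}")
```

With only 30 problems, a standard error of roughly 0.08 to 0.09 means that differences of a few problems between models are within noise.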

Changelog

[2-A] - 2026-05-11

Fix scorer crash on empty completion (e.g. when model hits max_tokens mid-thought); return INCORRECT instead of raising IndexError.