AIME 2026: Problems from the American Invitational Mathematics Examination
A benchmark for evaluating how well AI models solve challenging mathematics problems from the 2026 AIME, a prestigious high school mathematics competition.
Overview
This evaluation measures AI performance on the 2026 American Invitational Mathematics Examination (AIME). The AIME is a 15-problem, 3-hour competition for high school students who qualify through the AMC 10/12. Problems require ingenuity and mathematical insight across algebra, combinatorics, geometry, and number theory, and every answer is an integer in the range 0–999.
Dataset
The dataset is sourced from math-ai/aime26 on Hugging Face and contains 30 problems (15 from AIME I and 15 from AIME II). Each problem has a corresponding integer answer in the range 0–999.
Scoring
Each problem is scored as correct/incorrect using numeric matching. The scorer strips LaTeX \boxed{} formatting before comparison. Results are reported as accuracy with standard error.
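As an illustration of the scoring rule described above, here is a minimal sketch of stripping `\boxed{}` and comparing numerically. It is not the scorer shipped with the eval: the names `strip_boxed` and `is_correct` are hypothetical, and the choice to take the last `\boxed{}` in the output is an assumption.

```python
import re


def strip_boxed(text: str) -> str:
    """Return the contents of the last \\boxed{...} in text, or text unchanged if none is found."""
    # Assumption: answers contain no nested braces, which holds for plain integers.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else text


def is_correct(model_output: str, target: int) -> bool:
    """Numeric match: the extracted answer must parse as an integer equal to the target."""
    candidate = strip_boxed(model_output).strip()
    try:
        # int() accepts leading zeros, so "042" matches a target of 42.
        return int(candidate) == target
    except ValueError:
        return False
```

Comparing parsed integers rather than raw strings means an answer written in the AIME's zero-padded form (for example `042`) still matches a target stored as `42`.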
Evaluation Report
The following results were obtained using `uv run inspect eval inspect_evals/aime2026 --model <model>` with eval version 1-A. As of 2026-03-18, there are no publicly available results to compare against.
| Model | Accuracy | Std Error |
|---|---|---|
| openai/gpt-5-nano | 0.7667 | 0.079 |
| openai/gpt-5.4-nano | 0.4000 | 0.091 |
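For reading the table, the sketch below shows one way the accuracy and standard error columns relate on a 30-problem dataset. It assumes the standard error is the sample standard deviation (with the n − 1 correction) divided by √n, and that the reported accuracies correspond to 23 and 12 correct answers out of 30. Both are inferences from the reported numbers, not something the eval's documentation states here, and `accuracy_and_stderr` is a hypothetical helper.

```python
import math


def accuracy_and_stderr(scores):
    """Mean of 0/1 scores and the standard error of that mean."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance with Bessel's correction (assumed, see lead-in).
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


# Assumed counts: 23/30 and 12/30 correct, inferred from the reported accuracies.
acc_a, se_a = accuracy_and_stderr([1] * 23 + [0] * 7)
acc_b, se_b = accuracy_and_stderr([1] * 12 + [0] * 18)

print(f"{acc_a:.4f} {se_a:.3f}")  # 0.7667 0.079
print(f"{acc_b:.4f} {se_b:.3f}")  # 0.4000 0.091
```

With only 30 problems, a standard error near 0.08 to 0.09 means each accuracy figure carries substantial uncertainty, so small differences between models should be read with caution.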