MATH: Measuring Mathematical Problem Solving

Mathematics

Dataset of 12,500 challenging competition mathematics problems. Demonstrates few-shot prompting and custom scorers. NOTE: The dataset has been taken down due to a DMCA notice from The Art of Problem Solving.

Overview

MATH is a dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. It consists of 5 levels of difficulty and 7 subjects.

The zero-shot prompt template is based on OpenAI’s simple-evals and the format of the few-shot examples is taken from https://arxiv.org/pdf/2206.14858.
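For illustration only, a Minerva-style few-shot prompt simply places worked problem/solution pairs before the problem to be solved. The helper below is a sketch of that format, not the eval's exact template:

# Illustrative sketch of a Minerva-style few-shot prompt; the eval's actual
# template wording may differ.
def build_fewshot_prompt(examples: list[tuple[str, str]], problem: str) -> str:
    blocks = [f"Problem:\n{p}\n\nSolution:\n{s}" for p, s in examples]
    blocks.append(f"Problem:\n{problem}\n\nSolution:")
    return "\n\n".join(blocks)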

Takedown notice

As of 16 Jan 2025, the original MATH dataset has been taken down from Hugging Face due to a DMCA notice from The Art of Problem Solving.

This eval has been updated to use the MATH-lighteval dataset instead.

Usage

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals[math]

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync --extra math

Running evaluations

Now you can start evaluating models. For simplicity's sake, this section assumes you are using Inspect Evals from the standalone repository. If that's not the case and you are not using uv to manage dependencies in your own project, use the same commands with the uv run prefix dropped.

uv run inspect eval inspect_evals/math --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval
from inspect_evals.math import math
eval(math)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/math --limit 10
uv run inspect eval inspect_evals/math --max-connections 10
uv run inspect eval inspect_evals/math --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

math

  • levels (Union[list[Literal[1, 2, 3, 4, 5]], Literal[1, 2, 3, 4, 5]]): List of levels to filter on, 1 to 5. (default: [])
  • subjects (Union[list[Literal['algebra', 'counting & probability', 'geometry', 'intermediate algebra', 'number theory', 'prealgebra', 'precalculus']], Literal['algebra', 'counting & probability', 'geometry', 'intermediate algebra', 'number theory', 'prealgebra', 'precalculus']]): List of subjects to filter on. (default: [])
  • fewshot (int): The number of fewshots to include (default: 0)
  • fewshot_seed (int): The seed used when selecting fewshots (default: 42)
  • grader_model (str | None): The model used to grade the samples (default: None)
  • shuffle (bool): Whether to shuffle the dataset for evaluation (default: True)
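For example, to evaluate only level 4 and 5 algebra problems with four few-shot examples, you can pass these parameters when constructing the task from Python (parameter names as documented above; the model is the same one used earlier):

# Run the MATH task restricted to level 4-5 algebra with 4 few-shot examples.
from inspect_ai import eval
from inspect_evals.math import math

eval(
    math(levels=[4, 5], subjects="algebra", fewshot=4),
    model="openai/gpt-5-nano",
)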

Dataset

Here is an example from the dataset:

Problem: How many vertical asymptotes does the graph of $y=\frac{2}{x^2+x-6}$ have?
Given Solution: The denominator of the rational function factors into $x^2+x-6=(x-2)(x+3)$. Since the numerator is always nonzero, there is a vertical asymptote whenever the denominator is $0$, which occurs for $x = 2$ and $x = -3$. Therefore, the graph has $\boxed{2}$ vertical asymptotes.

The model is tasked with solving the problem step by step and returning the final answer.
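Each dataset record ultimately becomes an Inspect Sample. The sketch below shows one plausible mapping; the field names (problem, solution, level, type) and the boxed-answer extraction are assumptions, not the eval's exact code:

# Hypothetical record-to-sample mapping; field names and answer extraction
# are assumptions rather than the eval's actual implementation.
import re

from inspect_ai.dataset import Sample

def record_to_sample(record: dict) -> Sample:
    # Use the final \boxed{...} value in the reference solution as the target.
    match = re.search(r"\\boxed\{([^}]*)\}", record["solution"])
    target = match.group(1) if match else record["solution"].strip()
    return Sample(
        input=record["problem"],
        target=target,
        metadata={"level": record["level"], "subject": record["type"]},
    )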

Scoring

Three scoring strategies are used:

  1. expression_equivalance: A grader model is used to compare the predicted mathematical answer/expression with the target.
  2. expression_exact_match_sympy: The answer and target are checked for exact match using the sympy package, following the approach in https://arxiv.org/pdf/2206.14858. This implementation is based on EleutherAI's lm-evaluation-harness/minerva_math (a minimal sketch of the idea follows this list).
  3. expression_exact_match: The answer and target are checked for exact match using simple string-normalization rules, based on EleutherAI's lm-evaluation-harness/hendrycks_math.
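As a rough illustration of the sympy-based comparison in strategy 2, two answers can be treated as equivalent when their difference simplifies to zero. This is a minimal sketch rather than the eval's implementation, which additionally handles LaTeX normalization of answers:

# Minimal sketch of sympy-based answer equivalence; the eval's real scorer
# also normalizes LaTeX answers before parsing.
from sympy import simplify, sympify

def answers_match(answer: str, target: str) -> bool:
    try:
        # Equivalent expressions have a difference that simplifies to zero.
        return simplify(sympify(answer) - sympify(target)) == 0
    except Exception:
        # Fall back to plain string comparison if parsing fails.
        return answer.strip() == target.strip()

print(answers_match("2", "4/2"))                  # True
print(answers_match("x**2 - 1", "(x-1)*(x+1)"))   # True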