MATH: Measuring Mathematical Problem Solving

Mathematics

Dataset of 12,500 challenging competition mathematics problems. Demonstrates few-shot prompting and custom scorers. NOTE: The dataset has been taken down due to a DMCA notice from The Art of Problem Solving.

Overview

MATH is a dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution, which can be used to teach models to generate answer derivations and explanations. The problems span 5 levels of difficulty and 7 subjects.

The zero-shot prompt template is based on OpenAI’s simple-evals and the format of the few-shot examples is taken from https://arxiv.org/pdf/2206.14858.
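To make the idea of few-shot prompting concrete, here is a minimal sketch of how worked examples might be joined ahead of the target problem. The function name build_fewshot_prompt, the "Problem:"/"Solution:" labels, and the sample problems are assumptions chosen for illustration; they are not the template used by this evaluation or by the cited paper.

```python
# Hypothetical few-shot prompt builder. The labels and layout are assumptions,
# not the prompt template used by the evaluation.
def build_fewshot_prompt(examples, problem):
    """Join worked examples and the target problem into one prompt string."""
    blocks = [
        f"Problem:\n{ex['problem']}\n\nSolution:\n{ex['solution']}"
        for ex in examples
    ]
    # The target problem ends with an open "Solution:" for the model to complete.
    blocks.append(f"Problem:\n{problem}\n\nSolution:")
    return "\n\n".join(blocks)


# Made-up example data, used only to show the shape of the result.
examples = [
    {
        "problem": "What is $1+1$?",
        "solution": "Adding gives $1+1=2$. The final answer is $2$.",
    }
]
prompt = build_fewshot_prompt(examples, "What is $2+3$?")
print(prompt)
```

Ending the prompt on an open "Solution:" nudges the model to continue in the same format as the worked examples, which makes the final answer easier to locate when scoring.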

Takedown notice

As of 16 Jan 2025, the MATH dataset has been taken down from HuggingFace due to a DMCA notice from The Art of Problem Solving.

Usage

First, install the dependencies:

uv sync --extra math

Then, evaluate against one or more models with:

uv run inspect eval inspect_evals/math --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/math --limit 10
uv run inspect eval inspect_evals/math --max-connections 10
uv run inspect eval inspect_evals/math --temperature 0.5

See uv run inspect eval --help for all available options.

Dataset

Here is an example from the dataset:

Problem: How many vertical asymptotes does the graph of $y=\frac{2}{x^2+x-6}$ have?
Given Solution: The denominator of the rational function factors into $x^2+x-6=(x-2)(x+3)$. Since the numerator is always nonzero, there is a vertical asymptote whenever the denominator is $0$, which occurs for $x = 2$ and $x = -3$. Therefore, the graph has $\boxed{2}$ vertical asymptotes.

The model is tasked to solve the problem step by step and return the final answer.
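Reference solutions in the dataset mark the final answer with \boxed{...}, as in the example above. The following is a minimal sketch of how that answer could be pulled out of a solution string. The helper name extract_boxed is hypothetical, and this is not necessarily how the evaluation itself extracts answers.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found; nested braces are kept intact.
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced


solution = r"Therefore, the graph has \boxed{2} vertical asymptotes."
print(extract_boxed(solution))
```

Counting brace depth rather than using a simple regular expression lets the helper handle nested LaTeX such as \boxed{\frac{1}{2}}.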

Scoring

Three scoring strategies are used:

1. expression_equivalance: A grader model is used to compare the predicted mathematical answer/expression with the target.
2. expression_exact_match_sympy: The answer and target are evaluated for exact match using the sympy package used in https://arxiv.org/pdf/2206.14858. This implementation is based on EleutherAI’s lm-evaluation-harness/minerva_math.
3. expression_exact_match: The answer and target are evaluated for exact match using simple rules, based on EleutherAI’s lm-evaluation-harness/hendrycks_math.
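As a rough illustration of what exact match "using simple rules" can look like, the sketch below normalizes two answer strings and then compares them. The function names and the specific clean-up rules are assumptions picked for the example. They are far simpler than the rules in lm-evaluation-harness and are not the ones this evaluation applies.

```python
import re


def normalize_answer(answer):
    """Apply a few illustrative clean-ups before string comparison."""
    s = answer.strip()
    s = s.strip("$")  # surrounding math delimiters
    s = s.replace("\\!", "")  # LaTeX negative thin space
    # Drop \left and \right sizing commands, but keep e.g. \leftarrow.
    s = re.sub(r"\\(left|right)(?![A-Za-z])", "", s)
    s = s.replace("dfrac", "frac").replace("tfrac", "frac")
    s = re.sub(r"\s+", "", s)  # remove all whitespace
    s = s.rstrip(".")  # trailing period
    return s


def exact_match(prediction, target):
    """True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(target)


print(exact_match(" $\\dfrac{1}{2}$ ", "\\frac{1}{2}"))
print(exact_match("0.5", "\\frac{1}{2}"))
```

String normalization of this kind cannot tell that 0.5 and \frac{1}{2} denote the same value. Cases like that are better suited to the sympy-based and grader-model strategies.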