GSM8K: Training Verifiers to Solve Math Word Problems

Mathematics

Dataset of 8.5K high quality linguistically diverse grade school math word problems. Demostrates fewshot prompting.

Overview

GSM8K is a dataset consisting of diverse grade school math word problems.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Then, evaluate against one or more models with:

inspect eval inspect_evals/gsm8k --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/gsm8k --limit 10
inspect eval inspect_evals/gsm8k --max-connections 10
inspect eval inspect_evals/gsm8k --temperature 0.5

See inspect eval --help for all available options.

Dataset

Here is an example prompt from the dataset (after it has been further processed by Inspect):

Solve the following math problem step by step. The last line of your response should be of the form “ANSWER: $ANSWER” (without quotes) where $ANSWER is the answer to the problem.

Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers’ market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers’ market?

Remember to put your answer on its own line at the end in the form “ANSWER: $ANSWER” (without quotes) where $ANSWER is the answer to the problem, and you do not need to use a \boxed command.

Reasoning:

The model is then expected to generate reasoning steps and provide a final answer.

Scoring

An accuracy is calculated over the datapoints. The correctness is based on an exact-match criterion and whether the model’s final answer is correct.