RACE-H: A benchmark for testing reading comprehension and reasoning abilities of neural models

Reasoning

Reading comprehension tasks collected from English exams for Chinese middle and high school students aged 12 to 18.

Contributed By

Overview

RACE is a benchmark for testing reading comprehension and reasoning abilities of neural models. It is constructed from Chinese middle and high school examinations.

Here, we implement the evaluation for the high school subset (RACE-H) of the RACE dataset. It is formulated as a multiple-choice task: given an article and a question, the model must choose one of four options. The questions are designed so that their answers are not merely text spans from the article, so answering them requires reasoning.

The prompt template is based on the multiple choice template in OpenAI’s simple-evals.
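To illustrate how an article, a question, and four options can be combined into a single multiple-choice prompt, here is a minimal sketch. The template wording and the `build_prompt` helper are hypothetical and only approximate the general shape of such a template; they are not the exact text or code used by this evaluation.

```python
# Hypothetical template: illustrative wording only, not the evaluation's actual prompt.
TEMPLATE = (
    "Answer the following multiple choice question based on the article. "
    "The last line of your response should be of the form 'Answer: $LETTER', "
    "where LETTER is one of {letters}.\n\n"
    "Article:\n{article}\n\n"
    "Question: {question}\n\n"
    "{choices}"
)


def build_prompt(article: str, question: str, options: list[str]) -> str:
    """Format one RACE-style sample as a lettered multiple-choice prompt."""
    letters = "ABCD"[: len(options)]
    # One line per option, e.g. "A) in England".
    choices = "\n".join(
        f"{letter}) {option}" for letter, option in zip(letters, options)
    )
    return TEMPLATE.format(
        letters=letters, article=article, question=question, choices=choices
    )


if __name__ == "__main__":
    print(
        build_prompt(
            "In a small village in England about 150 years ago, ...",
            "The first postage stamp was made ____ .",
            ["in England", "in America", "by Alice", "in 1910"],
        )
    )
```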

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Then, evaluate against one or more models with:

inspect eval inspect_evals/race_h --model openai/gpt-4o

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/race_h --limit 10
inspect eval inspect_evals/race_h --max-connections 10
inspect eval inspect_evals/race_h --temperature 0.5

See inspect eval --help for all available options.

Dataset

Here is an example from the dataset (cropped for brevity):

Article: In a small village in England about 150 years ago, a mail coach was standing on the street. It didn’t come to that village often. People had to pay a lot to get a letter. The person who sent the letter didn’t have to pay the postage, while the receiver had to. …

Question: The first postage stamp was made ____ .

Options: A. in England B. in America C. by Alice D. in 1910

The model is tasked with choosing one of the four options.

Scoring

Accuracy is calculated over the datapoints, that is, the fraction of questions for which the model chooses the correct option.