DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Reasoning

Evaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).

Contributed By

@xeon27

Code

src/inspect_evals/drop

Paper

https://arxiv.org/abs/1903.00161

Overview

DROP is a crowdsourced, adversarially-created, 96k reading comprehension benchmark, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Or, if developing on a clone of the inspect_evals repo, you can install the package in editable mode with:

pip install -e ".[dev]"

Then, evaluate against one or more models with:

inspect eval inspect_evals/drop --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/drop --limit 10
inspect eval inspect_evals/drop --max-connections 10
inspect eval inspect_evals/drop --temperature 0.5

See inspect eval --help for all available options.

Dataset

The DROP dataset contains 9,535 samples. Here is an example from the dataset:

Passage: The Eagles began their season at Bank of America Stadium for a Week 1 duel with the Carolina Panthers. Philadelphia trailed early in the first quarter as Panthers running back DeAngelo Williams ran 11 yards for a Carolina touchdown on their first drive. The Eagles answered with a 49-yard field goal from kicker David Akers. In the second quarter, Philadelphia exploded with points as defensive end Victor Abiamiri returned a fumble 2 yards for a touchdown, wide receiver DeSean Jackson returned a punt 85 yards for a touchdown, and quarterback Donovan McNabb completed a 9-yard TD pass to tight end Brent Celek and a 4-yard touchdown pass to running back Brian Westbrook. Carolina ended the period with kicker John Kasay booting a 22-yard field goal. In the third quarter, the Eagles closed out their scoring with McNabb scoring on a 3-yard touchdown run. However, he was hit late by several Carolina tacklers who cracked his ribs on the right side, knocking him out of the game. Kevin Kolb came in for McNabb and closed out the game for the victorious Eagles.

Question: How many yards was the longest scoring play of the first quarter?

The model is tasked to answer the question by referring to multiple parts in the passage. The prompts are based on OpenAI’s simple-evals.

Scoring

A simple accuracy is calculated over the datapoints.