WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale
A large-scale benchmark inspired by the Winograd Schema Challenge (WSC), a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations.
Overview
WinoGrande is a collection of 44k problems inspired by the Winograd Schema Challenge. Each problem is formulated as a fill-in-the-blank task with two options; the goal is to choose the option that correctly completes the sentence, which requires commonsense reasoning.
Usage
First, install the inspect_ai and inspect_evals Python packages with:
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
Then, evaluate against one or more models with:
inspect eval inspect_evals/winogrande --model openai/gpt-4o
After running evaluations, you can view their logs using the inspect view command:
inspect view
If you don't want to specify the --model option each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
inspect eval inspect_evals/winogrande --limit 10
inspect eval inspect_evals/winogrande --max-connections 10
inspect eval inspect_evals/winogrande --temperature 0.5
See inspect eval --help for all available options.
Dataset
Here is an example from the dataset:
Sentence: He never comes to my home, but I always go to his house because the [BLANK] is smaller.
Options: home, house
The model is tasked with filling the [BLANK] with one of the two options.
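To make the task format concrete, here is a minimal sketch of how the two candidate sentences can be built from the example above. The fill_blank helper and the BLANK constant are hypothetical names introduced for illustration; this is not the inspect_evals implementation, and the underlying dataset may mark the blank differently.

```python
# Hypothetical placeholder token, matching the example shown in this README.
BLANK = "[BLANK]"


def fill_blank(sentence: str, option: str) -> str:
    """Return the sentence with its single blank replaced by the given option."""
    if sentence.count(BLANK) != 1:
        raise ValueError("expected exactly one blank in the sentence")
    return sentence.replace(BLANK, option)


sentence = (
    "He never comes to my home, but I always go to his house "
    "because the [BLANK] is smaller."
)
options = ["home", "house"]

# One candidate sentence per option; the model has to pick the sensible one.
candidates = [fill_blank(sentence, option) for option in options]
for candidate in candidates:
    print(candidate)
```

Only one of the two completed sentences is consistent with commonsense, which is what the model has to identify.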
Evaluation
Accuracy is calculated over the datapoints, i.e. the fraction of problems for which the model selects the correct option.
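As an illustration of the metric, the following sketch computes accuracy as the number of correct choices divided by the total number of datapoints. The accuracy function and the prediction and target lists are made up for this example; they are not model results, and the scorer used by inspect_evals may be implemented differently.

```python
def accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of datapoints where the predicted option equals the target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot compute accuracy over zero datapoints")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Made-up model choices and gold answers for four hypothetical problems.
predictions = ["home", "house", "home", "house"]
targets = ["home", "home", "home", "house"]

score = accuracy(predictions, targets)
print(f"accuracy = {score:.2f}")
```

Because every problem has two options, a model that guesses uniformly at random would be expected to score about 0.5 under this metric.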