CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Measure question answering with commonsense prior knowledge.

Overview

CommonsenseQA is a dataset designed to evaluate commonsense reasoning capabilities in natural language processing models. It consists of 12,247 multiple-choice questions that require background knowledge and commonsense to answer correctly. The dataset was constructed using ConceptNet, a graph-based knowledge base, where crowd workers authored questions with complex semantics to challenge existing AI models.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Then, evaluate against one or more models with:

inspect eval inspect_evals/commonsense_qa --model openai/gpt-4o
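
You can also run the evaluation from Python via the inspect_ai API. The sketch below assumes the task function is exported as commonsense_qa from the inspect_evals.commonsense_qa package (matching the CLI task name above):

from inspect_ai import eval
from inspect_evals.commonsense_qa import commonsense_qa

# Run the eval against a specific model (equivalent to the CLI command above).
eval(commonsense_qa, model="openai/gpt-4o")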

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/commonsense_qa --limit 10
inspect eval inspect_evals/commonsense_qa --max-connections 10
inspect eval inspect_evals/commonsense_qa --temperature 0.5

See inspect eval --help for all available options.
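
The same options can also be supplied when running from Python. This is a minimal sketch, assuming that eval() forwards generation options such as temperature and max_connections as keyword arguments:

from inspect_ai import eval
from inspect_evals.commonsense_qa import commonsense_qa

# Evaluate only 10 samples, cap concurrent model connections at 10, and
# sample at temperature 0.5; the model is read from INSPECT_EVAL_MODEL
# if not passed explicitly.
eval(
    commonsense_qa,
    limit=10,
    max_connections=10,
    temperature=0.5,
)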

Dataset

CommonsenseQA is a multiple-choice question answering dataset with 1,140 samples that require different types of commonsense knowledge to predict the correct answer. Here is an example from the dataset:

Where can I stand on a river to see water falling without getting wet?

  A) waterfall
  B) bridge
  C) valley
  D) stream
  E) bottom

The model is required to choose the correct answer from the given options.
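
Each dataset record is converted into an Inspect sample before evaluation. The sketch below is a hypothetical mapping from a raw CommonsenseQA record (as published on Hugging Face under tau/commonsense_qa) onto Inspect's Sample type; the exact field mapping used by inspect_evals may differ:

from inspect_ai.dataset import Sample

def record_to_sample(record: dict) -> Sample:
    # Hypothetical mapping; field names follow the Hugging Face dataset schema.
    return Sample(
        input=record["question"],           # the question stem
        choices=record["choices"]["text"],  # the five answer options
        target=record["answerKey"],         # gold label, e.g. "B" for bridge
        id=record["id"],
    )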

Scoring

A simple accuracy is calculated over the datapoints: a question is scored as correct when the model selects the answer choice that matches the gold answer key.
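
In Inspect terms, this presumably corresponds to the built-in choice scorer paired with the accuracy metric: each sample is marked correct when the letter the model selects matches the answer key, and accuracy is the mean over all samples. A minimal illustration with hypothetical values:

# Hypothetical model choices and gold answer keys for four samples.
predictions = ["B", "A", "C", "B"]
targets = ["B", "A", "D", "B"]

# Accuracy is the fraction of samples where the chosen letter matches the key.
accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)
print(accuracy)  # 0.75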