CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Knowledge

Measure question answering with commonsense prior knowledge.

Overview

CommonsenseQA is a dataset designed to evaluate commonsense reasoning capabilities in natural language processing models. It consists of 12,247 multiple-choice questions that require background knowledge and commonsense to answer correctly. The dataset was constructed using CONCEPTNET, a graph-based knowledge base, where crowd-workers authored questions with complex semantics to challenge existing AI models.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Then, evaluate against one or more models with:

inspect eval inspect_evals/commonsense_qa --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/commonsense_qa --limit 10
inspect eval inspect_evals/commonsense_qa --max-connections 10
inspect eval inspect_evals/commonsense_qa --temperature 0.5

See inspect eval --help for all available options.

Dataset

CommonsenseQA is a multiple-choice question answering dataset with 1,140 samples which require different types of commonsense knowledge to predict the correct answers. Here is an example from the dataset:

Where can I stand on a river to see water falling without getting wet?

  1. Waterfall
  2. Bridge
  3. Valley
  4. Stream
  5. Bottom

The model is required to choose the correct answer from the given options.

Scoring

A simple accuracy is calculated over the datapoints.