BBH: Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Reasoning

Suite of 23 challenging BIG-Bench tasks for which prior language model evaluations did not outperform the average human rater.

Overview

BIG-Bench-Hard (BBH) is a suite of 23 challenging tasks from the BIG-Bench benchmark on which prior language models did not outperform average human-rater performance. Counting the three-, five-, and seven-object variants of the logical deduction and shuffled-object tracking tasks separately, the suite comprises 27 datasets. The tasks focus on algorithmic, commonsense, and multi-step reasoning. Evaluated models can be prompted zero-shot, 3-shot answer-only, or 3-shot chain-of-thought.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Then, evaluate against one or more models with:

inspect eval inspect_evals/bbh --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/bbh --limit 10
inspect eval inspect_evals/bbh --max-connections 10
inspect eval inspect_evals/bbh --temperature 0.5

See inspect eval --help for all available options.

You can evaluate specific datasets and specify prompting behavior using the dataset_name and prompt_type task parameters. For example:

inspect eval inspect_evals/bbh -T "dataset_name=date_understanding"
inspect eval inspect_evals/bbh -T "prompt_type=chain_of_thought" 
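The same parameters can also be passed from Python if you prefer to drive evaluations programmatically. The sketch below assumes the task function is exported as bbh from inspect_evals.bbh and that its keyword arguments mirror the -T flags above; verify the import path against your installed version.

# Minimal sketch: programmatic equivalent of the -T flags above.
# The import path and keyword names are assumptions mirroring the CLI examples.
from inspect_ai import eval
from inspect_evals.bbh import bbh

eval(
    bbh(dataset_name="date_understanding", prompt_type="chain_of_thought"),
    model="openai/gpt-4o",
)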

Dataset

BIG-Bench-Hard is a question-answering benchmark comprising 27 datasets, most containing 250 samples each; the exceptions are ‘snarks’ with 178 samples, ‘penguins_in_a_table’ with 146, and ‘causal_judgement’ with 187. In total, the benchmark includes 6,511 samples, each requiring some form of algorithmic, commonsense, or multi-step reasoning to predict the correct answer.
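If you want to sanity-check these per-task sample counts yourself, one option is to load individual task splits with the Hugging Face datasets library. The dataset id below (lukaemon/bbh) is a public BBH mirror assumed here for illustration; it is not necessarily the source this implementation loads from.

# Sketch: check per-task sample counts against a public BBH mirror.
# "lukaemon/bbh" is an assumed dataset id, used only for illustration.
from datasets import load_dataset

for name in ("date_understanding", "snarks", "penguins_in_a_table", "causal_judgement"):
    ds = load_dataset("lukaemon/bbh", name, split="test")
    print(name, len(ds))  # expect 250, 178, 146, 187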

Tasks

Here’s a list of tasks categorized by dataset type; a sketch for sweeping over them programmatically follows the lists:

Multiple Choice Datasets

  1. date_understanding
  2. disambiguation_qa
  3. geometric_shapes
  4. hyperbaton
  5. logical_deduction_five_objects
  6. logical_deduction_seven_objects
  7. logical_deduction_three_objects
  8. movie_recommendation
  9. penguins_in_a_table
  10. reasoning_about_colored_objects
  11. ruin_names
  12. salient_translation_error_detection
  13. snarks
  14. temporal_sequences
  15. tracking_shuffled_objects_five_objects
  16. tracking_shuffled_objects_seven_objects
  17. tracking_shuffled_objects_three_objects

Binary Choice Datasets

  1. boolean_expressions (True, False)
  2. causal_judgement (Yes, No)
  3. formal_fallacies (valid, invalid)
  4. navigate (Yes, No)
  5. sports_understanding (yes, no)
  6. web_of_lies (Yes, No)

Open Answer Datasets

  1. multistep_arithmetic_two (integer)
  2. object_counting (natural number)
  3. word_sorting (list of words)
  4. dyck_languages (closing brackets)
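Because dataset_name accepts any of the dataset names listed above, you can run several (or all) of them in one go. A minimal sketch, again assuming the bbh task is importable from inspect_evals.bbh:

# Sketch: run one evaluation per listed BBH dataset.
# The import path and task signature are assumptions; adjust to your installation.
from inspect_ai import eval
from inspect_evals.bbh import bbh

datasets = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    # ...add the remaining dataset names from the lists above
]

for name in datasets:
    # limit=10 mirrors the --limit CLI option shown earlier
    eval(bbh(dataset_name=name), model="openai/gpt-4o", limit=10)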

Here is an example from the dataset:

Q: “It is not always easy to grasp who is consuming which products. The following argument pertains to this question: Every infrequent user of Paul Mitchell shampoo is either a rare consumer of Nioxin shampoo or a loyal buyer of Caress soap, or both. No regular consumer of Lush soap is a rare consumer of Nioxin shampoo and, in the same time, a loyal buyer of Caress soap. It follows that whoever is an infrequent user of Paul Mitchell shampoo is not a regular consumer of Lush soap.” Is the argument, given the explicitly stated premises, deductively valid or invalid?
Options:
- valid
- invalid

The model is required to choose the correct answer from the given options.

Scoring

A simple accuracy is calculated over the samples.
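Concretely, each sample is marked correct or incorrect by comparing the model’s answer to the target, and the reported metric is the fraction of correct samples. A minimal illustrative sketch of that computation (not the implementation’s actual scorer):

# Illustrative only: accuracy as the fraction of correctly answered samples.
def accuracy(correct: list[bool]) -> float:
    return sum(correct) / len(correct) if correct else 0.0

print(accuracy([True, False, True, True]))  # 0.75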