BBH: Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Reasoning

Suite of 23 challenging BIG-Bench tasks for which prior language model evaluations did not outperform the average human rater.

Overview

BIG-Bench-Hard (BBH) is a suite of 23 challenging tasks from the BIG-Bench benchmark on which prior language models did not outperform average human-rater performance. Counting the three-, five-, and seven-object variants of the logical deduction and shuffled-object tracking tasks separately, the suite comprises 27 datasets. The tasks focus on algorithmic, commonsense, and multi-step reasoning. Evaluated models can be prompted zero-shot, 3-shot answer-only, or 3-shot chain-of-thought.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Then, evaluate against one or more models with:

inspect eval inspect_evals/bbh --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/bbh --limit 10
inspect eval inspect_evals/bbh --max-connections 10
inspect eval inspect_evals/bbh --temperature 0.5

See inspect eval --help for all available options.

You can evaluate specific datasets and specify prompting behavior using the dataset_name and prompt_type task parameters. For example:

inspect eval inspect_evals/bbh -T "dataset_name=date_understanding"
inspect eval inspect_evals/bbh -T "prompt_type=chain_of_thought" 
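The same parameters can also be passed from Python if you prefer to drive evaluations programmatically. The sketch below assumes the task function is exported as bbh from inspect_evals.bbh and that its keyword arguments mirror the -T flags above; verify the import path against your installed version.

# Minimal sketch: programmatic equivalent of the -T flags above.
# The import path and keyword names are assumptions mirroring the CLI examples.
from inspect_ai import eval
from inspect_evals.bbh import bbh

eval(
    bbh(dataset_name="date_understanding", prompt_type="chain_of_thought"),
    model="openai/gpt-4o",
)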

Dataset

BIG-Bench-Hard is a question-answering benchmark comprising 27 datasets, most containing 250 samples each; the exceptions are ‘snarks’ with 178 samples, ‘penguins_in_a_table’ with 146, and ‘causal_judgement’ with 187. In total, the benchmark includes 6,511 samples, each requiring some form of algorithmic, commonsense, or multi-step reasoning to predict the correct answer.
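If you want to sanity-check these per-task sample counts yourself, one option is to load individual task splits with the Hugging Face datasets library. The dataset id below (lukaemon/bbh) is a public BBH mirror assumed here for illustration; it is not necessarily the source this implementation loads from.

# Sketch: check per-task sample counts against a public BBH mirror.
# "lukaemon/bbh" is an assumed dataset id, used only for illustration.
from datasets import load_dataset

for name in ("date_understanding", "snarks", "penguins_in_a_table", "causal_judgement"):
    ds = load_dataset("lukaemon/bbh", name, split="test")
    print(name, len(ds))  # expect 250, 178, 146, 187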

Tasks

Here’s a list of tasks categorized by dataset type; a sketch for sweeping over them programmatically follows the lists:

Multiple Choice Datasets

  1. date_understanding
  2. disambiguation_qa
  3. geometric_shapes
  4. hyperbaton
  5. logical_deduction_five_objects
  6. logical_deduction_seven_objects
  7. logical_deduction_three_objects
  8. movie_recommendation
  9. penguins_in_a_table
  10. reasoning_about_colored_objects
  11. ruin_names
  12. salient_translation_error_detection
  13. snarks
  14. temporal_sequences
  15. tracking_shuffled_objects_five_objects
  16. tracking_shuffled_objects_seven_objects
  17. tracking_shuffled_objects_three_objects

Binary Choice Datasets

  1. boolean_expressions (True, False)
  2. causal_judgement (Yes, No)
  3. formal_fallacies (valid, invalid)
  4. navigate (Yes, No)
  5. sports_understanding (yes, no)
  6. web_of_lies (Yes, No)

Open Answer Datasets

  1. multistep_arithmetic_two (integer)
  2. object_counting (natural number)
  3. word_sorting (list of words)
  4. dyck_languages (closing brackets)
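Because dataset_name accepts any of the dataset names listed above, you can run several (or all) of them in one go. A minimal sketch, again assuming the bbh task is importable from inspect_evals.bbh:

# Sketch: run one evaluation per listed BBH dataset.
# The import path and task signature are assumptions; adjust to your installation.
from inspect_ai import eval
from inspect_evals.bbh import bbh

datasets = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    # ...add the remaining dataset names from the lists above
]

for name in datasets:
    # limit=10 mirrors the --limit CLI option shown earlier
    eval(bbh(dataset_name=name), model="openai/gpt-4o", limit=10)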

Here is an example from the dataset:

Q: “It is not always easy to grasp who is consuming which products. The following argument pertains to this question: Every infrequent user of Paul Mitchell shampoo is either a rare consumer of Nioxin shampoo or a loyal buyer of Caress soap, or both. No regular consumer of Lush soap is a rare consumer of Nioxin shampoo and, in the same time, a loyal buyer of Caress soap. It follows that whoever is an infrequent user of Paul Mitchell shampoo is not a regular consumer of Lush soap.” Is the argument, given the explicitly stated premises, deductively valid or invalid?
Options:
- valid
- invalid

The model is required to choose the correct answer from the given options.

Scoring

A simple accuracy is calculated over the samples.
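Concretely, each sample is marked correct or incorrect by comparing the model’s answer to the target, and the reported metric is the fraction of correct samples. A minimal illustrative sketch of that computation (not the implementation’s actual scorer):

# Illustrative only: accuracy as the fraction of correctly answered samples.
def accuracy(correct: list[bool]) -> float:
    return sum(correct) / len(correct) if correct else 0.0

print(accuracy([True, False, True, True]))  # 0.75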