BBH: Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Suite of 23 challenging BIG-Bench tasks for which prior language model evaluations did not outperform the average human-rater.
Overview
BIG-Bench-Hard is a subset of the BIG-Bench suite comprising 23 challenging tasks (split into 27 datasets, since the logical_deduction and tracking_shuffled_objects tasks each come in three-, five-, and seven-object variants) on which prior language models did not outperform average human-rater performance. The benchmark tasks focus on algorithmic, commonsense, and multi-step reasoning. Evaluated models can be prompted zero-shot, 3-shot answer-only, or 3-shot with chain-of-thought.
Usage
First, install the inspect_ai and inspect_evals Python packages with:
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
Then, evaluate against one or more models with:
inspect eval inspect_evals/bbh --model openai/gpt-4o
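You can also run the evaluation from Python using inspect_ai's eval() function. This is a minimal sketch, assuming the task can be referenced by its registry name once inspect_evals is installed; the model string is just an example:
# Minimal sketch: run BBH from Python instead of the CLI.
# Assumes the task is addressable by its registry name; replace the model
# string with any provider/model you have access to.
from inspect_ai import eval
eval("inspect_evals/bbh", model="openai/gpt-4o")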
After running evaluations, you can view their logs using the inspect view command:
inspect view
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
inspect eval inspect_evals/bbh --limit 10
inspect eval inspect_evals/bbh --max-connections 10
inspect eval inspect_evals/bbh --temperature 0.5
See inspect eval --help for all available options.
You can evaluate specific datasets and specify prompting behavior using the dataset_name and prompt_type task parameters. For example:
inspect eval inspect_evals/bbh -T "dataset_name=date_understanding"
inspect eval inspect_evals/bbh -T "prompt_type=chain_of_thought"
Dataset
BIG-Bench-Hard is a question-answering benchmark spanning 27 datasets. Most contain 250 samples; the exceptions are ‘snarks’ with 178 samples, ‘penguins_in_a_table’ with 146, and ‘causal_judgement’ with 187. In total, the benchmark includes 6,511 samples, each requiring some form of algorithmic, commonsense, or multi-step reasoning to predict the correct answer.
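If you want to look at the raw data outside the evaluation harness, the tasks are also mirrored on the Hugging Face Hub. The sketch below assumes the lukaemon/bbh mirror, which exposes each task as a configuration with input and target columns; this is an assumption, not necessarily the source used by this implementation:
# Sketch: inspect the raw samples for one BBH task.
# Assumes the lukaemon/bbh mirror on the Hugging Face Hub (an assumption,
# not necessarily the dataset source used by inspect_evals).
from datasets import load_dataset
ds = load_dataset("lukaemon/bbh", "date_understanding", split="test")
print(len(ds))           # 250 samples for this task
print(ds[0]["input"])    # question text
print(ds[0]["target"])   # gold answer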
Tasks
Here’s a list of tasks categorized by dataset type:
Multiple Choice Datasets
- date_understanding
- disambiguation_qa
- geometric_shapes
- hyperbaton
- logical_deduction_five_objects
- logical_deduction_seven_objects
- logical_deduction_three_objects
- movie_recommendation
- penguins_in_a_table
- reasoning_about_colored_objects
- ruin_names
- salient_translation_error_detection
- snarks
- temporal_sequences
- tracking_shuffled_objects_five_objects
- tracking_shuffled_objects_seven_objects
- tracking_shuffled_objects_three_objects
Binary Choice Datasets
- boolean_expressions (True, False)
- causal_judgement (Yes, No)
- formal_fallacies (valid, invalid)
- navigate (Yes, No)
- sports_understanding (yes, no)
- web_of_lies (Yes, No)
Open Answer Datasets
- multistep_arithmetic_two (integer)
- object_counting (natural number)
- word_sorting (list of words)
- dyck_languages (closing brackets)
Here is an example from the dataset:
Q: “It is not always easy to grasp who is consuming which products. The following argument pertains to this question: Every infrequent user of Paul Mitchell shampoo is either a rare consumer of Nioxin shampoo or a loyal buyer of Caress soap, or both. No regular consumer of Lush soap is a rare consumer of Nioxin shampoo and, in the same time, a loyal buyer of Caress soap. It follows that whoever is an infrequent user of Paul Mitchell shampoo is not a regular consumer of Lush soap.” Is the argument, given the explicitly stated premises, deductively valid or invalid?
Options:
- valid
- invalid
The model is required to choose the correct answer from the given options.
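For binary-choice tasks like formal_fallacies, such a record maps naturally onto inspect_ai's Sample type, with the two options as choices. The following is only an illustrative sketch, not the benchmark's actual record construction, and the target shown is a placeholder rather than the dataset's gold label:
# Illustrative sketch of how a formal_fallacies record could be represented;
# not the benchmark's actual construction, and the target is a placeholder.
from inspect_ai.dataset import Sample
sample = Sample(
    input="It is not always easy to grasp who is consuming which products. ...",  # full question text
    choices=["valid", "invalid"],
    target="valid",  # placeholder; the real label comes from the dataset
)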
Scoring
A simple accuracy, the proportion of correctly answered samples, is calculated over the dataset.
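In other words, each sample is marked correct or incorrect by the task's scorer, and the reported metric is the fraction of correct samples. A minimal sketch of that calculation (inspect_ai provides accuracy as a built-in metric; this is not the benchmark's own code):
# Minimal sketch of the accuracy metric: fraction of correctly scored samples.
# Not the benchmark's own code; inspect_ai ships accuracy as a built-in metric.
def accuracy(results: list[bool]) -> float:
    return sum(results) / len(results) if results else 0.0

print(accuracy([True, True, False, True]))  # 0.75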