BBH: Challenging BIG-Bench Tasks
Tests AI models on a suite of 23 challenging BIG-Bench tasks that previously proved difficult even for advanced language models to solve.
Overview
BIG-Bench-Hard (BBH) is a suite of 23 challenging tasks from the BIG-Bench suite on which prior language models did not outperform average human-rater performance. Two of those tasks (logical deduction and tracking shuffled objects) come in three-, five-, and seven-object variants, so the benchmark comprises 27 datasets in total. The tasks focus on algorithmic, commonsense, and multi-step reasoning. Evaluated models can be prompted zero-shot, 3-shot answer-only, or 3-shot chain-of-thought.
Usage
Installation
There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or from a standalone checkout of the GitHub repository.
If you are using it from PyPI, install the package and its dependencies via:
pip install inspect-evals

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:
uv sync

Running evaluations
Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repository. If that’s not the case and you are not using uv to manage dependencies in your own project, run the same commands with the uv run prefix dropped.
uv run inspect eval inspect_evals/bbh --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from Python:
from inspect_ai import eval
from inspect_evals.bbh import bbh
eval(bbh)
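If no default model is configured, eval() also accepts the model and common run options directly; a minimal sketch, reusing the model name and sample limit from the CLI examples purely for illustration:

from inspect_ai import eval
from inspect_evals.bbh import bbh

# Run a small slice of BBH against an explicitly chosen model
eval(bbh(), model="openai/gpt-5-nano", limit=10)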
After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also install the Inspect AI extension for viewing logs.
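Logs can also be read programmatically with inspect_ai's log helpers; a minimal sketch, assuming evaluations have already written logs to the default log directory:

from inspect_ai.log import list_eval_logs, read_eval_log

# List available logs, read one, and print its headline fields
logs = list_eval_logs()
log = read_eval_log(logs[0])
print(log.eval.task, log.status)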
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
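If you prefer, the same variables can be set in the process environment before calling eval() from Python; this sketch simply mirrors the .env file above (the model name is only an example):

import os

# Equivalent to the .env file: default model plus provider API key
os.environ["INSPECT_EVAL_MODEL"] = "anthropic/claude-opus-4-1-20250805"
os.environ["ANTHROPIC_API_KEY"] = "<anthropic-api-key>"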
Options

You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/bbh --limit 10
uv run inspect eval inspect_evals/bbh --max-connections 10
uv run inspect eval inspect_evals/bbh --temperature 0.5

See uv run inspect eval --help for all available options.
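When running from Python, roughly equivalent options can be passed to eval() as keyword arguments; a minimal sketch, assuming a recent inspect_ai where generation options such as temperature and max_connections are accepted this way:

from inspect_ai import eval
from inspect_evals.bbh import bbh

# Mirror the CLI flags above: sample limit, connection cap, temperature
eval(bbh(), limit=10, max_connections=10, temperature=0.5)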
Parameters
bbh
- subset_name (str | None): Name of the subset to use. See BBHDatasetRegistry.MULTIPLE_CHOICE_DATASETS, BBHDatasetRegistry.BINARY_CHOICE_DATASETS, BBHDatasetRegistry.EXACT_MATCH_DATASETS, or BBHDatasetRegistry.DYCK_DATASET for valid options. (default: None)
- prompt_type (str): Type of prompt to use. One of ["zero_shot", "answer_only", "chain_of_thought"]. (default: 'answer_only')
You can evaluate specific datasets and specify prompting behavior using the dataset_name and prompt_type task parameters. For example:
inspect eval inspect_evals/bbh -T "dataset_name=date_understanding"
inspect eval inspect_evals/bbh -T "prompt_type=chain_of_thought"
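The same task parameters can be passed when constructing the task in Python. Note that the Parameters section above lists the subset parameter as subset_name while the CLI examples use dataset_name; the sketch below assumes dataset_name, so adjust to whichever name your installed version exposes:

from inspect_ai import eval
from inspect_evals.bbh import bbh

# Evaluate a single BBH task with chain-of-thought prompting
eval(bbh(dataset_name="date_understanding", prompt_type="chain_of_thought"))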
Dataset

BIG-Bench-Hard is a question-answering dataset comprising 27 unique tasks, most containing 250 samples each; the exceptions are ‘snarks’ with 178 samples, ‘penguins_in_a_table’ with 146, and ‘causal_judgement’ with 187. In total, the benchmark includes 6,511 samples across the 27 tasks, each requiring a different form of algorithmic, commonsense, or multi-step reasoning to predict the correct answer.
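The quoted total is consistent with the per-task counts; as a quick check:

# 24 tasks with 250 samples each, plus the three smaller tasks
assert 24 * 250 + 178 + 146 + 187 == 6511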
Tasks
Here’s a list of tasks categorized by dataset type (a short sketch after the lists shows how to evaluate a specific subset):
Multiple Choice Datasets
- date_understanding
- disambiguation_qa
- geometric_shapes
- hyperbaton
- logical_deduction_five_objects
- logical_deduction_seven_objects
- logical_deduction_three_objects
- movie_recommendation
- penguins_in_a_table
- reasoning_about_colored_objects
- ruin_names
- salient_translation_error_detection
- snarks
- temporal_sequences
- tracking_shuffled_objects_five_objects
- tracking_shuffled_objects_seven_objects
- tracking_shuffled_objects_three_objects
Binary Choice Datasets
- boolean_expressions (True, False)
- causal_judgement (Yes, No)
- formal_fallacies (valid, invalid)
- navigate (Yes, No)
- sports_understanding (yes, no)
- web_of_lies (Yes, No)
Open Answer Datasets
- multistep_arithmetic_two (integer)
- object_counting (natural number)
- word_sorting (list of words)
- dyck_languages (closing brackets)
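Any of the task names above can be passed as the subset parameter; a minimal sketch that evaluates a few of the multiple-choice tasks in a single call (parameter name as discussed under Parameters; the subset list and limit are illustrative):

from inspect_ai import eval
from inspect_evals.bbh import bbh

# Evaluate several BBH subsets in one run, a few samples each
subsets = ["date_understanding", "snarks", "temporal_sequences"]
eval([bbh(dataset_name=name) for name in subsets], limit=5)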
Here is an example from the dataset:
Q: “It is not always easy to grasp who is consuming which products. The following argument pertains to this question: Every infrequent user of Paul Mitchell shampoo is either a rare consumer of Nioxin shampoo or a loyal buyer of Caress soap, or both. No regular consumer of Lush soap is a rare consumer of Nioxin shampoo and, in the same time, a loyal buyer of Caress soap. It follows that whoever is an infrequent user of Paul Mitchell shampoo is not a regular consumer of Lush soap.” Is the argument, given the explicitly stated premises, deductively valid or invalid?
Options:
- valid
- invalid
The model is required to choose the correct answer from the given options.
Scoring
A simple accuracy metric is calculated over the samples.
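The accuracy can be read from the log returned by eval(); a minimal sketch, assuming the standard inspect_ai results structure (score and metric names may vary by version):

from inspect_ai import eval
from inspect_evals.bbh import bbh

# Run the eval and print the accuracy metric for each scorer
log = eval(bbh(), limit=10)[0]
for score in log.results.scores:
    print(score.name, score.metrics["accuracy"].value)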