# BIG-Bench Extra Hard
A reasoning capability dataset that replaces each task in BIG-Bench-Hard with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty.
## Overview

BIG-Bench Extra Hard (BBEH) is a reasoning capability dataset that replaces each task in BIG-Bench-Hard (BBH) with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty.
The full dataset has 4,520 examples. You can evaluate a specific task type (one of the 23 in BBEH) by specifying `benchmark_task`. You can also evaluate the mini subset (`bbeh_mini`), which has 460 examples (20 randomly selected from each task).
```bash
inspect eval inspect_evals/bbeh -T "benchmark_task='boardgame qa'"
inspect eval inspect_evals/bbeh_mini
```

## Tasks
Here is the list of the 23 tasks you can specify. Each task has multiple examples, totaling 4,520 across the full dataset.
- boardgame qa
- boolean expressions
- buggy tables
- causal understanding
- disambiguation qa
- dyck languages
- geometric shapes
- hyperbaton
- linguini
- movie recommendation
- multistep arithmetic
- nycc
- object counting
- object properties
- sarc triples
- shuffled objects
- spatial reasoning
- sportqa
- temporal sequence
- time arithmetic
- web of lies
- word sorting
- zebra puzzles
## Scoring

The original paper recommends aggregating with the harmonic mean of per-task accuracies for the full dataset, and with the micro average (plain accuracy over all examples) for the mini subset.
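As a concrete sketch of the two aggregations, with hypothetical numbers and an assumed small offset to keep zero-accuracy tasks from collapsing the harmonic mean (see the paper for the exact adjustment):

```python
from statistics import harmonic_mean

# Hypothetical per-task accuracies on a 0-1 scale (not real results).
task_acc = {"boardgame qa": 0.375, "buggy tables": 0.010, "temporal sequence": 0.005}

# Full dataset: harmonic mean across tasks. A zero accuracy on any task
# would pull the plain harmonic mean to zero, so a small offset is added
# before averaging and subtracted afterwards (the 0.01 value here is an
# assumption, not the paper's exact constant).
OFFSET = 0.01
bbeh_score = harmonic_mean([a + OFFSET for a in task_acc.values()]) - OFFSET

# Mini subset: micro average, i.e. total correct over total examples.
correct, total = 103, 460  # hypothetical counts
bbeh_mini_score = correct / total

print(f"BBEH (harmonic mean): {bbeh_score:.3f}")
print(f"BBEH mini (micro average): {bbeh_mini_score:.3f}")
```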
## Model evaluation

### BBEH
| Tasks | Model | Score on Inspect | Score (original paper) | Eval Date |
|---|---|---|---|---|
| linguini | openai/gpt-4o | 0.140 | 0.155 | August 11, 2025 |
| multistep arithmetic | openai/gpt-4o | 0.030 | 0.055 | August 11, 2025 |
| zebra puzzles | openai/gpt-4o | 0.240 | 0.320 | August 11, 2025 |
| dyck languages | openai/gpt-4o | 0.090 | 0.080 | August 11, 2025 |
| boolean expressions | openai/gpt-4o | 0.260 | 0.225 | August 11, 2025 |
| hyperbaton | openai/gpt-4o | 0.060 | 0.075 | August 11, 2025 |
| causal understanding | openai/gpt-4o | 0.590 | 0.540 | August 11, 2025 |
| sportqa | openai/gpt-4o | 0.245 | 0.250 | August 11, 2025 |
| web of lies | openai/gpt-4o | 0.135 | 0.145 | August 11, 2025 |
| object properties | openai/gpt-4o | 0.010 | 0.000 | August 11, 2025 |
| movie recommendation | openai/gpt-4o | 0.595 | 0.610 | August 11, 2025 |
| boardgame qa | openai/gpt-4o | 0.375 | 0.410 | August 11, 2025 |
| buggy tables | openai/gpt-4o | 0.010 | 0.005 | August 11, 2025 |
| shuffled objects | openai/gpt-4o | 0.135 | 0.140 | August 11, 2025 |
| word sorting | openai/gpt-4o | 0.175 | 0.220 | August 11, 2025 |
| nycc | openai/gpt-4o | 0.195 | 0.230 | August 11, 2025 |
| time arithmetic | openai/gpt-4o | 0.405 | 0.455 | August 11, 2025 |
| disambiguation qa | openai/gpt-4o | 0.550 | 0.517 | August 11, 2025 |
| geometric shapes | openai/gpt-4o | 0.220 | 0.225 | August 11, 2025 |
| temporal sequence | openai/gpt-4o | 0.005 | 0.000 | August 11, 2025 |
| spatial reasoning | openai/gpt-4o | 0.165 | 0.140 | August 11, 2025 |
| object counting | openai/gpt-4o | 0.090 | 0.065 | August 11, 2025 |
| sarc triples | openai/gpt-4o | 0.360 | 0.385 | August 11, 2025 |
| BBEH | openai/gpt-4o | 0.079 | 0.060 | August 11, 2025 |
### BBEH mini
| Tasks | Model | Score on Inspect | Score (original paper) | Eval Date |
|---|---|---|---|---|
| BBEH Mini | openai/gpt-4o | 0.224 | 0.235 | August 7, 2025 |
## Parameters

### bbeh

- `benchmark_task` (`Optional[Literal['boardgame qa', 'boolean expressions', 'buggy tables', 'causal understanding', 'disambiguation qa', 'dyck languages', 'geometric shapes', 'hyperbaton', 'linguini', 'movie recommendation', 'multistep arithmetic', 'nycc', 'object counting', 'object properties', 'sarc triples', 'shuffled objects', 'spatial reasoning', 'sportqa', 'temporal sequence', 'time arithmetic', 'web of lies', 'word sorting', 'zebra puzzles']]`): (default: `None`)
- `solver` (`Solver | None`): (default: `None`)

### bbeh_mini

- `solver` (`Solver | None`): (default: `None`)
- `seed` (`int`): (default: `42`)
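For reference, the same evaluations can also be launched from Python via Inspect's `eval()` API. This is a minimal sketch; the `inspect_evals.bbeh` import path is an assumption based on the task names above.

```python
from inspect_ai import eval

# Assumed import path for the BBEH tasks.
from inspect_evals.bbeh import bbeh, bbeh_mini

# Evaluate a single BBEH task type against a model.
eval(bbeh(benchmark_task="boardgame qa"), model="openai/gpt-4o")

# Evaluate the 460-example mini subset (seed controls the sampling).
eval(bbeh_mini(seed=42), model="openai/gpt-4o")
```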