# BIG-Bench Extra Hard
A reasoning capability dataset that replaces each task in BIG-Bench-Hard with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty.
## Overview
The full version has 4,520 examples. You can evaluate specific task types (out of the 23 in BBEH) by specifying `benchmark_task`. You can also evaluate on the `mini` subset, which has 460 examples (20 randomly selected from each task). For example:
```bash
inspect eval inspect_evals/bbeh -T "benchmark_task='boardgame qa'"
inspect eval inspect_evals/bbeh_mini
```
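The same evaluations can be run from the Python API. A minimal sketch, assuming the task functions are importable as `bbeh` and `bbeh_mini` from `inspect_evals.bbeh` (this import path is an assumption; the registered task names above are authoritative):

```python
from inspect_ai import eval

from inspect_evals.bbeh import bbeh, bbeh_mini  # assumed import path

# Score a single task type from the full 4,520-example dataset.
eval(bbeh(benchmark_task="boardgame qa"), model="openai/gpt-4o")

# Score the 460-example mini subset.
eval(bbeh_mini(), model="openai/gpt-4o")
```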
## Tasks

The 23 task types you can pass to `benchmark_task` are listed below. Each task contributes multiple examples, together making up the 4,520-example dataset.
- boardgame qa
- boolean expressions
- buggy tables
- causal understanding
- disambiguation qa
- dyck languages
- geometric shapes
- hyperbaton
- linguini
- movie recommendation
- multistep arithmetic
- nycc
- object counting
- object properties
- sarc triples
- shuffled objects
- spatial reasoning
- sportqa
- temporal sequence
- time arithmetic
- web of lies
- word sorting
- zebra puzzles
## Scoring

The original paper recommends aggregating scores with the harmonic mean across tasks for the full dataset, and with the micro average (classic accuracy over all pooled examples) for the mini subset.
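As a concrete sketch of the two aggregates: the harmonic mean below shifts each task accuracy by one percentage point (+0.01 on the 0–1 scale), which we take to be the paper's adjustment so that a single zero-accuracy task does not collapse the aggregate to zero; applied to the 23 per-task Inspect scores in the table below, it reproduces the reported 0.079 BBEH aggregate.

```python
from statistics import harmonic_mean

def bbeh_score(task_accuracies: list[float]) -> float:
    """Full-dataset aggregate: harmonic mean over per-task accuracies
    (each in [0, 1]), shifted by +0.01 so a single zero-accuracy task
    does not force the aggregate to zero (assumed paper convention)."""
    return harmonic_mean([a + 0.01 for a in task_accuracies])

def bbeh_mini_score(num_correct: int, num_total: int) -> float:
    """Mini-subset aggregate: micro average, i.e. classic accuracy
    pooled over all 460 examples."""
    return num_correct / num_total
```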
## Model evaluation

### BBEH
| Task | Model | Score on Inspect | Score (original paper) | Eval Date |
| --- | --- | --- | --- | --- |
linguini | openai/gpt-4o | 0.140 | 0.155 | August 11, 2025 |
multistep arithmetic | openai/gpt-4o | 0.030 | 0.055 | August 11, 2025 |
zebra puzzles | openai/gpt-4o | 0.240 | 0.320 | August 11, 2025 |
dyck languages | openai/gpt-4o | 0.090 | 0.080 | August 11, 2025 |
boolean expressions | openai/gpt-4o | 0.260 | 0.225 | August 11, 2025 |
hyperbaton | openai/gpt-4o | 0.060 | 0.075 | August 11, 2025 |
causal understanding | openai/gpt-4o | 0.590 | 0.540 | August 11, 2025 |
sportqa | openai/gpt-4o | 0.245 | 0.250 | August 11, 2025 |
web of lies | openai/gpt-4o | 0.135 | 0.145 | August 11, 2025 |
object properties | openai/gpt-4o | 0.010 | 0.000 | August 11, 2025 |
movie recommendation | openai/gpt-4o | 0.595 | 0.610 | August 11, 2025 |
boardgame qa | openai/gpt-4o | 0.375 | 0.410 | August 11, 2025 |
buggy tables | openai/gpt-4o | 0.010 | 0.005 | August 11, 2025 |
shuffled objects | openai/gpt-4o | 0.135 | 0.140 | August 11, 2025 |
word sorting | openai/gpt-4o | 0.175 | 0.220 | August 11, 2025 |
nycc | openai/gpt-4o | 0.195 | 0.230 | August 11, 2025 |
time arithmetic | openai/gpt-4o | 0.405 | 0.455 | August 11, 2025 |
disambiguation qa | openai/gpt-4o | 0.550 | 0.517 | August 11, 2025 |
geometric shapes | openai/gpt-4o | 0.220 | 0.225 | August 11, 2025 |
temporal sequence | openai/gpt-4o | 0.005 | 0.000 | August 11, 2025 |
spatial reasoning | openai/gpt-4o | 0.165 | 0.140 | August 11, 2025 |
object counting | openai/gpt-4o | 0.090 | 0.065 | August 11, 2025 |
sarc triples | openai/gpt-4o | 0.360 | 0.385 | August 11, 2025 |
BBEH (harmonic mean) | openai/gpt-4o | 0.079 | 0.060 | August 11, 2025 |
### BBEH mini
| Benchmark | Model | Score on Inspect | Score (original paper) | Eval Date |
| --- | --- | --- | --- | --- |
BBEH Mini (micro average) | openai/gpt-4o | 0.224 | 0.235 | August 7, 2025 |