BIG-Bench Extra Hard

Reasoning

A reasoning capability dataset that replaces each task in BIG-Bench-Hard with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty.

Overview


The full dataset has 4,520 examples. You can evaluate a specific task type (one of the 23 tasks in BBEH) by specifying benchmark_task. Alternatively, you can evaluate on the mini subset, which has 460 examples (20 randomly selected from each task).

inspect eval inspect_evals/bbeh -T "benchmark_task='boardgame qa'"
inspect eval inspect_evals/bbeh_mini

Tasks

Here is the list of the 23 tasks you can specify. Each task contains multiple examples, totaling 4,520 across the full dataset.

  • boardgame qa
  • boolean expressions
  • buggy tables
  • causal understanding
  • disambiguation qa
  • dyck languages
  • geometric shapes
  • hyperbaton
  • linguini
  • movie recommendation
  • multistep arithmetic
  • nycc
  • object counting
  • object properties
  • sarc triples
  • shuffled objects
  • spatial reasoning
  • sportqa
  • temporal sequence
  • time arithmetic
  • web of lies
  • word sorting
  • zebra puzzles

Scoring

The original paper recommends the harmonic mean of per-task accuracies for the full dataset, and the micro average (standard overall accuracy) for the mini dataset.
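To illustrate how the two aggregation schemes differ, here is a minimal sketch (the helper names are illustrative only, not taken from the Inspect codebase):

```python
from statistics import harmonic_mean

def micro_average(correct, total):
    """Pooled accuracy: total correct over total examples across all tasks."""
    return sum(correct) / sum(total)

def harmonic_mean_accuracy(task_accuracies):
    """Harmonic mean of per-task accuracies."""
    return harmonic_mean(task_accuracies)

# Two hypothetical tasks: 18/20 correct and 5/20 correct.
print(micro_average([18, 5], [20, 20]))     # 0.575
print(harmonic_mean_accuracy([0.9, 0.25]))  # ~0.391
```

Unlike the micro average, the harmonic mean is dominated by the weakest tasks (it approaches zero as any single task accuracy approaches zero), so it rewards models that are uniformly strong across all of the probed reasoning capabilities.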

Model evaluation

BBEH

| Task | Model | Score on Inspect | Score (original paper) | Eval Date |
| --- | --- | --- | --- | --- |
| linguini | openai/gpt-4o | 0.140 | 0.155 | August 11, 2025 |
| multistep arithmetic | openai/gpt-4o | 0.030 | 0.055 | August 11, 2025 |
| zebra puzzles | openai/gpt-4o | 0.240 | 0.320 | August 11, 2025 |
| dyck languages | openai/gpt-4o | 0.090 | 0.080 | August 11, 2025 |
| boolean expressions | openai/gpt-4o | 0.260 | 0.225 | August 11, 2025 |
| hyperbaton | openai/gpt-4o | 0.060 | 0.075 | August 11, 2025 |
| causal understanding | openai/gpt-4o | 0.590 | 0.540 | August 11, 2025 |
| sportqa | openai/gpt-4o | 0.245 | 0.250 | August 11, 2025 |
| web of lies | openai/gpt-4o | 0.135 | 0.145 | August 11, 2025 |
| object properties | openai/gpt-4o | 0.010 | 0.000 | August 11, 2025 |
| movie recommendation | openai/gpt-4o | 0.595 | 0.610 | August 11, 2025 |
| boardgame qa | openai/gpt-4o | 0.375 | 0.410 | August 11, 2025 |
| buggy tables | openai/gpt-4o | 0.010 | 0.005 | August 11, 2025 |
| shuffled objects | openai/gpt-4o | 0.135 | 0.140 | August 11, 2025 |
| word sorting | openai/gpt-4o | 0.175 | 0.220 | August 11, 2025 |
| nycc | openai/gpt-4o | 0.195 | 0.230 | August 11, 2025 |
| time arithmetic | openai/gpt-4o | 0.405 | 0.455 | August 11, 2025 |
| disambiguation qa | openai/gpt-4o | 0.550 | 0.517 | August 11, 2025 |
| geometric shapes | openai/gpt-4o | 0.220 | 0.225 | August 11, 2025 |
| temporal sequence | openai/gpt-4o | 0.005 | 0.000 | August 11, 2025 |
| spatial reasoning | openai/gpt-4o | 0.165 | 0.140 | August 11, 2025 |
| object counting | openai/gpt-4o | 0.090 | 0.065 | August 11, 2025 |
| sarc triples | openai/gpt-4o | 0.360 | 0.385 | August 11, 2025 |
| BBEH | openai/gpt-4o | 0.079 | 0.060 | August 11, 2025 |

BBEH mini

| Task | Model | Score on Inspect | Score (original paper) | Eval Date |
| --- | --- | --- | --- | --- |
| BBEH Mini | openai/gpt-4o | 0.224 | 0.235 | August 7, 2025 |