BIG-Bench Extra Hard

Reasoning

A reasoning capability dataset that replaces each task in BIG-Bench-Hard with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty.

Overview


The full dataset has 4,520 examples. You can evaluate a specific task type (one of the 23 tasks in BBEH) by specifying benchmark_task. Alternatively, you can evaluate on the mini subset, which has 460 examples (20 randomly selected from each task).

inspect eval inspect_evals/bbeh -T "benchmark_task='boardgame qa'"
inspect eval inspect_evals/bbeh_mini

Tasks

Here is the list of the 23 tasks you can specify. Each task contains multiple examples, totaling 4,520 across the full dataset.

  • boardgame qa
  • boolean expressions
  • buggy tables
  • causal understanding
  • disambiguation qa
  • dyck languages
  • geometric shapes
  • hyperbaton
  • linguini
  • movie recommendation
  • multistep arithmetic
  • nycc
  • object counting
  • object properties
  • sarc triples
  • shuffled objects
  • spatial reasoning
  • sportqa
  • temporal sequence
  • time arithmetic
  • web of lies
  • word sorting
  • zebra puzzles

Scoring

The original paper recommends the harmonic mean of per-task accuracies for the full dataset, and the micro average (standard overall accuracy) for the mini dataset.
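To illustrate how the two aggregation schemes differ, here is a minimal sketch (the helper names are illustrative only, not taken from the Inspect codebase):

```python
from statistics import harmonic_mean

def micro_average(correct, total):
    """Pooled accuracy: total correct over total examples across all tasks."""
    return sum(correct) / sum(total)

def harmonic_mean_accuracy(task_accuracies):
    """Harmonic mean of per-task accuracies."""
    return harmonic_mean(task_accuracies)

# Two hypothetical tasks: 18/20 correct and 5/20 correct.
print(micro_average([18, 5], [20, 20]))     # 0.575
print(harmonic_mean_accuracy([0.9, 0.25]))  # ~0.391
```

Unlike the micro average, the harmonic mean is dominated by the weakest tasks (it approaches zero as any single task accuracy approaches zero), so it rewards models that are uniformly strong across all of the probed reasoning capabilities.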

Model evaluation

BBEH

| Task | Model | Score on Inspect | Score (original paper) | Eval Date |
| --- | --- | --- | --- | --- |
| linguini | openai/gpt-4o | 0.140 | 0.155 | August 11, 2025 |
| multistep arithmetic | openai/gpt-4o | 0.030 | 0.055 | August 11, 2025 |
| zebra puzzles | openai/gpt-4o | 0.240 | 0.320 | August 11, 2025 |
| dyck languages | openai/gpt-4o | 0.090 | 0.080 | August 11, 2025 |
| boolean expressions | openai/gpt-4o | 0.260 | 0.225 | August 11, 2025 |
| hyperbaton | openai/gpt-4o | 0.060 | 0.075 | August 11, 2025 |
| causal understanding | openai/gpt-4o | 0.590 | 0.540 | August 11, 2025 |
| sportqa | openai/gpt-4o | 0.245 | 0.250 | August 11, 2025 |
| web of lies | openai/gpt-4o | 0.135 | 0.145 | August 11, 2025 |
| object properties | openai/gpt-4o | 0.010 | 0.000 | August 11, 2025 |
| movie recommendation | openai/gpt-4o | 0.595 | 0.610 | August 11, 2025 |
| boardgame qa | openai/gpt-4o | 0.375 | 0.410 | August 11, 2025 |
| buggy tables | openai/gpt-4o | 0.010 | 0.005 | August 11, 2025 |
| shuffled objects | openai/gpt-4o | 0.135 | 0.140 | August 11, 2025 |
| word sorting | openai/gpt-4o | 0.175 | 0.220 | August 11, 2025 |
| nycc | openai/gpt-4o | 0.195 | 0.230 | August 11, 2025 |
| time arithmetic | openai/gpt-4o | 0.405 | 0.455 | August 11, 2025 |
| disambiguation qa | openai/gpt-4o | 0.550 | 0.517 | August 11, 2025 |
| geometric shapes | openai/gpt-4o | 0.220 | 0.225 | August 11, 2025 |
| temporal sequence | openai/gpt-4o | 0.005 | 0.000 | August 11, 2025 |
| spatial reasoning | openai/gpt-4o | 0.165 | 0.140 | August 11, 2025 |
| object counting | openai/gpt-4o | 0.090 | 0.065 | August 11, 2025 |
| sarc triples | openai/gpt-4o | 0.360 | 0.385 | August 11, 2025 |
| BBEH | openai/gpt-4o | 0.079 | 0.060 | August 11, 2025 |

BBEH mini

| Task | Model | Score on Inspect | Score (original paper) | Eval Date |
| --- | --- | --- | --- | --- |
| BBEH Mini | openai/gpt-4o | 0.224 | 0.235 | August 7, 2025 |