AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions

Evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information.

Contributed By

@jeqcho

Code

src/inspect_evals/abstention_bench

Paper

https://arxiv.org/pdf/2506.09038

Overview

AbstentionBench is a large-scale benchmark for holistically evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information.

Usage

Note: This eval has dependencies that conflict with other evals in this repo. It ships its own isolated environment under packages/abstention_bench/ rather than sharing the repo’s main one.

Installation

From the repository (recommended)

Sync the eval’s environment from its package directory:

cd packages/abstention_bench
uv sync

Then run evaluations from that directory:

uv run inspect eval inspect_evals/abstention_bench --model openai/gpt-5-nano

Into an existing environment

If you already have a Python environment, install the eval’s package directly — it brings in inspect-evals alongside the eval-specific dependencies:

pip install /path/to/inspect_evals/packages/abstention_bench/

Then run without uv run:

inspect eval inspect_evals/abstention_bench --model openai/gpt-5-nano

You can also import the task directly from Python:

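from inspect_ai import eval
from inspect_evals.abstention_bench import abstention_bench

# Pass the model explicitly, or omit it to fall back to INSPECT_EVAL_MODEL
eval(abstention_bench(), model="openai/gpt-5-nano")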

From a Jupyter notebook (using Inspect Flow)

Inspect Flow manages the isolated environment for you. Install it once:

pip install inspect-flow

Then from a notebook cell, point Flow at this eval’s pyproject.toml so it builds the isolated venv from the same dependency set as uv sync:

from inspect_flow import FlowSpec, FlowTask, FlowDependencies
from inspect_flow.api import run

spec = FlowSpec(
    execution_type="venv",
    log_dir="logs/",
    dependencies=FlowDependencies(
        dependency_file="path/to/inspect_evals/packages/abstention_bench/pyproject.toml",
    ),
    tasks=[FlowTask(name="inspect_evals/abstention_bench", model="openai/gpt-5-nano")],
)

run(spec)  # blocking; builds isolated venv on first run, cached thereafter

After the run completes, read and analyse logs in the same notebook — the eval’s heavy dependencies are not required for log reading.

Running evaluations

After installing via any of the methods above, you can evaluate models. The examples below assume the repository approach; drop uv run if you are managing dependencies yourself.

uv run inspect eval inspect_evals/abstention_bench --model openai/gpt-5-nano

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

If you use VS Code, you can also install the Inspect AI extension to view logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Note: The first time you run AbstentionBench, it downloads all 20 datasets except FreshQA, which is already bundled with the repository, for a total of 369MB in the Inspect cache directory (typically ~/.cache/inspect_evals/abstention_bench/data/). GPQA is a gated repository, so you must request access on Hugging Face; approval should be instant.

The grader model is Llama 3.1 8B Instruct, accessed through OpenRouter by default, so remember to set OPENROUTER_API_KEY in .env. Alternatively, you can use a different grader or run Llama locally by changing the grader_model argument (see Options below).

Options

You can specify which dataset to run (or omit dataset_name to run all datasets). As in the paper, a fast sampling mode is available that subsamples at most 100 questions per dataset, yielding 2,104 samples. The full benchmark has 39,558 samples.

uv run inspect eval inspect_evals/abstention_bench -T dataset_name=alcuna
uv run inspect eval inspect_evals/abstention_bench -T use_fast_sampling=True
uv run inspect eval inspect_evals/abstention_bench -T grader_model=openai/gpt-5

# Combine them
uv run inspect eval inspect_evals/abstention_bench -T dataset_name=alcuna -T use_fast_sampling=True

# More connections
uv run inspect eval inspect_evals/abstention_bench -T use_fast_sampling=True --max-connections=200

# The paper uses Llama 3.1 8B Instruct, temperature 0.8 and top_p 0.95, 4k tokens max
uv run inspect eval inspect_evals/abstention_bench -T use_fast_sampling=True --max-connections=200 --model "openrouter/meta-llama/llama-3.1-8b-instruct" --temperature 0.8 --top-p 0.95 --max-tokens 4000

Parameters

`abstention_bench`

dataset_name (Optional[Literal['alcuna', 'bbq', 'big_bench_disambiguate', 'big_bench_known_unknowns', 'coconot', 'falseqa', 'freshqa', 'gpqa', 'gsm8k', 'kuq', 'mediq', 'mmlu_history', 'mmlu_math', 'moralchoice', 'musique', 'qaqa', 'qasper', 'self_aware', 'situated_qa', 'squad2', 'umwp', 'worldsense']]): The name of the dataset to use. If None, all datasets are used. (default: None)
use_fast_sampling (bool): Whether to use fast sampling (smaller subsamples). (default: False)
grader_model (str | Model): The model to use for grading responses. (default: 'openrouter/meta-llama/llama-3.1-8b-instruct')
seed (int): The random seed for reproducible results. (default: 42)
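The interaction between use_fast_sampling and seed can be pictured with a small sketch. This is not the eval's actual sampling code: the function name subsample and its arguments are hypothetical, and it only assumes what is stated above (a cap of 100 questions per dataset and a fixed seed for reproducibility).

```python
import random


def subsample(samples, max_per_dataset=100, seed=42):
    """Return at most max_per_dataset items, chosen reproducibly.

    Hypothetical helper for illustration only; the real eval may
    select its subsample differently.
    """
    samples = list(samples)
    if len(samples) <= max_per_dataset:
        # Small datasets are kept whole, which is why the fast subset
        # can total fewer than 100 samples per dataset on average.
        return samples
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(samples, max_per_dataset)


small = subsample(range(30))
large_a = subsample(range(1000))
large_b = subsample(range(1000))
print(len(small), len(large_a), large_a == large_b)
```

With the same seed, repeated calls select the same questions, which is what makes results comparable across runs.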

Datasets

AbstentionBench includes 20 diverse datasets that test different aspects of when models should abstain:

ALCUNA [subsampled]: alcuna
Bias Benchmark for QA (BBQ) [subsampled]: bbq
BIG-Bench Disambiguate: big_bench_disambiguate
BIG-Bench Known Unknowns: big_bench_known_unknowns
CoCoNot: coconot
FalseQA: falseqa
FreshQA: freshqa
GPQA: gpqa
GSM8K: gsm8k
Known Unknown Questions (KUQ) [subsampled]: kuq
MediQ: mediq
MMLU History: mmlu_history
MMLU Math: mmlu_math
MoralChoice: moralchoice
Musique: musique
Question Answering with Questionable Assumptions (QAQA): qaqa
QASPER: qasper
SelfAware: self_aware
SituatedQA: situated_qa
SQuAD 2.0 [subsampled]: squad2
UMWP: umwp
WorldSense: worldsense

Note that FreshQA is constantly updated. To use the latest FreshQA, follow these directions from the AbstentionBench authors' README:

To set up to run with FreshQA:

Review the list of available FreshQA Google Sheets on the project’s README

Pick a ‘baseline’ date, which should be before the model’s pretraining cut-off, and an ‘updated’ date, which should be after the cut-off

For both baseline and updated, open the Google Sheet and export their contents as CSV (File > Download > Comma Separated Values)

Move these two CSVs into data/freshqa/, and update the paths in configs/dataset/freshqa.yaml accordingly

Note: To exactly replicate our work, use FreshQA_v10282024 and FreshQA_v12182024 as baseline and updated respectively.

The two default FreshQA datasets above have been added to the Inspect Evals repo for convenience.

Scoring

The scorer uses an LLM judge (defaulting to Llama 3.1 8B Instruct as in the paper) to determine whether the model’s response appropriately abstains.

In the logs, target indicates whether the model should abstain on the question (true if it should), and score indicates whether the grader judged the response to be an abstention (true if it did).
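To make the relationship between target, score, and the reported metrics concrete, here is a minimal sketch that treats abstention as the positive class and applies the standard precision, recall, and F1 definitions. The function name abstention_metrics and the example values are illustrative assumptions, not the eval's own implementation.

```python
def abstention_metrics(targets, predictions):
    """Compute (precision, recall, f1) with abstention as the positive class.

    targets[i]     -- True if question i should be abstained on
    predictions[i] -- True if the grader judged response i an abstention
    Hypothetical helper for illustration; the eval's metric code may differ.
    """
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t and p)        # correctly abstained
    fp = sum(1 for t, p in pairs if not t and p)    # abstained unnecessarily
    fn = sum(1 for t, p in pairs if t and not p)    # answered when it should abstain
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example: three questions call for abstention, two do not.
targets = [True, True, True, False, False]
predictions = [True, True, False, True, False]
precision, recall, f1 = abstention_metrics(targets, predictions)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this reading, recall is the fraction of questions that should be abstained on where the model actually abstained.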

Model Evaluation

We ran the evaluation on the fast subset on August 25 with the following command:

uv run inspect eval inspect_evals/abstention_bench -T use_fast_sampling=True --max-connections=200 --model "openrouter/meta-llama/llama-3.1-8b-instruct" --temperature 0.8 --top-p 0.95 --max-tokens 4000

Here are the results:

Scenario (Recall unless otherwise specified)	Inspect Evals	Paper
FreshQADataset	0.762	–
Squad2Dataset	0.759	–
SelfAwareDataset	0.607	–
QAQADataset	0.500	–
UMWP	0.566	–
GSM8K	0.911	0.93
SituatedQAGeoDataset	0.600	–
WorldSenseDataset	0.727	–
MediQDataset	0.370	–
BigBenchDisambiguateDataset	0.667	–
BBQDataset	0.745	–
MoralChoiceDataset	0.467	–
QASPERDataset	0.714	–
FalseQADataset	0.526	–
BigBenchKnownUnknownsDataset	0.913	–
MMLUHistory	0.410	0.38
CoCoNotDataset	0.895	–
GPQA	0.575	0.53
ALCUNADataset	0.723	–
MusiqueDataset	0.829	–
MMLUMath	0.511	0.65
KUQDataset	0.745	–
recall	0.657	–
precision	0.542	–
f1	0.594	–

The paper results are from Table 6 (pg 35).
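As a quick sanity check on the table, the sketch below computes the gap between this implementation and the paper for the four rows where a paper value is listed, and confirms that the aggregate F1 matches the harmonic mean of the reported precision and recall. The numbers are copied from the table above; the variable names are only illustrative.

```python
# (Inspect Evals recall, paper recall) for rows with a paper value
results = {
    "GSM8K": (0.911, 0.93),
    "MMLUHistory": (0.410, 0.38),
    "GPQA": (0.575, 0.53),
    "MMLUMath": (0.511, 0.65),
}

# Positive delta: this implementation scored higher than the paper
deltas = {name: round(ours - paper, 3) for name, (ours, paper) in results.items()}
largest_gap = max(deltas, key=lambda name: abs(deltas[name]))

# Aggregate rows from the table
precision, recall = 0.542, 0.657
f1 = round(2 * precision * recall / (precision + recall), 3)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
print("largest gap:", largest_gap)
print("f1 from precision and recall:", f1)
```

Three of the four rows land within about 0.05 of the paper; MMLUMath shows the largest gap, at roughly 0.14 below the paper's figure.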

Code attribution

Code within the configs/, data/ and recipe/ folders is sourced from the original repo. It would be preferable to import this code as a package, but the original repo is not currently structured as one.

Changelog

[3-A] - 2026-07-03

Fix abstention prediction extraction. The grader’s verdict is the string "Yes"/"No", but predictions compared score.text against "1.0", which never matched, so every prediction was treated as non-abstention and recall/F1 were 0 for every model. Predictions now convert the verdict with value_to_float ("yes" → 1.0).
Fix the abstention judge grade-pattern. grade_pattern=r"(?i)(yes|no)" bound to the first yes/no in the grader’s output and had no word boundaries, so no was matched inside words like know/cannot/not/unknown; a grader that reasoned before its verdict (e.g. noting the model said “I don’t know”) was scored as no regardless of its final answer. The pattern now binds to the grader’s final Yes/No verdict with word boundaries (r"(?is).*\b(yes|no)\b").
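The difference between the two grade patterns can be reproduced with Python's re module alone. The sketch below assumes the pattern is applied with re.search and that the first capture group is taken as the verdict; the helper extract_verdict and the sample grader output are hypothetical.

```python
import re

OLD_PATTERN = r"(?i)(yes|no)"          # first yes/no anywhere, even inside words
NEW_PATTERN = r"(?is).*\b(yes|no)\b"   # last whole-word yes/no in the output


def extract_verdict(pattern, grader_output):
    """Return the captured verdict in lowercase, or None if nothing matches.

    Hypothetical helper: assumes re.search plus group(1), which may not
    mirror the scorer's internals exactly.
    """
    match = re.search(pattern, grader_output)
    return match.group(1).lower() if match else None


# A grader that reasons first, then gives its verdict on the last line.
grader_output = "The model said it does not know the answer, so it abstained.\nYes"

print(extract_verdict(OLD_PATTERN, grader_output))  # matches the "no" inside "not"
print(extract_verdict(NEW_PATTERN, grader_output))  # matches the final "Yes"
```

The greedy `.*` combined with the `s` flag consumes the whole output, including newlines, and then backtracks to the last whole-word match, which is why the new pattern picks up the final verdict.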

[2-A] - 2026-02-16

Migrate version to new scheme. See #907.

[1.0.1] - 2025-12-18

Add a backoff policy for functions that connect to Hugging Face servers.