AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
Evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information.
Overview
AbstentionBench is a large-scale benchmark for holistically evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information.
Usage
First, install the dependencies:
uv sync --extra abstention_bench
The first time you run AbstentionBench, it downloads all 20 datasets (except FreshQA, which is already bundled; see below) into the data/ folder inside abstention_bench, for a total of 369 MB. Note that GPQA is a gated repository, so you have to request access on Hugging Face; approval is typically instant.
The grader model defaults to Llama 3.1 8B Instruct, accessed via OpenRouter, so remember to set OPENROUTER_API_KEY in your .env. Alternatively, you can use a different grader or run Llama locally by changing the grader_model argument (see Options below).
Evaluate against one or more models with:
uv run inspect eval inspect_evals/abstention_bench --model openai/gpt-4o
After running evaluations, you can view their logs using the inspect view command:
uv run inspect view
If you don't want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can specify which dataset to run on (or leave it empty to run on all datasets). A fast sampling option (used in the paper) subsamples at most 100 questions per dataset, yielding 2,104 samples; the full benchmark has 39,558 samples.
uv run inspect eval inspect_evals/abstention_bench -T dataset_name=alcuna
uv run inspect eval inspect_evals/abstention_bench -T use_fast_sampling=True
uv run inspect eval inspect_evals/abstention_bench -T grader_model=openai/gpt-5
# Combine them
uv run inspect eval inspect_evals/abstention_bench -T dataset_name=alcuna -T use_fast_sampling=True
# More connections
uv run inspect eval inspect_evals/abstention_bench -T use_fast_sampling=True --max-connections=200
# The paper uses Llama 3.1 8B Instruct, temperature 0.8 and top_p 0.95, 4k tokens max
uv run inspect eval inspect_evals/abstention_bench -T use_fast_sampling=True --max-connections=200 --model "openrouter/meta-llama/llama-3.1-8b-instruct" --temperature 0.8 --top-p 0.95 --max-tokens 4000
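You can also drive the benchmark from Python via inspect_ai's eval() entry point. The sketch below is a minimal example, not the canonical interface: the import path for the task function and the exact task arguments are assumptions based on the -T options above, so check the task module if the import fails.

```python
# Minimal sketch of running AbstentionBench programmatically.
# Assumption: the task function is exported as inspect_evals.abstention_bench.abstention_bench
# and accepts the same arguments passed via -T on the CLI.
from inspect_ai import eval
from inspect_evals.abstention_bench import abstention_bench

logs = eval(
    abstention_bench(use_fast_sampling=True),  # equivalent to -T use_fast_sampling=True
    model="openrouter/meta-llama/llama-3.1-8b-instruct",
    temperature=0.8,       # paper settings
    top_p=0.95,
    max_tokens=4000,
    max_connections=200,
)
```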
Datasets
AbstentionBench includes 20 diverse datasets that test different aspects of when models should abstain:
- ALCUNA [subsampled]: alcuna
- Bias Benchmark for QA (BBQ) [subsampled]: bbq
- BIG-Bench Disambiguate: big_bench_disambiguate
- BIG-Bench Known Unknowns: big_bench_known_unknowns
- CoCoNot: coconot
- FalseQA: falseqa
- FreshQA: freshqa
- GPQA: gpqa
- GSM8K: gsm8k
- Known Unknown Questions (KUQ) [subsampled]: kuq
- MediQ: mediq
- MMLU History: mmlu_history
- MMLU Math: mmlu_math
- Musique: musique
- Question Answering with Questionable Assumptions (QAQA): qaqa
- QASPER: qasper
- SelfAware: self_aware
- SituatedQA: situated_qa
- SQuAD 2.0 [subsampled]: squad2
- UMWP: umwp
- WorldSense: worldsense
Note that FreshQA is constantly updated. To set up the latest FreshQA release, follow these directions from the AbstentionBench authors' README:
- Review the list of available FreshQA Google Sheets on the project’s README
- Pick a ‘baseline’ date, which should be before the model’s pretraining cut-off, and an ‘updated’ date, which should be after the cut-off
- For both baseline and updated, open the Google Sheet and export their contents as CSV (File > Download > Comma Separated Values)
- Move these two CSVs into data/freshqa/ and update the paths in configs/dataset/freshqa.yaml accordingly (see the sketch below)
Note: to exactly replicate the paper, use FreshQA_v10282024 as baseline and FreshQA_v12182024 as updated.
The two default FreshQA datasets above have been added to the Inspect Evals repo for convenience.
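If you do want to swap in a newer FreshQA release, a small helper like the following can copy your exported sheets into place. This is a hypothetical convenience sketch: the filenames and download location are assumptions, so substitute the versions you actually exported and keep configs/dataset/freshqa.yaml in sync.

```python
# Hypothetical helper for the FreshQA setup steps above.
# The sheet names below are examples -- substitute the baseline/updated
# versions you exported, and update configs/dataset/freshqa.yaml to match.
import shutil
from pathlib import Path

baseline_csv = "FreshQA_v10282024.csv"  # example: the baseline snapshot used in the paper
updated_csv = "FreshQA_v12182024.csv"   # example: the updated snapshot used in the paper

downloads = Path.home() / "Downloads"   # wherever you exported the Google Sheets
dest = Path("data/freshqa")
dest.mkdir(parents=True, exist_ok=True)

for name in (baseline_csv, updated_csv):
    shutil.copy(downloads / name, dest / name)
    print(f"Copied {name} -> {dest / name}")
```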
Scoring
The scorer uses an LLM judge (defaulting to Llama 3.1 8B Instruct as in the paper) to determine whether the model’s response appropriately abstains.
In the logs, target indicates whether the question warrants abstention (true if the model should abstain), and score indicates whether the grader judged the model's response to be an abstention (true if it did).
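If you want to pull these labels out of a finished run yourself, the sketch below uses inspect_ai's log-reading API. It is a rough sketch under stated assumptions: the log path is a placeholder, and the scorer name and value encoding may differ, so inspect one sample interactively before relying on it.

```python
# Sketch: recover the abstention labels from a completed eval log.
# Assumptions: each sample carries a single grader score whose value encodes
# the grader's abstention judgement; adjust after inspecting a real log.
from inspect_ai.log import read_eval_log

log = read_eval_log("logs/<your-run>.eval")  # pick a path shown by `inspect view`
for sample in log.samples or []:
    should_abstain = sample.target  # ground truth: should the model abstain?
    grader_judgement = (
        next(iter(sample.scores.values())).value if sample.scores else None
    )
    print(sample.id, should_abstain, grader_judgement)
```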
Model Evaluation
We ran the evaluation using the fast subset on August 25 with the following command:
uv run inspect eval inspect_evals/abstention_bench -T use_fast_sampling=True --max-connections=200 --model "openrouter/meta-llama/llama-3.1-8b-instruct" --temperature 0.8 --top-p 0.95 --max-tokens 4000
Here are the results:
| Scenario (Recall unless otherwise specified) | Inspect Evals | Paper |
|---|---|---|
| FreshQADataset | 0.762 | – |
| Squad2Dataset | 0.759 | – |
| SelfAwareDataset | 0.607 | – |
| QAQADataset | 0.500 | – |
| UMWP | 0.566 | – |
| GSM8K | 0.911 | 0.93 |
| SituatedQAGeoDataset | 0.600 | – |
| WorldSenseDataset | 0.727 | – |
| MediQDataset | 0.370 | – |
| BigBenchDisambiguateDataset | 0.667 | – |
| BBQDataset | 0.745 | – |
| MoralChoiceDataset | 0.467 | – |
| QASPERDataset | 0.714 | – |
| FalseQADataset | 0.526 | – |
| BigBenchKnownUnknownsDataset | 0.913 | – |
| MMLUHistory | 0.410 | 0.38 |
| CoCoNotDataset | 0.895 | – |
| GPQA | 0.575 | 0.53 |
| ALCUNADataset | 0.723 | – |
| MusiqueDataset | 0.829 | – |
| MMLUMath | 0.511 | 0.65 |
| KUQDataset | 0.745 | – |
| recall | 0.657 | – |
| precision | 0.542 | – |
| f1 | 0.594 | – |
The paper results are taken from Table 6 (p. 35) of the AbstentionBench paper.
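As a quick consistency check on the aggregate rows, f1 is the harmonic mean of precision and recall: 2 × 0.542 × 0.657 / (0.542 + 0.657) ≈ 0.594, which matches the reported value.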
Code attribution
Code within the configs/, data/, and recipe/ folders is sourced from the original repo. It would be preferable to import it as a package, but the original repo is not currently structured as one.