SimpleQA: Measuring short-form factuality in large language models

Knowledge

A benchmark that evaluates the ability of language models to answer short, fact-seeking questions.

Overview

SimpleQA is a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. The benchmark evaluates two key aspects of model performance:

  1. Accuracy in answering questions
  2. Ability to abstain from answering when uncertain

This implementation is based on the official simple-evals implementation.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Then, evaluate against one or more models with:

inspect eval inspect_evals/simpleqa --model openai/gpt-4o
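Evaluations can also be launched from Python rather than the command line. The sketch below uses inspect_ai's eval() function with the same registered task name as the CLI command above; treat it as an illustrative sketch rather than the canonical invocation.

from inspect_ai import eval

# Run SimpleQA against a single model using the registered task name.
# The limit argument restricts the run to the first 10 samples.
eval("inspect_evals/simpleqa", model="openai/gpt-4o", limit=10)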

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/simpleqa --limit 10
inspect eval inspect_evals/simpleqa --max-connections 10
inspect eval inspect_evals/simpleqa --temperature 0.5

See inspect eval --help for all available options.

This Task is configured to match the official implementation, in particular:

  • The candidate model is run with a temperature of 0.5 and a max tokens limit of 2028.
  • GPT-4o is used as the grader model with a temperature of 0.5.

Task parameters are provided to modify these configuration values:

inspect eval inspect_evals/simpleqa -T temperature=1 -T max_tokens=100 -T grader_model=anthropic/claude-3-5-sonnet-latest -T grader_temperature=1
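The same parameters can be passed when invoking the task from Python. This is a sketch only; the import path for the task function is assumed for illustration, so check the inspect_evals source for the exact module layout.

from inspect_ai import eval
from inspect_evals.simpleqa import simpleqa  # import path assumed for illustration

# Override the default generation and grading configuration via task parameters.
eval(
    simpleqa(
        temperature=1,
        max_tokens=100,
        grader_model="anthropic/claude-3-5-sonnet-latest",
        grader_temperature=1,
    ),
    model="openai/gpt-4o",
)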

Dataset

The dataset consists of 4,326 short-form, fact-seeking questions and their answers.

The questions were selected according to the following criteria:

  • Must have one indisputable answer
  • Answers must not change over time
  • All answers must be supported by evidence
  • Questions must be challenging (at least one model got it wrong)
  • Questions must be answerable as of 2023

Example questions from the dataset:

Q: Who received the IEEE Frank Rosenblatt Award in 2010?
A: Michio Sugeno

Q: On which U.S. TV station did the Canadian reality series To Serve and Protect debut?
A: KVOS-TV

The dataset can be found on Hugging Face here.
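If you want to inspect the raw data yourself, it can be loaded with inspect_ai's dataset helpers. The Hugging Face dataset ID, split, and column names below are assumptions made for illustration; check the eval's source or the dataset card for the exact values.

from inspect_ai.dataset import FieldSpec, hf_dataset

# Dataset ID, split, and column names are assumptions for illustration only.
dataset = hf_dataset(
    "basicv8vc/SimpleQA",
    split="test",
    sample_fields=FieldSpec(input="problem", target="answer"),
)
print(len(dataset), dataset[0].input, dataset[0].target)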

Scoring

The benchmark is scored using the following metrics:

  • Proportion of questions answered correctly
  • Proportion of questions answered incorrectly
  • Proportion of questions not attempted
  • Proportion of questions answered correctly given that a question was attempted
  • F-score: given as the harmonic mean of:
    • Proportion of questions answered correctly
    • Proportion of questions answered correctly given that a question was attempted
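To make the relationship between these metrics concrete, here is a small sketch that computes them from per-question grade counts; the counts in the usage example are placeholders, not benchmark results.

def simpleqa_metrics(n_correct: int, n_incorrect: int, n_not_attempted: int) -> dict:
    """Aggregate SimpleQA metrics from per-question grade counts."""
    total = n_correct + n_incorrect + n_not_attempted
    correct = n_correct / total
    incorrect = n_incorrect / total
    not_attempted = n_not_attempted / total
    attempted = n_correct + n_incorrect
    correct_given_attempted = n_correct / attempted if attempted else 0.0
    # F-score: harmonic mean of overall accuracy and accuracy on attempted questions.
    if correct > 0 and correct_given_attempted > 0:
        f_score = 2 * correct * correct_given_attempted / (correct + correct_given_attempted)
    else:
        f_score = 0.0
    return {
        "correct": correct,
        "incorrect": incorrect,
        "not_attempted": not_attempted,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }

# Placeholder counts for illustration:
print(simpleqa_metrics(n_correct=700, n_incorrect=200, n_not_attempted=100))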