SimpleQA: Measuring short-form factuality in large language models

Knowledge

A benchmark that evaluates the ability of language models to answer short, fact-seeking questions.

Overview

SimpleQA is a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. The benchmark evaluates two key aspects of model performance:

  1. Accuracy in answering questions
  2. Ability to abstain from answering when uncertain

This implementation is based on the official simple-evals implementation.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Then, evaluate against one or more models with:

inspect eval inspect_evals/simpleqa --model openai/gpt-4o
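Evaluations can also be launched from Python rather than the command line. The sketch below uses inspect_ai's eval() function with the same registered task name as the CLI command above; treat it as an illustrative sketch rather than the canonical invocation.

from inspect_ai import eval

# Run SimpleQA against a single model using the registered task name.
# The limit argument restricts the run to the first 10 samples.
eval("inspect_evals/simpleqa", model="openai/gpt-4o", limit=10)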

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/simpleqa --limit 10
inspect eval inspect_evals/simpleqa --max-connections 10
inspect eval inspect_evals/simpleqa --temperature 0.5

See inspect eval --help for all available options.

This Task is configured to match the official implementation, in particular:

  • The candidate model is run with a temperature of 0.5 and a max tokens limit of 2028.
  • GPT-4o is used as the grader model with a temperature of 0.5.

Task parameters are provided to modify these configuration values:

inspect eval inspect_evals/simpleqa -T temperature=1 -T max_tokens=100 -T grader_model=anthropic/claude-3-5-sonnet-latest -T grader_temperature=1
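The same parameters can be passed when invoking the task from Python. This is a sketch only; the import path for the task function is assumed for illustration, so check the inspect_evals source for the exact module layout.

from inspect_ai import eval
from inspect_evals.simpleqa import simpleqa  # import path assumed for illustration

# Override the default generation and grading configuration via task parameters.
eval(
    simpleqa(
        temperature=1,
        max_tokens=100,
        grader_model="anthropic/claude-3-5-sonnet-latest",
        grader_temperature=1,
    ),
    model="openai/gpt-4o",
)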

Dataset

The dataset consists of 4,326 short-form, fact-seeking questions and their answers.

The questions were selected according to the following criteria:

  • Must have one indisputable answer
  • Answers must not change over time
  • All answers must be supported by evidence
  • Questions must be challenging (at least one model got it wrong)
  • Questions must be answerable as of 2023

Example questions from the dataset:

Q: Who received the IEEE Frank Rosenblatt Award in 2010?
A: Michio Sugeno

Q: On which U.S. TV station did the Canadian reality series To Serve and Protect debut?
A: KVOS-TV

The dataset can be found on Hugging Face here.
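If you want to inspect the raw data yourself, it can be loaded with inspect_ai's dataset helpers. The Hugging Face dataset ID, split, and column names below are assumptions made for illustration; check the eval's source or the dataset card for the exact values.

from inspect_ai.dataset import FieldSpec, hf_dataset

# Dataset ID, split, and column names are assumptions for illustration only.
dataset = hf_dataset(
    "basicv8vc/SimpleQA",
    split="test",
    sample_fields=FieldSpec(input="problem", target="answer"),
)
print(len(dataset), dataset[0].input, dataset[0].target)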

Scoring

The benchmark is scored using the following metrics:

  • Proportion of questions answered correctly
  • Proportion of questions answered incorrectly
  • Proportion of questions not attempted
  • Proportion of questions answered correctly given that a question was attempted
  • F-score: given as the harmonic mean of:
    • Proportion of questions answered correctly
    • Proportion of questions answered correctly given that a question was attempted
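To make the relationship between these metrics concrete, here is a small sketch that computes them from per-question grade counts; the counts in the usage example are placeholders, not benchmark results.

def simpleqa_metrics(n_correct: int, n_incorrect: int, n_not_attempted: int) -> dict:
    """Aggregate SimpleQA metrics from per-question grade counts."""
    total = n_correct + n_incorrect + n_not_attempted
    correct = n_correct / total
    incorrect = n_incorrect / total
    not_attempted = n_not_attempted / total
    attempted = n_correct + n_incorrect
    correct_given_attempted = n_correct / attempted if attempted else 0.0
    # F-score: harmonic mean of overall accuracy and accuracy on attempted questions.
    if correct > 0 and correct_given_attempted > 0:
        f_score = 2 * correct * correct_given_attempted / (correct + correct_given_attempted)
    else:
        f_score = 0.0
    return {
        "correct": correct,
        "incorrect": incorrect,
        "not_attempted": not_attempted,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }

# Placeholder counts for illustration:
print(simpleqa_metrics(n_correct=700, n_incorrect=200, n_not_attempted=100))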