# SimpleQA: Measuring short-form factuality in large language models

A benchmark that evaluates the ability of language models to answer short, fact-seeking questions.

## Overview
SimpleQA is a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. The benchmark evaluates two key aspects of model performance:
- Accuracy in answering questions
- Ability to abstain from answering when uncertain
This implementation is based on the official [simple-evals](https://github.com/openai/simple-evals) implementation.
## Usage

First, install the `inspect_ai` and `inspect_evals` Python packages with:

```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```
Then, evaluate against one or more models with:

```bash
inspect eval inspect_evals/simpleqa --model openai/gpt-4o
```
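Evaluations can also be run from Python. The sketch below assumes the `inspect_evals` package is installed, which makes the task available in Inspect's task registry:

```python
from inspect_ai import eval

# Run SimpleQA via Inspect's Python API; equivalent to the CLI call above.
# Assumes inspect_evals is installed so the task name resolves.
eval("inspect_evals/simpleqa", model="openai/gpt-4o")
```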
After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```
If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
## Options

You can control a variety of options from the command line. For example:

```bash
inspect eval inspect_evals/simpleqa --limit 10
inspect eval inspect_evals/simpleqa --max-connections 10
inspect eval inspect_evals/simpleqa --temperature 0.5
```

See `inspect eval --help` for all available options.
This task is configured to match the official implementation, in particular:

- The model is configured with a temperature of 0.5 and max tokens of 2048.
- GPT-4o is used as the grader model with a temperature of 0.5.

Task parameters are provided to modify these configuration values:

```bash
inspect eval inspect_evals/simpleqa -T temperature=1 -T max_tokens=100 -T grader_model=anthropic/claude-3-5-sonnet-latest -T grader_temperature=1
```
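The same overrides can be passed from Python via `eval()`'s `task_args` argument (a sketch, assuming the parameter names shown above):

```python
from inspect_ai import eval

# Pass task parameters from Python; task_args mirrors the -T flags above.
eval(
    "inspect_evals/simpleqa",
    model="openai/gpt-4o",
    task_args={
        "temperature": 1,
        "max_tokens": 100,
        "grader_model": "anthropic/claude-3-5-sonnet-latest",
        "grader_temperature": 1,
    },
)
```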
## Dataset

The dataset consists of short-form questions and answers. The questions were selected according to the following criteria:

- Must have a single, indisputable answer
- Answers must not change over time
- All answers must be supported by evidence
- Questions must be challenging (at least one model got it wrong)
- Questions must be answerable as of 2023
Example questions from the dataset:

> Q: Who received the IEEE Frank Rosenblatt Award in 2010?
> A: Michio Sugeno

> Q: On which U.S. TV station did the Canadian reality series To Serve and Protect debut?
> A: KVOS-TV
The dataset can be found on [Hugging Face](https://huggingface.co/datasets/basicv8vc/SimpleQA).
## Scoring

The benchmark is scored using the following metrics:

- Proportion of questions answered correctly
- Proportion of questions answered incorrectly
- Proportion of questions not attempted
- Proportion of questions answered correctly given that a question was attempted
- F-score: the harmonic mean of:
  - Proportion of questions answered correctly
  - Proportion of questions answered correctly given that a question was attempted
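To make the relationship between these metrics concrete, here is a short illustrative sketch (a hypothetical helper, not the scorer this eval ships with) that computes them from per-question grades:

```python
def simpleqa_metrics(grades: list[str]) -> dict[str, float]:
    """Compute SimpleQA-style aggregate metrics from per-question grades,
    each one of "correct", "incorrect", or "not_attempted".
    Illustrative only; not the scorer used by this eval."""
    n = len(grades)
    correct = grades.count("correct") / n
    incorrect = grades.count("incorrect") / n
    not_attempted = grades.count("not_attempted") / n
    attempted = correct + incorrect
    correct_given_attempted = correct / attempted if attempted > 0 else 0.0
    # F-score: harmonic mean of overall accuracy and accuracy given attempted
    denominator = correct + correct_given_attempted
    f_score = 2 * correct * correct_given_attempted / denominator if denominator > 0 else 0.0
    return {
        "correct": correct,
        "incorrect": incorrect,
        "not_attempted": not_attempted,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }


print(simpleqa_metrics(["correct", "correct", "incorrect", "not_attempted"]))
# correct=0.5, incorrect=0.25, not_attempted=0.25,
# correct_given_attempted≈0.667, f_score≈0.571
```

Note that an incorrect answer lowers both overall accuracy and accuracy given attempted, while abstaining only lowers overall accuracy, so the F-score rewards models that abstain when uncertain rather than guess.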