SimpleQA/SimpleQA Verified: Measuring short-form factuality in large language models
A benchmark that evaluates the ability of language models to answer short, fact-seeking questions.
Overview
SimpleQA is a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. The benchmark evaluates two key aspects of model performance:
- Accuracy in answering questions
- Ability to abstain from answering when uncertain
This implementation is based on the official simple-evals implementation.
Additionally, we include SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality, based on OpenAI’s SimpleQA. It addresses critical limitations of the original benchmark, including noisy and incorrect labels, topical biases, and question redundancy. See the SimpleQA Verified section below.
Usage
First, install the dependencies:
```bash
uv sync
```
Then, evaluate against one or more models with:
```bash
uv run inspect eval inspect_evals/simpleqa --model openai/gpt-5-nano
uv run inspect eval inspect_evals/simpleqa_verified --model openai/gpt-5-nano
```
To run multiple tasks simultaneously, use `inspect eval-set`:

```bash
uv run inspect eval-set inspect_evals/simpleqa inspect_evals/simpleqa_verified
```
You can also import tasks as normal Python objects and run them from Python:

```python
from inspect_ai import eval, eval_set
from inspect_evals.simpleqa import simpleqa, simpleqa_verified

eval(simpleqa)
eval_set([simpleqa, simpleqa_verified], log_dir='logs-run-42')
```
After running evaluations, you can view their logs using the `inspect view` command:

```bash
uv run inspect view
```
For VS Code, you can also download the Inspect AI extension for viewing logs.
If you don’t want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
```
Options
You can control a variety of options from the command line. For example:
```bash
uv run inspect eval inspect_evals/simpleqa --limit 10
uv run inspect eval inspect_evals/simpleqa_verified --max-connections 10
uv run inspect eval inspect_evals/simpleqa --temperature 0.5
```
See `uv run inspect eval --help` for all available options.
This Task is configured to match the official implementation, in particular:
- The model has a temperature of 0.5 and max tokens of 2028.
- GPT-4o is used as the grader model with a temperature of 0.5.
Task parameters are provided to modify these configuration values:
```bash
inspect eval inspect_evals/simpleqa -T temperature=1 -T max_tokens=100 -T grader_model=anthropic/claude-3-5-sonnet-latest -T grader_temperature=1
```
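The same overrides can be applied from the Python API by passing them to the task function. A minimal sketch, assuming the parameters listed above map directly to keyword arguments of the `simpleqa` task:

```python
from inspect_ai import eval
from inspect_evals.simpleqa import simpleqa

# Sketch only: parameter names mirror the -T options above and are assumed
# to be accepted as keyword arguments by the simpleqa task function.
eval(
    simpleqa(
        temperature=1,
        max_tokens=100,
        grader_model="anthropic/claude-3-5-sonnet-latest",
        grader_temperature=1,
    ),
    model="openai/gpt-5-nano",
)
```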
SimpleQA
Dataset
The dataset consists of short-form questions and answers.
The questions were selected according to the following criteria:
- Must have one indisputable answer
- Answers must not change over time
- All answers must be supported by evidence
- Questions must be challenging (at least one model got it wrong)
- Questions must be answerable as of 2023
Example questions from the dataset:
Q: Who received the IEEE Frank Rosenblatt Award in 2010?
A: Michio Sugeno

Q: On which U.S. TV station did the Canadian reality series To Serve and Protect debut?
A: KVOS-TV
The SimpleQA dataset is available on HuggingFace.
Scoring
The benchmark is scored using the following metrics:
- Proportion of questions answered correctly
- Proportion of questions answered incorrectly
- Proportion of questions not attempted
- Proportion of questions answered correctly given that a question was attempted
- F-score: given as the harmonic mean of the following two metrics (see the computation sketch after this list):
  - Proportion of questions answered correctly
  - Proportion of questions answered correctly given that a question was attempted
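The sketch below illustrates how these metrics relate to one another. It is not the benchmark’s scorer code, just a minimal example that computes the reported values from a list of per-question grades:

```python
from collections import Counter

def simpleqa_metrics(grades: list[str]) -> dict[str, float]:
    """Compute SimpleQA-style metrics from per-question grades
    ("correct", "incorrect", or "not_attempted")."""
    counts = Counter(grades)
    total = len(grades)
    correct = counts["correct"] / total
    incorrect = counts["incorrect"] / total
    not_attempted = counts["not_attempted"] / total
    attempted = counts["correct"] + counts["incorrect"]
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0
    # F-score: harmonic mean of overall correct rate and correct-given-attempted
    denom = correct + correct_given_attempted
    f_score = 2 * correct * correct_given_attempted / denom if denom else 0.0
    return {
        "correct": correct,
        "incorrect": incorrect,
        "not_attempted": not_attempted,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }

print(simpleqa_metrics(["correct", "correct", "incorrect", "not_attempted"]))
```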
SimpleQA Verified
Dataset
The dataset consists of short-form questions and answers.
SimpleQA Verified aims to improve on SimpleQA by addressing the following issues:
- SimpleQA questions are derived from a narrow distribution of source documents due to substantial human rater biases
- SimpleQA suffers from incorrect ground truths and disproportionately leans toward specific topics and question formats
- SimpleQA contains a high degree of redundancy, with semantically similar or lexically overlapping questions
These issues create a noisy evaluation signal, making it difficult to discern whether performance gains stem from genuine improvements in factual recall or from models overfitting to the benchmark’s specific quirks.
Example questions from the dataset:
| Problem | Answer | Topic | Answer Type | Reasoning | Multi-Step |
|---|---|---|---|---|---|
| Which famous drummer opened a Wahoo’s Fish Taco restaurant in Norco, California, in 2004? | Travis Barker | Music | Person | YES | NO |
| What was the age gap between George Frederic Watts and his first wife, Ellen Terry? | 30 years. | Art | Number | NO | YES |
The dataset can be found on Hugging Face. The official implementation is on Kaggle.
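For a quick look at the data, the dataset can be loaded with the Hugging Face `datasets` library. A minimal sketch; the dataset ID and split below are placeholders (substitute the actual Hugging Face dataset linked above), and the field names are assumed to mirror the example table:

```python
from datasets import load_dataset  # requires the `datasets` package

# Placeholder ID: replace with the SimpleQA Verified dataset on Hugging Face.
dataset = load_dataset("<hf-namespace>/<simpleqa-verified-dataset>", split="train")

# Print the first few rows; fields should roughly correspond to the
# columns in the example table above (problem, answer, topic, etc.).
for row in dataset.select(range(3)):
    print(row)
```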
Scoring
The benchmark is scored using the following metrics:
- Proportion of questions answered correctly
- Proportion of questions answered incorrectly
- Proportion of questions not attempted
- Proportion of questions answered correctly given that a question was attempted
- F-score: given as the harmonic mean of:
  - Proportion of questions answered correctly
  - Proportion of questions answered correctly given that a question was attempted
Results Comparison with Paper
We ran an eval on September 22, 2025 with the following command:

```bash
uv run inspect eval inspect_evals/simpleqa_verified -T max_tokens=128000 --model=openai/gpt-5-nano-2025-08-07 --max-connections=200
```
| Model | F-1 Score (Ours/Paper) | Accuracy (Ours/Paper) | Acc. Given Attempted (Ours/Paper) | Attempted (Ours/Paper) |
|---|---|---|---|---|
| openai/gpt-5-nano | 15.2/14.4 | 10.8/10.2 | 25.6/24.2 | 42.2/42.2 |
Note that `gpt-5-nano` does not support setting temperature (it is fixed at OpenAI’s default of 1). We set the max completion tokens to OpenAI’s limit, since the paper doesn’t specify this value. We find that setting a lower `max_tokens` of 2048 causes accuracy to take a hit.
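As a quick sanity check (not part of the benchmark code), the reported F-1 values are consistent with the harmonic-mean definition above, applied to the accuracy and accuracy-given-attempted columns of the table:

```python
# Our run: harmonic mean of accuracy (10.8) and accuracy given attempted (25.6)
print(round(2 * 10.8 * 25.6 / (10.8 + 25.6), 1))  # 15.2
# Paper: harmonic mean of 10.2 and 24.2
print(round(2 * 10.2 * 24.2 / (10.2 + 24.2), 1))  # 14.4
```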