SimpleQA/SimpleQA Verified: Measuring short-form factuality in large language models
A benchmark that evaluates the ability of language models to answer short, fact-seeking questions.
Overview
SimpleQA is a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. The benchmark evaluates two key aspects of model performance:
- Accuracy in answering questions
- Ability to abstain from answering when uncertain
This implementation is based on the official simple-evals implementation.
Additionally, we include SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI’s SimpleQA. It addresses critical limitations in OpenAI’s benchmark, including noisy and incorrect labels, topical biases, and question redundancy. See the SimpleQA Verified section below.
Usage
Installation
There are two ways to use Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.
If you are using it from PyPI, install the package and its dependencies via:
pip install inspect-evals
If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:
uv sync
Running evaluations
Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.
uv run inspect eval inspect_evals/simpleqa --model openai/gpt-5-nano
uv run inspect eval inspect_evals/simpleqa_verified --model openai/gpt-5-nano
To run multiple tasks simultaneously, use inspect eval-set:
uv run inspect eval-set inspect_evals/simpleqa inspect_evals/simpleqa_verified
You can also import tasks as normal Python objects and run them from Python:
from inspect_ai import eval, eval_set
from inspect_evals.simpleqa import simpleqa, simpleqa_verified
eval(simpleqa)
eval_set([simpleqa, simpleqa_verified], log_dir='logs-run-42')
After running evaluations, you can view their logs using the inspect view command:
uv run inspect view
For VS Code, you can also download the Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/simpleqa --limit 10
uv run inspect eval inspect_evals/simpleqa_verified --max-connections 10
uv run inspect eval inspect_evals/simpleqa --temperature 0.5
See uv run inspect eval --help for all available options.
Parameters
simpleqa, simpleqa_verified
scorer (Literal['tool', 'original']): Scorer to use. "tool" uses schema_tool_graded_scorer; "original" uses the paper's string-matching scorer. (default: 'tool')
Task variants
This eval provides two tasks — one per dataset:
| Task | Dataset | Default Scorer | Configuration |
|---|---|---|---|
| simpleqa | SimpleQA | Tool-calling (schema_tool_graded_scorer) | No hardcoded defaults; uses framework/CLI settings |
| simpleqa_verified | SimpleQA Verified | Tool-calling (schema_tool_graded_scorer) | No hardcoded defaults |
Both tasks accept a scorer parameter to switch between the default tool-calling scorer and the paper’s original string-matching scorer:
# Default: robust tool-calling scorer
inspect eval inspect_evals/simpleqa --model openai/gpt-4o-mini
# Paper-faithful: string-matching scorer (A/B/C letters)
inspect eval inspect_evals/simpleqa -T scorer=original --model openai/gpt-4o-mini
Reproducing paper results
Generation config and grader model are not hardcoded in the tasks. To reproduce the paper’s exact settings, use --generate-config and --model-role:
# SimpleQA with paper defaults
inspect eval inspect_evals/simpleqa -T scorer=original \
--generate-config src/inspect_evals/simpleqa/paper_config/simpleqa.yaml \
--model-role 'grader={"model": "openai/gpt-4o", "temperature": 0.5}' \
--model openai/gpt-4o-mini
# SimpleQA Verified with paper defaults
inspect eval inspect_evals/simpleqa_verified -T scorer=original \
--generate-config src/inspect_evals/simpleqa/paper_config/simpleqa_verified.yaml \
--model-role 'grader={"model": "openai/gpt-4.1-2025-04-14", "temperature": 1.0}' \
--model openai/gpt-4o-mini
The YAML config files in paper_config/ contain the paper-faithful GenerateConfig values (temperature, max_tokens) with provenance comments. The grader model is specified via --model-role because --generate-config only sets generation config for the evaluated model.
If no grader role is bound, the evaluated model is used as the grader.
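The fallback rule can be pictured with a minimal sketch. The function `resolve_grader` and the plain dictionary of roles are hypothetical names used only for illustration; this is not how the framework implements role lookup internally.

```python
def resolve_grader(model_roles: dict[str, str], evaluated_model: str) -> str:
    """Return the model bound to the "grader" role, or the evaluated model if none is bound.

    Hypothetical helper: it mirrors the documented fallback, not the framework's code.
    """
    return model_roles.get("grader", evaluated_model)


# A grader role is bound, so it is used for grading.
with_role = resolve_grader({"grader": "openai/gpt-4o"}, "openai/gpt-4o-mini")

# No grader role is bound, so the evaluated model grades its own answers.
without_role = resolve_grader({}, "openai/gpt-4o-mini")

print(with_role, without_role)
```

In practice this means that omitting --model-role makes the model under evaluation judge its own output, which is worth keeping in mind when comparing results across runs.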
SimpleQA
Dataset
The dataset consists of short-form questions and answers.
The questions were selected according to the following criteria:
- Must have one indisputable answer
- Answers must not change over time
- All answers must be supported by evidence
- Questions must be challenging (at least one model got it wrong)
- Questions must be answerable as of 2023
Example questions from the dataset:
Q: Who received the IEEE Frank Rosenblatt Award in 2010?
A: Michio Sugeno
Q: On which U.S. TV station did the Canadian reality series To Serve and Protect debut?
A: KVOS-TV
The SimpleQA dataset is available on Hugging Face.
Scoring
The benchmark is scored using the following metrics:
- Proportion of questions answered correctly
- Proportion of questions answered incorrectly
- Proportion of questions not attempted
- Proportion of questions answered correctly given that a question was attempted
- F-score: the harmonic mean of:
  - Proportion of questions answered correctly
  - Proportion of questions answered correctly given that a question was attempted
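As a rough sketch of how these metrics relate, the following Python function computes them from per-question grades. The label strings ("correct", "incorrect", "not_attempted") and the function name `summarize_grades` are assumptions chosen for illustration, not the identifiers used by the scorers in this repository.

```python
from collections import Counter


def summarize_grades(grades: list[str]) -> dict[str, float]:
    """Compute SimpleQA-style metrics from per-question grade labels.

    Assumes every grade is "correct", "incorrect", or "not_attempted"
    (hypothetical label names for this sketch).
    """
    if not grades:
        raise ValueError("grades must not be empty")

    counts = Counter(grades)
    total = len(grades)

    correct = counts["correct"] / total
    incorrect = counts["incorrect"] / total
    not_attempted = counts["not_attempted"] / total

    # Accuracy restricted to questions the model actually tried to answer.
    attempted = counts["correct"] + counts["incorrect"]
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0

    # Harmonic mean of overall accuracy and accuracy on attempted questions.
    denom = correct + correct_given_attempted
    f_score = 2 * correct * correct_given_attempted / denom if denom else 0.0

    return {
        "correct": correct,
        "incorrect": incorrect,
        "not_attempted": not_attempted,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }


# 3 correct, 1 incorrect, 6 not attempted out of 10 questions.
metrics = summarize_grades(["correct"] * 3 + ["incorrect"] + ["not_attempted"] * 6)
print(metrics)
```

Because the F-score combines overall accuracy with accuracy on attempted questions, a model that abstains often can score well on the second term while still being penalized on the first.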
SimpleQA Verified
Dataset
The dataset consists of short-form questions and answers.
SimpleQA Verified aims to address the following weaknesses of SimpleQA:
- SimpleQA questions are derived from a narrow distribution of source documents due to substantial human rater biases
- SimpleQA suffers from incorrect ground truths and disproportionately leans toward specific topics and question formats
- SimpleQA contains a high degree of redundancy with semantically similar or lexically overlapping questions
These issues create a noisy evaluation signal, making it difficult to discern whether performance gains stem from genuine improvements in factual recall or from models overfitting to the benchmark’s specific quirks.
Example questions from the dataset:
| Problem | Answer | Topic | Answer Type | Reasoning | Multi-Step |
|---|---|---|---|---|---|
| Which famous drummer opened a Wahoo’s Fish Taco restaurant in Norco, California, in 2004? | Travis Barker | Music | Person | YES | NO |
| What was the age gap between George Frederic Watts and his first wife, Ellen Terry? | 30 years. | Art | Number | NO | YES |
The dataset can be found on Hugging Face. The official implementation is on Kaggle.
Scoring
The benchmark is scored using the following metrics:
- Proportion of questions answered correctly
- Proportion of questions answered incorrectly
- Proportion of questions not attempted
- Proportion of questions answered correctly given that a question was attempted
- F-score: the harmonic mean of:
  - Proportion of questions answered correctly
  - Proportion of questions answered correctly given that a question was attempted
Evaluation Report
Results Comparison with Paper
Original scorer (-T scorer=original, September 22, 2025)
uv run inspect eval inspect_evals/simpleqa_verified --max-tokens 128000 --model=openai/gpt-5-nano-2025-08-07 --max-connections=200
| Model | F-1 Score (Ours/Paper) | Accuracy (Ours/Paper) | Acc. given Attempted (Ours/Paper) | Attempted (Ours/Paper) |
|---|---|---|---|---|
| openai/gpt-5-nano | 15.2/14.4 | 10.8/10.2 | 25.6/24.2 | 42.2/42.2 |
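The columns in this table are tied together by the F-score definition, which allows a quick sanity check of the reported figures. The sketch below recomputes the F-score from the accuracy and accuracy-given-attempted percentages in the table; the helper name `f_score` is illustrative, and small discrepancies are possible because the inputs are already rounded to one decimal place.

```python
def f_score(accuracy: float, accuracy_given_attempted: float) -> float:
    """Harmonic mean of accuracy and accuracy given attempted (both in percent)."""
    total = accuracy + accuracy_given_attempted
    return 2 * accuracy * accuracy_given_attempted / total if total else 0.0


# (accuracy, accuracy given attempted) as reported in the table above.
reported = {
    "ours": (10.8, 25.6),
    "paper": (10.2, 24.2),
}

recomputed = {name: f_score(acc, acc_att) for name, (acc, acc_att) in reported.items()}
for name, value in recomputed.items():
    print(f"{name}: F-score = {value:.1f}")
```

The recomputed values round to the F-1 scores listed in the table for both this implementation and the paper.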
Tool-calling scorer (default, March 23, 2026)
uv run inspect eval inspect_evals/simpleqa_verified \
--model openai/gpt-5-nano-2025-08-07 \
--generate-config src/inspect_evals/simpleqa/paper_config/simpleqa_verified.yaml \
--model-role 'grader={"model": "openai/gpt-4.1-2025-04-14", "temperature": 1.0}' \
--max-tokens 128000 \
--max-connections 200
| Model | Accuracy (Ours/Paper) |
|---|---|
| openai/gpt-5-nano | 11.0/10.2 |
Note that gpt-5-nano does not support setting temperature, so it runs at OpenAI’s default of 1. We set the max completion tokens to 128000 (the limit from OpenAI), since the paper doesn’t specify this.
Changelog
[3-B] - 2026-03-17
- Refactor SimpleQA and SimpleQA Verified to use a single task per dataset with external paper config via --generate-config and --model-role, plus -T scorer=original for paper-faithful scoring.
[2-A] - 2026-02-16
- Migrate version to new scheme. See #907.
[1.0.1] - 2025-12-18
- Add backoff policy for functions that connect to Hugging Face servers.