SimpleQA/SimpleQA Verified: Measuring short-form factuality in large language models

Knowledge

A benchmark that evaluates the ability of language models to answer short, fact-seeking questions.

Overview

SimpleQA is a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. The benchmark evaluates two key aspects of model performance:

  1. Accuracy in answering questions
  2. Ability to abstain from answering when uncertain

This implementation is based on the official simple-evals implementation.

Additionally, we include SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI’s SimpleQA. It addresses critical limitations in OpenAI’s benchmark, including noisy and incorrect labels, topical biases, and question redundancy. See the SimpleQA Verified section below.

Usage

First, install the dependencies:

uv sync

Then, evaluate against one or more models with:

uv run inspect eval inspect_evals/simpleqa --model openai/gpt-5-nano
uv run inspect eval inspect_evals/simpleqa_verified --model openai/gpt-5-nano

To run multiple tasks simultaneously, use inspect eval-set:

uv run inspect eval-set inspect_evals/simpleqa inspect_evals/simpleqa_verified

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval, eval_set
from inspect_evals.simpleqa import simpleqa, simpleqa_verified
eval(simpleqa)
eval_set([simpleqa, simpleqa_verified], log_dir='logs-run-42')

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also install the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/simpleqa --limit 10
uv run inspect eval inspect_evals/simpleqa_verified --max-connections 10
uv run inspect eval inspect_evals/simpleqa --temperature 0.5

See uv run inspect eval --help for all available options.

This Task is configured to match the official implementation, in particular:

  • The model is run with a temperature of 0.5 and a max tokens limit of 2048.
  • GPT-4o is used as the grader model with a temperature of 0.5.

Task parameters are provided to modify these configuration values:

inspect eval inspect_evals/simpleqa -T temperature=1 -T max_tokens=100 -T grader_model=anthropic/claude-3-5-sonnet-latest -T grader_temperature=1
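
The same configuration can be applied when running the task from Python. A minimal sketch, assuming the task function accepts keyword arguments matching the -T options above:

from inspect_ai import eval
from inspect_evals.simpleqa import simpleqa

# Parameter names are assumed to mirror the -T options shown above.
eval(
    simpleqa(
        temperature=1,
        max_tokens=100,
        grader_model="anthropic/claude-3-5-sonnet-latest",
        grader_temperature=1,
    ),
    model="openai/gpt-5-nano",
)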

SimpleQA

Dataset

The dataset consists of short-form questions and answers.

The questions were selected according to the following criteria:

  • Must have one indisputable answer
  • Answers must not change over time
  • All answers must be supported by evidence
  • Questions must be challenging (at least one model got it wrong)
  • Questions must be answerable as of 2023

Example questions from the dataset:

Q: Who received the IEEE Frank Rosenblatt Award in 2010?
A: Michio Sugeno

Q: On which U.S. TV station did the Canadian reality series To Serve and Protect debut?
A: KVOS-TV

The SimpleQA dataset is available on Hugging Face.

Scoring

The benchmark is scored using the following metrics:

  • Proportion of questions answered correctly
  • Proportion of questions answered incorrectly
  • Proportion of questions not attempted
  • Proportion of questions answered correctly given that a question was attempted
  • F-score: given as the harmonic mean of:
    • Proportion of questions answered correctly
    • Proportion of questions answered correctly given that a question was attempted

SimpleQA Verified

Dataset

The dataset consists of short-form questions and answers.

SimpleQA Verified aims to address the following issues with the original SimpleQA benchmark:

  • SimpleQA questions are derived from a narrow distribution of source documents due to substantial human rater biases
  • SimpleQA suffers from incorrect ground truths and disproportionately leans toward specific topics and question formats
  • SimpleQA contains a high degree of redundancy with semantically similar or lexically overlapping questions

These issues create a noisy evaluation signal, making it difficult to discern whether performance gains stem from genuine improvements in factual recall or from models overfitting to the benchmark’s specific quirks.

Example questions from the dataset:

| Problem | Answer | Topic | Answer Type | Reasoning | Multi-Step |
| --- | --- | --- | --- | --- | --- |
| Which famous drummer opened a Wahoo’s Fish Taco restaurant in Norco, California, in 2004? | Travis Barker | Music | Person | YES | NO |
| What was the age gap between George Frederic Watts and his first wife, Ellen Terry? | 30 years | Art | Number | NO | YES |

The SimpleQA Verified dataset can be found on Hugging Face, and the official implementation is on Kaggle.
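
To inspect the data directly, you can load it with the datasets library. A minimal sketch, assuming the dataset identifier is google/simpleqa-verified and that records carry the fields shown in the table above (check the Hugging Face page for the exact id and schema):

from datasets import load_dataset

# Dataset id assumed to be "google/simpleqa-verified"; verify against the Hugging Face page.
ds = load_dataset("google/simpleqa-verified")
print(ds)                        # shows the available splits and column names
split = next(iter(ds.values()))  # take the first available split
print(len(split))                # expect 1,000 questions
print(split[0])                  # one record: problem, answer, topic, answer type, etc.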

Scoring

The benchmark is scored using the following metrics (a worked example follows the list):

  • Proportion of questions answered correctly
  • Proportion of questions answered incorrectly
  • Proportion of questions not attempted
  • Proportion of questions answered correctly given that a question was attempted
  • F-score: given as the harmonic mean of:
    • Proportion of questions answered correctly
    • Proportion of questions answered correctly given that a question was attempted
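
As an illustration of how these metrics fit together, here is a minimal sketch (not the benchmark's grading code) that computes them from counts of correct, incorrect, and not-attempted answers. The example counts are hypothetical, chosen to roughly reproduce the gpt-5-nano row in the results table below.

def simpleqa_metrics(correct: int, incorrect: int, not_attempted: int) -> dict[str, float]:
    # Aggregate SimpleQA-style metrics from per-question grade counts.
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    correct_overall = correct / total if total else 0.0
    correct_given_attempted = correct / attempted if attempted else 0.0
    # F-score: harmonic mean of overall accuracy and accuracy given attempted
    denominator = correct_overall + correct_given_attempted
    f_score = 2 * correct_overall * correct_given_attempted / denominator if denominator else 0.0
    return {
        "correct": correct_overall,
        "incorrect": incorrect / total if total else 0.0,
        "not_attempted": not_attempted / total if total else 0.0,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }

# Hypothetical counts out of 1,000 questions (roughly matching the results table below)
print(simpleqa_metrics(correct=108, incorrect=314, not_attempted=578))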

Results Comparison with Paper

We ran an eval on September 22, 2025 with the following command:

uv run inspect eval inspect_evals/simpleqa_verified -T max_tokens=128000 --model=openai/gpt-5-nano-2025-08-07 --max-connections=200

| Model | F-1 Score (Ours/Paper) | Accuracy (Ours/Paper) | Accuracy given Attempted (Ours/Paper) | Attempted (Ours/Paper) |
| --- | --- | --- | --- | --- |
| openai/gpt-5-nano | 15.2/14.4 | 10.8/10.2 | 25.6/24.2 | 42.2/42.2 |

Note that gpt-5-nano does not support setting the temperature, so it stays at OpenAI's default of 1. We set the maximum completion tokens to OpenAI's limit (128,000), since the paper does not specify this value. We find that with a lower max_tokens of 2048, accuracy takes a hit.