ARC: AI2 Reasoning Challenge
Dataset of natural, grade-school science multiple-choice questions (authored for human tests).
Overview
ARC is a benchmark that uses natural science questions to evaluate a model’s knowledge and reasoning capabilities. The dataset ships with Easy and Challenge sets.
Usage
First, install the dependencies:
uv sync
Then, evaluate against one or more models with:
uv run inspect eval inspect_evals/arc_easy --model openai/gpt-5-nano
uv run inspect eval inspect_evals/arc_challenge --model openai/gpt-5-nano
To run multiple tasks simultaneously, use inspect eval-set:
uv run inspect eval-set inspect_evals/arc_easy inspect_evals/arc_challenge
You can also import tasks as normal Python objects and run them from Python:
from inspect_ai import eval, eval_set
from inspect_evals.arc import arc_easy, arc_challenge
eval(arc_easy)
eval_set([arc_easy, arc_challenge], log_dir='logs-run-42')
After running evaluations, you can view their logs using the inspect view command:
uv run inspect view
For VS Code, you can also download the Inspect AI extension for viewing logs.
If you don’t want to specify --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/arc_easy --limit 10
uv run inspect eval inspect_evals/arc_challenge --max-connections 10
uv run inspect eval inspect_evals/arc_easy --temperature 0.5
See uv run inspect eval --help for all available options.
Dataset
The ARC dataset consists of 7,787 genuine grade-school level, multiple-choice science questions. Here is an example prompt (after being further processed by Inspect):
Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D.
An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?
- Planetary density will decrease.
- Planetary years will become longer.
- Planetary days will become shorter.
- Planetary gravity will become stronger.
The model is then tasked to pick the correct choice.
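To make the prompt format concrete, here is a minimal sketch of how such a prompt could be assembled from a question and its choices. It is an illustration only, not Inspect's actual implementation: the build_prompt helper, the {letters}/{question}/{choices} placeholders, and the "A) " choice labelling are all assumptions made for this example.

```python
from string import ascii_uppercase
from typing import List

# Instruction text copied from the example prompt above. The placeholders
# ({letters}, {question}, {choices}) are assumptions of this sketch.
TEMPLATE = (
    "Answer the following multiple choice question. The entire content of "
    "your response should be of the following format: 'ANSWER: $LETTER' "
    "(without quotes) where LETTER is one of {letters}.\n\n"
    "{question}\n\n"
    "{choices}"
)


def build_prompt(question: str, choices: List[str]) -> str:
    """Hypothetical helper: label each choice with a letter and fill the template."""
    letters = ascii_uppercase[: len(choices)]
    # Assumed labelling style: "A) first choice", "B) second choice", ...
    lettered = "\n".join(f"{letter}) {choice}" for letter, choice in zip(letters, choices))
    return TEMPLATE.format(
        letters=",".join(letters),
        question=question,
        choices=lettered,
    )


if __name__ == "__main__":
    print(
        build_prompt(
            "An astronomer observes that a planet rotates faster after a meteorite "
            "impact. Which is the most likely effect of this increase in rotation?",
            [
                "Planetary density will decrease.",
                "Planetary years will become longer.",
                "Planetary days will become shorter.",
                "Planetary gravity will become stronger.",
            ],
        )
    )
```

Deriving the letter list from the number of choices keeps the instruction line consistent with questions that have fewer or more than four options.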
Scoring
Simple accuracy is calculated over the datapoints: the fraction of questions for which the model picks the correct choice.
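As a rough illustration of the metric, the sketch below extracts the chosen letter from each completion and computes accuracy against the target letters. The extract_answer and accuracy helpers and the regular expression are hypothetical and written for this example; they are not the scorer that Inspect uses.

```python
import re
from typing import List, Optional

# Assumed answer pattern, based on the 'ANSWER: $LETTER' format requested in the prompt.
ANSWER_RE = re.compile(r"ANSWER\s*:\s*([A-Za-z])\b")


def extract_answer(completion: str) -> Optional[str]:
    """Return the upper-cased answer letter, or None if no answer line is found."""
    match = ANSWER_RE.search(completion)
    return match.group(1).upper() if match else None


def accuracy(completions: List[str], targets: List[str]) -> float:
    """Fraction of completions whose extracted letter equals the target letter."""
    if len(completions) != len(targets):
        raise ValueError("completions and targets must have the same length")
    if not targets:
        return 0.0
    # A completion with no parseable answer yields None and counts as incorrect.
    correct = sum(extract_answer(c) == t for c, t in zip(completions, targets))
    return correct / len(targets)


if __name__ == "__main__":
    outputs = ["ANSWER: C", "ANSWER: A", "I am not sure."]
    gold = ["C", "B", "D"]
    print(accuracy(outputs, gold))
```

In this sketch, unparseable completions are treated as wrong rather than skipped, so the denominator is always the full number of questions.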