PIQA: Physical Commonsense Reasoning Test
Measures the model’s ability to apply practical, everyday commonsense reasoning about physical objects and scenarios through simple decision-making questions.
Overview
PIQA is a benchmark to measure the model’s physical commonsense reasoning.
Usage
First, install the dependencies:
uv sync
Then, evaluate against one or more models with:
uv run inspect eval inspect_evals/piqa --model openai/gpt-5-nano
You can also import tasks as normal Python objects and run them from python:
from inspect_ai import eval
from inspect_evals.piqa import piqa
eval(piqa)
After running evaluations, you can view their logs using the inspect view
command:
uv run inspect view
For VS Code, you can also download Inspect AI extension for viewing logs.
If you don’t want to specify the --model
each time you run an evaluation, create a .env
configuration file in your working directory that defines the INSPECT_EVAL_MODEL
environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/piqa --limit 10
uv run inspect eval inspect_evals/piqa --max-connections 10
uv run inspect eval inspect_evals/piqa --temperature 0.5
See uv run inspect eval --help
for all available options.
Dataset
Here is an example prompt from the dataset (after it has been further processed by Inspect):
The entire content of your response should be of the following format: ‘ANSWER:$LETTER’ (without quotes) where LETTER is one of A,B.
Given either a question or a statement followed by two possible solutions labelled A and B, choose the most appropriate solution. If a question is given, the solutions answer the question. If a statement is given, the solutions explain how to achieve the statement.
How do I ready a guinea pig cage for it’s new occupants?
- Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.
- Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish.
The model is then tasked to pick the correct choice.
Scoring
A simple accuracy is calculated over the datapoints.