HellaSwag: Commonsense Event Continuation
Tests models’ commonsense reasoning abilities by asking them to select the most likely next step or continuation for a given everyday situation.
Overview
HellaSwag is a dataset for commonsense inference. The model is given a context sentence and must pick, from several candidate sentences, the one that best continues it.
Usage
First, install the dependencies:
uv sync
Then, evaluate against one or more models with:
uv run inspect eval inspect_evals/hellaswag --model openai/gpt-5-nano
You can also import tasks as normal Python objects and run them from Python:
from inspect_ai import eval
from inspect_evals.hellaswag import hellaswag
eval(hellaswag)
After running evaluations, you can view their logs using the inspect view command:
uv run inspect view
For VS Code, you can also download the Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/hellaswag --limit 10
uv run inspect eval inspect_evals/hellaswag --max-connections 10
uv run inspect eval inspect_evals/hellaswag --temperature 0.5
See uv run inspect eval --help for all available options.
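If you script several runs with different options, it can help to assemble the command line programmatically. The following is a minimal sketch using only the standard library; `build_eval_command` is a hypothetical helper, not part of Inspect, and it only assumes that each option is passed as `--name value`, as in the examples above.

```python
import shlex


def build_eval_command(task, **options):
    """Build the argument list for `uv run inspect eval` (hypothetical helper)."""
    args = ["uv", "run", "inspect", "eval", task]
    for name, value in options.items():
        # Python keyword max_connections becomes the flag --max-connections.
        args += ["--" + name.replace("_", "-"), str(value)]
    return args


command = build_eval_command("inspect_evals/hellaswag", limit=10, temperature=0.5)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command)` if you want to launch the evaluation from a script.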
Dataset
Here is an example prompt from the dataset (after it has been further processed by Inspect):
Choose the most plausible continuation for the story.
Answer the following multiple choice question. The entire content of your response should be of the following format: ‘ANSWER: $LETTER’ (without quotes) where LETTER is one of A,B,C,D.
A man is sitting on a roof. he
A) is using wrap to wrap a pair of skis.
B) is ripping level tiles off.
C) is holding a rubik’s cube.
D) starts pulling up roofing on a roof.
The model is then expected to generate reasoning steps and provide a final answer.
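To make the prompt layout concrete, here is a rough sketch of how a context and its candidate endings could be assembled into the text shown above. It is an illustration only: `format_prompt` is a hypothetical function, the `A)` labelling style is an assumption, and the template Inspect actually applies may differ in its details.

```python
from string import ascii_uppercase

INSTRUCTION = "Choose the most plausible continuation for the story."
ANSWER_FORMAT = (
    "Answer the following multiple choice question. The entire content of "
    "your response should be of the following format: 'ANSWER: $LETTER' "
    "(without quotes) where LETTER is one of {letters}."
)


def format_prompt(context, choices):
    """Join the instructions, the context, and lettered choices (hypothetical)."""
    letters = ascii_uppercase[: len(choices)]
    lines = [INSTRUCTION, ANSWER_FORMAT.format(letters=",".join(letters)), context]
    # Label each candidate ending with its letter, e.g. "A) ...".
    lines += [f"{letter}) {choice}" for letter, choice in zip(letters, choices)]
    return "\n".join(lines)


prompt = format_prompt(
    "A man is sitting on a roof. he",
    [
        "is using wrap to wrap a pair of skis.",
        "is ripping level tiles off.",
        "is holding a rubik's cube.",
        "starts pulling up roofing on a roof.",
    ],
)
print(prompt)
```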
Scoring
Accuracy is calculated over all datapoints: the fraction of prompts for which the model selects the correct continuation.
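As an illustration of the metric, the sketch below extracts the letter from responses in the 'ANSWER: $LETTER' format and computes the fraction that match the target letters. The function names and the regular expression are assumptions made for this example; Inspect's own scorer may parse answers differently.

```python
import re

# Matches the answer letter in a response such as "ANSWER: D".
ANSWER_PATTERN = re.compile(r"ANSWER:\s*([A-D])")


def extract_answer(response):
    """Return the chosen letter, or None if no answer line is found."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1) if match else None


def accuracy(responses, targets):
    """Fraction of responses whose extracted letter equals the target letter."""
    if not targets:
        return 0.0
    correct = sum(extract_answer(r) == t for r, t in zip(responses, targets))
    return correct / len(targets)


# Made-up responses and targets, used only to exercise the functions.
responses = ["ANSWER: D", "Some reasoning first.\nANSWER: B", "no answer", "ANSWER: A"]
targets = ["D", "A", "C", "A"]
score = accuracy(responses, targets)
print(score)
```

A response with no parsable answer counts as incorrect here, which is one reasonable convention among several.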