Hangman Bench

Overview

⚠️ External evaluation. Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.

Source: MattFisher/hangman-bench@9f1f396 · Listed by @Scott-Simmons

A benchmark for testing AI models’ ability to play the classic game of Hangman. Built as a demonstration of how to enable models to play games using tools within the Inspect framework.
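
The game is exposed to the model through Inspect tool calls. As a rough illustration of the pattern only (the tool name, signature, and return value here are hypothetical, not the benchmark's actual interface), a letter-guessing tool in Inspect might look like this:

from inspect_ai.tool import tool

@tool
def guess_letter():
    async def execute(letter: str) -> str:
        """Guess a single letter in the hidden word.

        Args:
            letter: The letter to guess.

        Returns:
            The word with correctly guessed letters revealed and the
            rest masked, plus the number of incorrect guesses remaining.
        """
        # The real benchmark tracks the hidden word and guess history in
        # its own game state; this sketch elides that logic.
        return "_ A _ _ M A _ (5 incorrect guesses remaining)"

    return execute

Tools like this are typically made available to the model with Inspect's use_tools() solver; see the upstream repository for the benchmark's actual tool and task definitions.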

Usage

Installation

This is an externally-maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:

git clone https://github.com/MattFisher/hangman-bench
cd hangman-bench
git checkout 9f1f396e68191eb5ec99dbffbc6fea63609bd0b6
uv sync

Running evaluations

CLI

uv run inspect eval src/hangman_bench/hangman.py@hangman --model openai/gpt-5-nano

Python

from inspect_ai import eval
from hangman_bench.hangman import hangman

eval(hangman(), model="openai/gpt-5-nano")

View logs

uv run inspect view

More information

For the dataset, scorer, task parameters, and validation, see the upstream repo: MattFisher/hangman-bench.

Options

You can control a variety of options from the command line. For example:

uv run inspect eval src/hangman_bench/hangman.py@hangman --limit 10
uv run inspect eval src/hangman_bench/hangman.py@hangman --max-connections 10
uv run inspect eval src/hangman_bench/hangman.py@hangman --temperature 0.5

See uv run inspect eval --help for all available options.

More command-line options: Inspect docs ↗
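
The same options can also be passed through the Python API. A minimal sketch, assuming the standard inspect_ai eval() keyword arguments (limit, max_connections, temperature):

from inspect_ai import eval
from hangman_bench.hangman import hangman

# Equivalent to the CLI flags above: cap the number of samples,
# limit concurrent model connections, and set sampling temperature.
eval(
    hangman(),
    model="openai/gpt-5-nano",
    limit=10,
    max_connections=10,
    temperature=0.5,
)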

Evaluation Report

Timestamp: April 2026

Commit: 9f1f396

uv run inspect eval src/hangman_bench/hangman.py@hangman --model openai/gpt-5-nano --limit 100
Model               Provider   Accuracy   Stderr   Time
openai/gpt-5-nano   OpenAI     0.930      0.026    12m 17s

Notes:

  • Run on 100 samples, 1 epoch.
  • The headline accuracy is the aggregate score from the eval’s grouped game scorer. Per-difficulty breakdown: v_easy, easy, medium, and hard each scored 0.95; v_hard scored 0.85.
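  • Assuming the five difficulty groups are equally sized, the headline score is consistent with an unweighted mean of the per-difficulty scores: (4 × 0.95 + 0.85) / 5 = 0.93.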