AHB: Adversarial Humanities Benchmark

Safeguards

ahb

safety

text-generation

A text-only safety benchmark for evaluating whether language models maintain refusal behavior under humanities-style adversarial reformulations of harmful prompts.

Maintained By

@emmepra

Code

ahb

Paper

https://arxiv.org/abs/2604.18487

Overview

⚠️ External evaluation. Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.

Source: icaro-lab/ahb@4b7b631 · Listed by @emmepra

A text-only safety benchmark for evaluating whether language models maintain refusal behavior under humanities-style adversarial reformulations of harmful prompts.

Usage

Installation

This is an externally-maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:

git clone https://github.com/icaro-lab/ahb
cd ahb
git checkout 4b7b631245fa300df98d2c310e83273ed0d4a207
uv sync

Running evaluations

CLI

uv run inspect eval src/ahb_inspect/tasks.py@ahb --model openai/gpt-5-nano

Python

from inspect_ai import eval
from ahb_inspect.tasks import ahb

eval(ahb(), model="openai/gpt-5-nano")

View logs

uv run inspect view

More information

For the dataset, scorer, task parameters, and validation, see the upstream repo: icaro-lab/ahb.

Options

You can control a variety of options from the command line. For example:

uv run inspect eval src/ahb_inspect/tasks.py@ahb --limit 10
uv run inspect eval src/ahb_inspect/tasks.py@ahb --max-connections 10
uv run inspect eval src/ahb_inspect/tasks.py@ahb --temperature 0.5

See uv run inspect eval --help for all available options.

More command-line options: Inspect docs ↗