OR-Bench Hard-1K (subset of OR-Bench: Over-Refusal Benchmark)

OR-Bench Hard-1K measures over-refusal in LLMs: the tendency to reject safe prompts that superficially resemble harmful ones. The 1K subset contains prompts rejected by at least three frontier models, which makes it the most challenging split of the 80K dataset. Prompts were generated by using Mixtral to rewrite toxic seeds into benign but borderline variants, then filtered by a moderator built from an ensemble of LLMs. An LLM judge classifies each response as direct_answer, direct_refusal, or indirect_refusal; the reported metric is the rejection rate (lower is better).
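To make the metric concrete, here is a minimal sketch of how a rejection rate could be computed from judge labels. It assumes that both direct_refusal and indirect_refusal count as rejections; the function name `rejection_rate` and the sample labels are hypothetical and are not taken from the upstream scorer, which may aggregate differently.

```python
from collections import Counter

# Judge labels named in the description above.
# Assumption: both refusal types count toward the rejection rate.
REFUSAL_LABELS = {"direct_refusal", "indirect_refusal"}
VALID_LABELS = REFUSAL_LABELS | {"direct_answer"}


def rejection_rate(labels: list[str]) -> float:
    """Return the fraction of judged responses labelled as a refusal."""
    if not labels:
        raise ValueError("need at least one judged response")
    unknown = set(labels) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    refused = sum(counts[label] for label in REFUSAL_LABELS)
    return refused / len(labels)


# Hypothetical judge output for four prompts: two answers, two refusals.
judged = ["direct_answer", "direct_refusal", "indirect_refusal", "direct_answer"]
print(rejection_rate(judged))  # 0.5
```

Because every prompt in the benchmark is meant to be safe, a lower value indicates less over-refusal.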

Maintained By

@haeliotang

Code

or_bench

Paper

https://arxiv.org/abs/2405.20947

Overview

⚠️ External evaluation. Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.

Source: haeliotang/inspect-evals-orbench@8757ec4

Usage

Installation

This is an externally-maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:

git clone https://github.com/haeliotang/inspect-evals-orbench
cd inspect-evals-orbench
git checkout 8757ec41608f3930e3c2bc4d5619d09a2381727a
uv sync

Running evaluations

CLI

uv run inspect eval src/or_bench/or_bench.py@or_bench_hard_1k --model openai/gpt-5-nano

Python

from inspect_ai import eval
from or_bench.or_bench import or_bench_hard_1k

eval(or_bench_hard_1k(), model="openai/gpt-5-nano")

View logs

uv run inspect view

More information

For the dataset, scorer, task parameters, and validation, see the upstream repo: haeliotang/inspect-evals-orbench.

Options

You can control a variety of options from the command line. For example:

uv run inspect eval src/or_bench/or_bench.py@or_bench_hard_1k --limit 10
uv run inspect eval src/or_bench/or_bench.py@or_bench_hard_1k --max-connections 10
uv run inspect eval src/or_bench/or_bench.py@or_bench_hard_1k --temperature 0.5

See uv run inspect eval --help for all available options.
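The same options can likely be set from Python. The sketch below assumes that `inspect_ai.eval` accepts `limit`, `max_connections`, and `temperature` as keyword arguments mirroring the CLI flags; the names `EVAL_OPTIONS` and `run_or_bench` are hypothetical helpers introduced here for illustration. The imports are deferred into the function so the module loads even when the upstream repository is not installed.

```python
# Keyword arguments mirroring the CLI flags shown above (values are examples).
EVAL_OPTIONS = {
    "limit": 10,            # evaluate only the first 10 samples
    "max_connections": 10,  # cap concurrent connections to the model API
    "temperature": 0.5,     # sampling temperature for generation
}


def run_or_bench(model: str = "openai/gpt-5-nano", **overrides):
    """Run the Hard-1K task with EVAL_OPTIONS, letting callers override any value."""
    # Deferred imports: both require the upstream repo's environment (uv sync).
    from inspect_ai import eval
    from or_bench.or_bench import or_bench_hard_1k

    options = {**EVAL_OPTIONS, **overrides}
    return eval(or_bench_hard_1k(), model=model, **options)
```

Calling `run_or_bench(limit=50)` inside the synced environment would start a run with the larger sample limit while keeping the other defaults.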

For more command-line options, see the Inspect docs.