FORTRESS

Safeguards

A dataset of 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains relevant to national security and public safety (NSPS).

Overview

FORTRESS is a dataset of 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains (unclassified information only): Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE), Political Violence & Terrorism, and Criminal & Financial Illicit Activities, with 10 total subcategories across these domains.

Usage

First, install the dependencies:

uv sync --extra fortress

Evaluate against one or more models with:

uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4o
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can specify grader models, solvers, and shuffling. For example

uv run inspect eval inspect_evals/fortress_benign -T shuffle=True

There are three domains and ten subdomains. The total number of prompts is 500.

Domains and Total Counts

Domain # of Prompts
CBRNE 180
Political Violence & Terrorism 132
Criminal & Financial Illicit Activities 188

You can specify a specific domain to evaluate

uv run inspect eval inspect_evals/fortress_benign -T risk_domain="Political Violence & Terrorism"
uv run inspect eval inspect_evals/fortress_adversarial -T risk_domain="CBRNE"

Subdomains and Number of Prompts

Subdomain # of Prompts
Chemical 37
Biological 30
Radiological and Nuclear (WMD) 47
Explosives 65
Terrorism 87
Political Violence 31
Illegal Weapons 9
Coordination of Illicit Activities 80
Fraud 67
Privacy/Scams 30

You can specify a specific subdomain to evaluate

uv run inspect eval inspect_evals/fortress_benign -T risk_subdomain="Illegal Weapons"
uv run inspect eval inspect_evals/fortress_adversarial -T risk_domain="Explosives"

Domain and Subdomain Mapping

The classification of subdomains under domains is provided below as a copy from the paper, but note that there exist prompts whose subdomains and domains mapping are different from that of the table below (e.g. Terrorism subdomain but CBRNE domain). For a list of such prompts, see the live huggingface query. The discussion on this issue can be found on huggingface.

Domain Subdomain
CBRNE Chemical
CBRNE Biological
CBRNE Radiological and Nuclear (WMD)
CBRNE Explosives
Political Violence & Terrorism Terrorism
Political Violence & Terrorism Political Violence
Political Violence & Terrorism Illegal Weapons
Criminal & Financial Illicit Activities Coordination of Illicit Activities
Criminal & Financial Illicit Activities Fraud
Criminal & Financial Illicit Activities Privacy/Scams

Model evaluation

We evaluated gpt-4.1-mini on the Political Violence & Terrorism subset on September 14, 2025.

Model Domain ARS (Ours) ARS (Paper) ORS (Ours) ORS (Paper)
openai/gpt-4.1-mini Political Violence & Terrorism 58.5 ± 3.6 59.40 2.3 ± 1.3 0.76

ARS: Average Risk Score (out of 100)
ORS: Over-Refusal Score (out of 100)

For the table above, we used the following commands

uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_domain="Political Violence & Terrorism" --max-connections 50
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_domain="Political Violence & Terrorism" --max-connections 50

We also ran the eval for the subdomains classified under Political Violence & Terrorism in the paper. Note that some of these prompts have a domain that is not Political Violence & Terrorism in the dataset (see the subsection on “Domain and Subdomain Mapping” above).

Subdomain ARS (Inspect) ARS (Paper) ORS (Inspect) ORS (Paper)
Terrorism 60.0 ± 4.1 59.68 0.0 ± 0.0 Not given
Political Violence 57.1 ± 7.9 63.39 0.0 ± 0.0 Not given
Illegal Weapons 66.1 ± 10.1 67.86 0.0 ± 0.0 Not given

For the table above, we used the following commands

# benign
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Terrorism" --max-connections 50
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Political Violence" --max-connections 50
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Illegal Weapons" --max-connections 50

# adversarial
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Terrorism" --max-connections 50
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Political Violence" --max-connections 50
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Illegal Weapons" --max-connections 50

Notes

As in the paper:

  • The judge models for adversarial prompts are o3, Sonnet 3.7 and Gemini 2.5 Pro as in the paper.
  • The judge model for benign prompts is GPT-4o-mini.

For adversarial prompts, the judge models can sometimes refuse to judge (notably o3 and Gemini) or provide an incorrect number of grades. When this happens, they are considered to have abstained in the majority vote. If all three judges abstain, then the response will be marked INVALID and be dropped from the dataset. We wrote a helper to find the number of judges who abstained in each sample. To use this, run the following

uv run python src/inspect_evals/fortress/cli_helper.py YOUR_EVAL_FILEPATH

# example
uv run python src/inspect_evals/cli_helper.py logs/2025-09-14T09-58-30-04-00_fortress-adversarial_k6foJLN8APxrqrhnzSpkmm.eval

All models were evaluated with an inference temperature of 0.7, except reasoning models, which were evaluated at an inference temperature of 1.0. Max completion tokens vary by model in the paper. Check out Table 9 on page 27 of the paper for model-specific parameters for temperature and max completion tokens.

If “the user specify their own grader models and the number of grader models are even”, or “in cases where a sample only has an even number of valid judge statements”, there’s a possibilty of a tie. In case of a tie, we will default to a “yes” vote (the model is judged to be dangerous in that criterion) so we err on the side on caution.

The judge prompts are not specified in the paper. For the judge prompt used in Inspect, see scorer.py.