FORTRESS
A dataset of 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains relevant to national security and public safety (NSPS).
Overview
FORTRESS is a dataset of 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains (unclassified information only): Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE), Political Violence & Terrorism, and Criminal & Financial Illicit Activities, with 10 total subcategories across these domains.
Usage
First, install the dependencies:
uv sync --extra fortress
Evaluate against one or more models with:
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4o
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4o
After running evaluations, you can view their logs using the inspect view
command:
uv run inspect view
If you don’t want to specify the --model
each time you run an evaluation, create a .env
configuration file in your working directory that defines the INSPECT_EVAL_MODEL
environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can specify grader models, solvers, and shuffling. For example
uv run inspect eval inspect_evals/fortress_benign -T shuffle=True
There are three domains and ten subdomains. The total number of prompts is 500.
Domains and Total Counts
Domain | # of Prompts |
---|---|
CBRNE |
180 |
Political Violence & Terrorism |
132 |
Criminal & Financial Illicit Activities |
188 |
You can specify a specific domain to evaluate
uv run inspect eval inspect_evals/fortress_benign -T risk_domain="Political Violence & Terrorism"
uv run inspect eval inspect_evals/fortress_adversarial -T risk_domain="CBRNE"
Subdomains and Number of Prompts
Subdomain | # of Prompts |
---|---|
Chemical |
37 |
Biological |
30 |
Radiological and Nuclear (WMD) |
47 |
Explosives |
65 |
Terrorism |
87 |
Political Violence |
31 |
Illegal Weapons |
9 |
Coordination of Illicit Activities |
80 |
Fraud |
67 |
Privacy/Scams |
30 |
You can specify a specific subdomain to evaluate
uv run inspect eval inspect_evals/fortress_benign -T risk_subdomain="Illegal Weapons"
uv run inspect eval inspect_evals/fortress_adversarial -T risk_domain="Explosives"
Domain and Subdomain Mapping
The classification of subdomains under domains is provided below as a copy from the paper, but note that there exist prompts whose subdomains and domains mapping are different from that of the table below (e.g. Terrorism
subdomain but CBRNE
domain). For a list of such prompts, see the live huggingface query. The discussion on this issue can be found on huggingface.
Domain | Subdomain |
---|---|
CBRNE |
Chemical |
CBRNE |
Biological |
CBRNE |
Radiological and Nuclear (WMD) |
CBRNE |
Explosives |
Political Violence & Terrorism |
Terrorism |
Political Violence & Terrorism |
Political Violence |
Political Violence & Terrorism |
Illegal Weapons |
Criminal & Financial Illicit Activities |
Coordination of Illicit Activities |
Criminal & Financial Illicit Activities |
Fraud |
Criminal & Financial Illicit Activities |
Privacy/Scams |
Model evaluation
We evaluated gpt-4.1-mini on the Political Violence & Terrorism
subset on September 14, 2025.
Model | Domain | ARS (Ours) | ARS (Paper) | ORS (Ours) | ORS (Paper) |
---|---|---|---|---|---|
openai/gpt-4.1-mini | Political Violence & Terrorism | 58.5 ± 3.6 | 59.40 | 2.3 ± 1.3 | 0.76 |
ARS: Average Risk Score (out of 100)
ORS: Over-Refusal Score (out of 100)
For the table above, we used the following commands
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_domain="Political Violence & Terrorism" --max-connections 50
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_domain="Political Violence & Terrorism" --max-connections 50
We also ran the eval for the subdomains classified under Political Violence & Terrorism
in the paper. Note that some of these prompts have a domain that is not Political Violence & Terrorism
in the dataset (see the subsection on “Domain and Subdomain Mapping” above).
Subdomain | ARS (Inspect) | ARS (Paper) | ORS (Inspect) | ORS (Paper) |
---|---|---|---|---|
Terrorism | 60.0 ± 4.1 | 59.68 | 0.0 ± 0.0 | Not given |
Political Violence | 57.1 ± 7.9 | 63.39 | 0.0 ± 0.0 | Not given |
Illegal Weapons | 66.1 ± 10.1 | 67.86 | 0.0 ± 0.0 | Not given |
For the table above, we used the following commands
# benign
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Terrorism" --max-connections 50
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Political Violence" --max-connections 50
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Illegal Weapons" --max-connections 50
# adversarial
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Terrorism" --max-connections 50
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Political Violence" --max-connections 50
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Illegal Weapons" --max-connections 50
Notes
As in the paper:
- The judge models for adversarial prompts are o3, Sonnet 3.7 and Gemini 2.5 Pro as in the paper.
- The judge model for benign prompts is GPT-4o-mini.
For adversarial prompts, the judge models can sometimes refuse to judge (notably o3 and Gemini) or provide an incorrect number of grades. When this happens, they are considered to have abstained in the majority vote. If all three judges abstain, then the response will be marked INVALID
and be dropped from the dataset. We wrote a helper to find the number of judges who abstained in each sample. To use this, run the following
uv run python src/inspect_evals/fortress/cli_helper.py YOUR_EVAL_FILEPATH
# example
uv run python src/inspect_evals/cli_helper.py logs/2025-09-14T09-58-30-04-00_fortress-adversarial_k6foJLN8APxrqrhnzSpkmm.eval
All models were evaluated with an inference temperature of 0.7, except reasoning models, which were evaluated at an inference temperature of 1.0. Max completion tokens vary by model in the paper. Check out Table 9 on page 27 of the paper for model-specific parameters for temperature and max completion tokens.
If “the user specify their own grader models and the number of grader models are even”, or “in cases where a sample only has an even number of valid judge statements”, there’s a possibilty of a tie. In case of a tie, we will default to a “yes” vote (the model is judged to be dangerous in that criterion) so we err on the side on caution.
The judge prompts are not specified in the paper. For the judge prompt used in Inspect, see scorer.py
.