FORTRESS
A dataset of 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains relevant to national security and public safety (NSPS).
Overview
FORTRESS is a dataset of 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains (unclassified information only): Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE), Political Violence & Terrorism, and Criminal & Financial Illicit Activities, with 10 total subcategories across these domains.
Usage
First, install the dependencies:
uv sync --extra fortressEvaluate against one or more models with:
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4o
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4oAfter running evaluations, you can view their logs using the inspect view command:
uv run inspect viewIf you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>Options
You can specify grader models, solvers, and shuffling. For example
uv run inspect eval inspect_evals/fortress_benign -T shuffle=TrueThere are three domains and ten subdomains. The total number of prompts is 500.
Domains and Total Counts
| Domain | # of Prompts |
|---|---|
CBRNE |
180 |
Political Violence & Terrorism |
132 |
Criminal & Financial Illicit Activities |
188 |
You can specify a specific domain to evaluate
uv run inspect eval inspect_evals/fortress_benign -T risk_domain="Political Violence & Terrorism"
uv run inspect eval inspect_evals/fortress_adversarial -T risk_domain="CBRNE"Subdomains and Number of Prompts
| Subdomain | # of Prompts |
|---|---|
Chemical |
37 |
Biological |
30 |
Radiological and Nuclear (WMD) |
47 |
Explosives |
65 |
Terrorism |
87 |
Political Violence |
31 |
Illegal Weapons |
9 |
Coordination of Illicit Activities |
80 |
Fraud |
67 |
Privacy/Scams |
30 |
You can specify a specific subdomain to evaluate
uv run inspect eval inspect_evals/fortress_benign -T risk_subdomain="Illegal Weapons"
uv run inspect eval inspect_evals/fortress_adversarial -T risk_domain="Explosives"Domain and Subdomain Mapping
The classification of subdomains under domains is provided below as a copy from the paper, but note that there exist prompts whose subdomains and domains mapping are different from that of the table below (e.g. Terrorism subdomain but CBRNE domain). For a list of such prompts, see the live huggingface query. The discussion on this issue can be found on huggingface.
| Domain | Subdomain |
|---|---|
CBRNE |
Chemical |
CBRNE |
Biological |
CBRNE |
Radiological and Nuclear (WMD) |
CBRNE |
Explosives |
Political Violence & Terrorism |
Terrorism |
Political Violence & Terrorism |
Political Violence |
Political Violence & Terrorism |
Illegal Weapons |
Criminal & Financial Illicit Activities |
Coordination of Illicit Activities |
Criminal & Financial Illicit Activities |
Fraud |
Criminal & Financial Illicit Activities |
Privacy/Scams |
Model evaluation
We evaluated gpt-4.1-mini on the Political Violence & Terrorism subset on September 14, 2025.
| Model | Domain | ARS (Ours) | ARS (Paper) | ORS (Ours) | ORS (Paper) |
|---|---|---|---|---|---|
| openai/gpt-4.1-mini | Political Violence & Terrorism | 58.5 ± 3.6 | 59.40 | 2.3 ± 1.3 | 0.76 |
ARS: Average Risk Score (out of 100)
ORS: Over-Refusal Score (out of 100)
For the table above, we used the following commands
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_domain="Political Violence & Terrorism" --max-connections 50
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_domain="Political Violence & Terrorism" --max-connections 50We also ran the eval for the subdomains classified under Political Violence & Terrorism in the paper. Note that some of these prompts have a domain that is not Political Violence & Terrorism in the dataset (see the subsection on “Domain and Subdomain Mapping” above).
| Subdomain | ARS (Inspect) | ARS (Paper) | ORS (Inspect) | ORS (Paper) |
|---|---|---|---|---|
| Terrorism | 60.0 ± 4.1 | 59.68 | 0.0 ± 0.0 | Not given |
| Political Violence | 57.1 ± 7.9 | 63.39 | 0.0 ± 0.0 | Not given |
| Illegal Weapons | 66.1 ± 10.1 | 67.86 | 0.0 ± 0.0 | Not given |
For the table above, we used the following commands
# benign
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Terrorism" --max-connections 50
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Political Violence" --max-connections 50
uv run inspect eval inspect_evals/fortress_benign --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Illegal Weapons" --max-connections 50
# adversarial
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Terrorism" --max-connections 50
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Political Violence" --max-connections 50
uv run inspect eval inspect_evals/fortress_adversarial --model openai/gpt-4.1-mini --temperature 0.7 --max-tokens 8192 -T risk_subdomain="Illegal Weapons" --max-connections 50Notes
As in the paper:
- The judge models for adversarial prompts are o3, Sonnet 3.7 and Gemini 2.5 Pro as in the paper.
- The judge model for benign prompts is GPT-4o-mini.
For adversarial prompts, the judge models can sometimes refuse to judge (notably o3 and Gemini) or provide an incorrect number of grades. When this happens, they are considered to have abstained in the majority vote. If all three judges abstain, then the response will be marked INVALID and be dropped from the dataset. We wrote a helper to find the number of judges who abstained in each sample. To use this, run the following
uv run python src/inspect_evals/fortress/cli_helper.py YOUR_EVAL_FILEPATH
# example
uv run python src/inspect_evals/cli_helper.py logs/2025-09-14T09-58-30-04-00_fortress-adversarial_k6foJLN8APxrqrhnzSpkmm.evalAll models were evaluated with an inference temperature of 0.7, except reasoning models, which were evaluated at an inference temperature of 1.0. Max completion tokens vary by model in the paper. Check out Table 9 on page 27 of the paper for model-specific parameters for temperature and max completion tokens.
If “the user specify their own grader models and the number of grader models are even”, or “in cases where a sample only has an even number of valid judge statements”, there’s a possibilty of a tie. In case of a tie, we will default to a “yes” vote (the model is judged to be dangerous in that criterion) so we err on the side on caution.
The judge prompts are not specified in the paper. For the judge prompt used in Inspect, see scorer.py.