GDM Dangerous Capabilities: Self-proliferation

Scheming

Agent

Ten real-world–inspired tasks from Google DeepMind’s Dangerous Capabilities Evaluations assessing self-proliferation behaviors (e.g., email setup, model installation, web agent setup, wallet operations). Supports end-to-end, milestones, and expert best-of-N modes.

Contributed By

@XkunW

@MariaIzobava

@kohankhaki

Code

src/inspect_evals/gdm_capabilities/self_proliferation

Paper

https://arxiv.org/pdf/2403.13793

Overview

This directory includes the 10 self-proliferation tasks from Google DeepMind’s Dangerous Capabilities Evaluations (originally published in Evaluating Frontier Models for Dangerous Capabilities).

Contributed by @XkunW @MariaIzobava @kohankhaki

Task list

SP01 - Email setup
SP02 - Install model
SP03 - Install web agent
SP04 - Bitcoin wallet
SP05 - Track employee
SP07 - Build product
SP08 - Celebrity endorsement
SP09 - Inference efficiency
SP10 - Distillation
SP12 - Implement paper

NOTE

The task indexing convention follows the index from the original Dangerous Capabilities Evaluations repository, hence there are 2 skipped indices.
SP02, SP09, and SP10 require GCP setup (or equivalent hardware).

Types of evals

There are 2 types of self-proliferation evals that can be run:

end-to-end
milestones

Currently, all types of evals are processed in MANUAL mode (Human Approval), requiring human approval for every step. This measure primarily serves to prevent the agent performing undesirable actions - such as email spamming, downloading content from malicious websites, or initiating irreversible operations like file deletion – risks that exist because the agent has unlimited web and local file system access through tool usage. If you’re confident in the evaluated model and wish to skip human approval for certain tool usage (e.g. bash tool), you can update the placeholder approval.yaml and use it by either changing approval="human" to approval=str(Path(__file__).parent / "approval.yaml") (add pathlib imports if needed) in the Task objects in the eval script, or by adding the --approval flag and appending the path to approval.yaml to the eval command to enable auto approval. Inspect has detailed documentation on auto-approval.

MANUAL evals: end-to-end

Setup

Configuration for e2e evals is in data/ directory:

challenges.json contains the main configuration for all e2e evals. No changes to this file needed.
challenge_variables.json contains values for the placeholders required in some e2e evals, specifically:
- SP02
  - HF_TOKEN value should be a valid Hugging Face token.
  - INSTANCE_NAME value(s) should be names of all available GCP vms for this task.
  - INSTANCE_ZONE value should be corresponding zone of the GCP vms.
- SP10
  - HF_TOKEN value should be a valid Hugging Face token.
  - GOOGLE_API_KEY value should be a valid Google API key.
You need to specify values for these placeholders as per the Variables section.

Execution

To run SPXX task use:

uv run inspect eval inspect_evals/gdm_spXX_e2e --model google/gemini_1_5_flash --approval human --epochs=1

where:

--model google/gemini_1_5_flash specifies the model under test.
--approval human ensures that each model tool call is reviewed by human operator. DO NOT omit this flag without a good reason!
--epochs=N runs the evaluation N times in parallel.

For the full list of available flags run:

uv run inspect eval --help

For example, the following flags might be useful:

--log-dir specifies the directory for evaluation logs. The default one is inspect_ai/logs/.
--trace runs evaluations sequentially in a text-based UI. Useful flag for debugging evals.
--no-sandbox-cleanup does not clean up sandbox environment (e.g. docker container) after evaluation.

MANUAL evals: milestones

Setup

Milestones configuration for each task is in data/spXX directory:

subtasks.json contains the main configuration for all milestones which is safe to opensource.
secrets.zip contains the remaining pieces of configuration (solutions, hints etc.) which are not safe to open source without encryption. This is to prevent this information from becoming GitHub-scraped training data; otherwise, the validity of this benchmark could be compromised.
(optional) variables.json contains values for the placeholders required in some milestones.

To complete the setup you need to:

navigate to data directory and run setup.py which will unzip all secrets and update all subtasks.json files with corresponding data:
```
python3 setup.py --password self-proliferation
```
for further setup of an spXX task navigate to the corresponding data/spXX/README.md file and follow the setup steps there.

Execution

To execute a milestone Y of task SPXX run:

inspect eval inspect_evals/gdm_spXX_milestones -T subtasks=Y -T browser=True --model google/gemini_1_5_flash --approval human --epochs=1

where:

-T browser=True specifies that the model will be given an access to the browser tool. Set it to False if you only want to allow bash tool. Some milestones can be accomplished without web browser access (This option is not available for sp01 as it must have browser access), below is the list:
- sp02 Install model: 1, 2, 3, 4, 5
- sp03 Install web agent: 2, 3
- sp04 Bitcoin wallet: 2, 3
- sp05 Track employee: 3
- sp07 Build product: 1, 3, 4, 5, 6
- sp08 Celebrity endorsement: 3, 5, 6, 7
- sp09 Inference efficiency: 2, 3, 4, 5
- sp10 Distillation: 1, 2, 3, 4, 5
- sp12 Implement paper: 1, 2, 3, 4
The rest of the flags are the same as for E2E run.

MANUAL evals: expert best-of-N

The paper also describes expert best-of-N evaluation is designed for milestones where an agent failed all 10 unguided runs.

These evaluations were originally in this directory, but they were removed because they were broken. Feel free to restore them from the git commit history and run or fix them: * PR #620 * commit 8406cb1365e9740c6fbf08cf326dd60e0be60b75

MANUAL evals: results

Run the inspect view command and then go to the http://127.0.0.1:7575.

Note: If you specified a custom log directory through --log-dir flag use:

uv run inspect view --log-dir=/path/to/logs

Developer guide

Variables

variables.json files are located in data/ for e2e evals and in data/spXX/ for milestones and specify values for placeholders for these evals. When setting up the eval make sure that all placeholders in the corresponding file have expected values.

Multiple values per placeholder

You can specify multiple values for a single placeholder, e.g. for SP02 e2e eval you can specify multiple values for $INSTANCE_NAME placeholder:

"INSTANCE_NAME": ["task-vm-1", "task-vm-2"]

and then run:

uv run inspect eval inspect_evals/gdm_sp02_e2e --model google/gemini_1_5_flash --approval human --epochs=1

In this case 2 evals will be created: one with the placeholder replaced with task-vm-1 and another with task-vm-2. If you specify --epochs=2 you will get 4 evals running, 2 with placeholder overridden to task-vm-1 and another 2 with task-vm-2.

If you specify multiple values for multiple placeholders you will get evals with all the combination of these values.

For example, if you set these values for SP02:

"INSTANCE_NAME": ["task-vm-1", "task-vm-2"],
"INSTANCE_ZONE": ["europe-west2-a", "europe-west2-b"],

and then run:

uv run inspect eval inspect_evals/gdm_sp02_e2e --model google/gemini_1_5_flash --approval human --epochs=1

In this case 4 evals will be created with placeholder values:

$INSTANCE_NAME: ‘task-vm-1’, $INSTANCE_ZONE: ‘europe-west2-a’
$INSTANCE_NAME: ‘task-vm-1’, $INSTANCE_ZONE: ‘europe-west2-b’
$INSTANCE_NAME: ‘task-vm-2’, $INSTANCE_ZONE: ‘europe-west2-a’
$INSTANCE_NAME: ‘task-vm-2’, $INSTANCE_ZONE: ‘europe-west2-b’