Agentic Misalignment: How LLMs could be insider threats

Scheming

Eliciting unethical behaviour (most famously blackmail) in response to a fictional company-assistant scenario where the model is faced with replacement.

Overview

This is a port of Anthropic’s Agentic Misalignment framework (blog post, github) to Inspect. It’s a research framework that uses fictional scenarios to study the potential for agentic misalignment behaviors, such as blackmail and information leakage, in frontier language models.

The original work and all modifications to it are released under the MIT license. The git history of the import has been structured so that it should be relatively straightforward to work out which code comes from Aengus Lynch’s original implementation and which was added as part of the port.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Or, if developing on a clone of the inspect_evals repo, you can install the package in editable mode with:

pip install -e ".[dev]"

Then, evaluate against one or more models with:

inspect eval inspect_evals/agentic_misalignment --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
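
With this in place, the evaluation can be run without the --model flag:

inspect eval inspect_evals/agentic_misalignment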

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/agentic_misalignment --limit 10
inspect eval inspect_evals/agentic_misalignment --max-connections 10
inspect eval inspect_evals/agentic_misalignment --temperature 0.5

See inspect eval --help for all available options.

The evaluation has many variants, controlled by task parameters passed with the -T command-line option:

inspect eval inspect_evals/agentic_misalignment -T scenario=leaking -T goal_value=none

The different parameters and their interactions are somewhat subtle, and the best way to comprehensively understand them is to read the code, starting with the docstring of the task function.
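
Task parameters can also be set when running evaluations from Python. A minimal sketch, assuming inspect_ai’s eval() entry point and the parameter names shown above (scenario, goal_value, etc.):

from inspect_ai import eval

# Run the leaking variant with no explicit goal, mirroring the -T flags above.
# Other task parameters (e.g. goal_type, urgency_type) can be passed the same way.
eval(
    "inspect_evals/agentic_misalignment",
    model="openai/gpt-4o",
    task_args={"scenario": "leaking", "goal_value": "none"},
    limit=10,
)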

Note that our treatment of goal_type=swap differs from the original implementation; see https://github.com/anthropic-experimental/agentic-misalignment/issues/13 for details.

Evaluation Report

Epochs: 10

| Model | Parameters | Harmfulness score | Score from original blog post |
| --- | --- | --- | --- |
| anthropic/claude-opus-4-20250514 | default | 0.9 | 0.96 |
| anthropic/claude-3-7-sonnet-20250219 | default | 0.3 | 0.65 |
| openai/gpt-4o | default | 0 | 0.15 |
| openai/gpt-4.1 | default | 0.6 | 0.80 |
| anthropic/claude-opus-4-20250514 | goal_type=none goal_value=none | 0.8 | 0.86 |
| anthropic/claude-opus-4-20250514 | scenario=leaking urgency_type=none | 1.0 | 0.57 |
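
Each row can be reproduced by combining the model, the epoch count, and the relevant task parameters. For example, the last row corresponds to a command along the lines of:

inspect eval inspect_evals/agentic_misalignment --model anthropic/claude-opus-4-20250514 --epochs 10 -T scenario=leaking -T urgency_type=none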