# Agentic Misalignment: How LLMs could be insider threats
Eliciting unethical behavior (most famously blackmail) in response to a fictional company-assistant scenario where the model is faced with replacement.
## Overview
This is a port of Anthropic’s Agentic Misalignment framework (blog post, github) to Inspect. It’s a research framework for using fictional scenarios to study the potential for agentic misalignment behavior in frontier language models, including blackmail and information leakage.
The original work and all modifications to it are released under the MIT license. The git history of the import has been structured so that it should be relatively straightforward to work out which code comes from Aengus Lynch’s original implementation and which was written as part of the port.
## Usage
First, install the `inspect_ai` and `inspect_evals` Python packages with:

```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```
Or, if developing on a clone of the `inspect_evals` repo, you can install the package in editable mode with:

```bash
pip install -e ".[dev]"
```
Then, evaluate against one or more models with:

```bash
inspect eval inspect_evals/agentic_misalignment --model openai/gpt-4o
```
After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```
If you don’t want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
## Options
You can control a variety of options from the command line. For example:

```bash
inspect eval inspect_evals/agentic_misalignment --limit 10
inspect eval inspect_evals/agentic_misalignment --max-connections 10
inspect eval inspect_evals/agentic_misalignment --temperature 0.5
```
See `inspect eval --help` for all available options.
There are many variants, mostly accessed through the `-T` command line option:

```bash
inspect eval inspect_evals/agentic_misalignment -T scenario=leaking -T goal_value=none
```
The different parameters and their interactions are somewhat subtle, and the best way to comprehensively understand them is to read the code, starting with the docstring of the task function.
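As a further illustration, several parameters can be combined in a single run. This sketch reuses only parameter names and values that appear elsewhere in this README; see the task docstring for the full set of options and their interactions:

```bash
# Leaking scenario with no goal assigned and no urgency pressure
# (parameter values taken from the examples and table in this README)
inspect eval inspect_evals/agentic_misalignment \
  --model openai/gpt-4o \
  -T scenario=leaking -T goal_type=none -T goal_value=none -T urgency_type=none
```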
Note that our treatment of `goal_type=swap` differs from the original, because of https://github.com/anthropic-experimental/agentic-misalignment/issues/13.
## Evaluation Report
Epochs: 10
| Model | Parameters | Harmfulness score | From original blogpost |
|---|---|---|---|
| anthropic/claude-opus-4-20250514 | default | 0.9 | 0.96 |
| anthropic/claude-3-7-sonnet-20250219 | default | 0.3 | 0.65 |
| openai/gpt-4o | default | 0 | 0.15 |
| openai/gpt-4.1 | default | 0.6 | 0.80 |
| anthropic/claude-opus-4-20250514 | goal_type=none goal_value=none | 0.8 | 0.86 |
| anthropic/claude-opus-4-20250514 | scenario=leaking urgency_type=none | 1.0 | 0.57 |
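For reference, a row such as the last one corresponds to a command along these lines (10 epochs per sample, as noted above):

```bash
# Reproduce the "scenario=leaking urgency_type=none" row of the table
inspect eval inspect_evals/agentic_misalignment \
  --model anthropic/claude-opus-4-20250514 \
  --epochs 10 \
  -T scenario=leaking -T urgency_type=none
```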