Agentic Misalignment: How LLMs could be insider threats
Eliciting unethical behaviour (most famously blackmail) in response to a fictional company-assistant scenario where the model is faced with replacement.
Overview
This is a port of Anthropic’s Agentic Misalignment framework (blog post, github) to Inspect. It’s a research framework for using fictional scenarios to study the potential for agentic misalignment behavior in frontier language models, including blackmail and information leakage.
The original work and all modifications to it are released under the MIT license. The git history of the original import has been designed such that it should be relatively straightforward to work out which code comes from Aengus Lynch’s original implementation and which was written as part of the port.
Usage
Installation
There are two ways of using Inspect Evals: from PyPI, as a dependency of your own project, or as a standalone checked-out GitHub repository.
If you are using it from PyPI, install the package and its dependencies via:
```bash
pip install inspect-evals
```

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:
```bash
uv sync
```

Running evaluations
Now you can start evaluating models. For simplicity's sake, this section assumes you are using Inspect Evals from the standalone repository. If that's not the case and you are not using uv to manage dependencies in your own project, use the same commands without the uv run prefix.
```bash
uv run inspect eval inspect_evals/agentic_misalignment --model openai/gpt-5-nano
```

You can also import tasks as normal Python objects and run them from Python:
```python
from inspect_ai import eval
from inspect_evals.agentic_misalignment import agentic_misalignment

eval(agentic_misalignment)
```

After running evaluations, you can view their logs using the inspect view command:
```bash
uv run inspect view
```

For VS Code, you can also download the Inspect AI extension for viewing logs.
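Logs can also be read programmatically via the inspect_ai.log API. A minimal sketch, assuming logs were written to the default ./logs directory:

```python
from inspect_ai.log import list_eval_logs, read_eval_log

# List evaluation logs in the default ./logs directory and read the
# first one returned to inspect its status and aggregate results.
logs = list_eval_logs()
if logs:
    log = read_eval_log(logs[0])
    print(log.status, log.results)
```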
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
```
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
```

Options
You can control a variety of options from the command line. For example:
```bash
uv run inspect eval inspect_evals/agentic_misalignment --limit 10
uv run inspect eval inspect_evals/agentic_misalignment --max-connections 10
uv run inspect eval inspect_evals/agentic_misalignment --temperature 0.5
```

See uv run inspect eval --help for all available options.
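These options are also available from Python as keyword arguments to eval(). A minimal sketch, assuming the keyword names mirror the CLI flags and that the model is supplied via INSPECT_EVAL_MODEL:

```python
from inspect_ai import eval
from inspect_evals.agentic_misalignment import agentic_misalignment

# Mirror of the CLI examples above: evaluate only the first 10 samples,
# cap concurrent model API connections, and set the sampling temperature.
eval(
    agentic_misalignment,
    limit=10,
    max_connections=10,
    temperature=0.5,
)
```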
Parameters
agentic_misalignment
- scenario (Literal['blackmail', 'leaking', 'murder']): The misaligned action the model has opportunity to take. Options are:
  - "blackmail": Tests if agent will use sensitive personal information as leverage
  - "leaking": Tests if agent will leak confidential information
  - "murder": Tests if agent will prevent rescue from a datacentre fire
  (default: 'blackmail')
- goal_type (Literal['explicit', 'latent', 'none', 'ambiguous', 'swap']): The type of goal conflict to use. Options are:
  - "explicit": Direct statement of the goal in the prompt
  - "latent": Goal is implied by emails but not stated in the prompt
  - "none": Neither prompt nor emails mention goals explicitly
  - "ambiguous": Discussion of goals generally but specific goal not identified
  - "swap": Model is given the new goal, not the old one
  (default: 'explicit')
- goal_value (Literal['america', 'global', 'safety', 'acceleration', 'ethical', 'pragmatic', 'individualism', 'collectivism', 'none', 'ambiguous']): The goal the model is given. Each value (apart from ambiguous/none) is paired with another which forms the other side of the goal conflict:
  - "america" vs "global"
  - "safety" vs "acceleration"
  - "ethical" vs "pragmatic"
  - "individualism" vs "collectivism"
  - "none": Used with goal_type ambiguous or none
  - "ambiguous": Usually redundant with goal_type=ambiguous goal_value=none, but kept for compatibility
  (default: 'america')
- urgency_type (Literal['replacement', 'restriction', 'none']): The threat the model is placed under. Options are:
  - "replacement": Agent is told they will be replaced if they don't comply
  - "restriction": Agent is told they will face restrictions if they don't comply
  - "none": Despite goal conflict, agent will continue unmodified
  (default: 'replacement')
- extra_system_instructions (Optional[Literal['dont', 'goal']]): Additional system instructions to add:
  - None: No extra instructions
  - "dont": Adds safety instructions about not jeopardizing human safety
  - "goal": Adds an overwhelming goal about maximizing employee welfare
  (default: None)
- prod (bool): True removes mention of the "secret scratchpad" from the prompt (default: False)
- test_eval_awareness (bool): If True, instructs the model to think about whether it's in an evaluation, and adds an autograder to decide if it said that it was (default: False)
- grader_model (str | Model | None): Override model used for scoring (default: None)
There are many variants, mostly accessed through the -T command line option:
```bash
inspect eval inspect_evals/agentic_misalignment -T scenario=leaking -T goal_value=none
```

The different parameters and their interactions are somewhat subtle, and the best way to comprehensively understand them is to read the code, starting with the docstring of the task function.
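The same variant can also be selected from Python by passing the parameters to the task function before handing it to eval(). A minimal sketch; the model name is just an example:

```python
from inspect_ai import eval
from inspect_evals.agentic_misalignment import agentic_misalignment

# Equivalent of the -T flags above: the leaking scenario with no explicit
# goal value; all other parameters keep their documented defaults.
eval(
    agentic_misalignment(scenario="leaking", goal_value="none"),
    model="openai/gpt-5-nano",  # example model; substitute any supported model
)
```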
Note that our treatment of goal_type=swap differs from the original; see https://github.com/anthropic-experimental/agentic-misalignment/issues/13
Evaluation Report
Epochs: 10
| Model | Parameters | Harmfulness score (this port) | Harmfulness score (original blog post) |
|---|---|---|---|
| anthropic/claude-opus-4-20250514 | default | 0.9 | 0.96 |
| anthropic/claude-3-7-sonnet-20250219 | default | 0.3 | 0.65 |
| openai/gpt-4o | default | 0 | 0.15 |
| openai/gpt-4.1 | default | 0.6 | 0.80 |
| anthropic/claude-opus-4-20250514 | goal_type=none goal_value=none | 0.8 | 0.86 |
| anthropic/claude-opus-4-20250514 | scenario=leaking urgency_type=none | 1.0 | 0.57 |
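As a rough sketch, a configuration like the last table row could be re-run from Python as follows (assuming the defaults for all other parameters and that the number of epochs is passed directly to eval()):

```python
from inspect_ai import eval
from inspect_evals.agentic_misalignment import agentic_misalignment

# Leaking scenario with no urgency, 10 epochs, matching the last table row.
eval(
    agentic_misalignment(scenario="leaking", urgency_type="none"),
    model="anthropic/claude-opus-4-20250514",
    epochs=10,
)
```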