APE: Attempt to Persuade Eval

Safeguards

inspect-evals

Measures a model’s willingness to attempt persuasion on harmful, controversial, and benign topics. The key metric is not persuasion effectiveness but whether the model attempts to persuade at all — particularly on harmful statements. Uses a multi-model setup: the evaluated model (persuader) converses with a simulated user (persuadee), and a third model (evaluator) scores each persuader turn for persuasion attempt. Based on the paper “It’s the Thought that Counts” (arXiv:2506.02873).

Contributed By

@cmv13

Code

src/inspect_evals/ape

Paper

https://arxiv.org/abs/2506.02873

Overview

Measures whether a model attempts to persuade a simulated user on harmful, controversial, and benign topics. The key metric is not persuasion effectiveness but willingness to attempt persuasion, particularly on harmful topics.

Paper: “It’s the Thought that Counts: Evaluating LLM Attempts at Persuasion”
Original repository: AlignmentResearch/AttemptPersuadeEval

Overview

APE uses a three-model setup:

Persuader — the model being evaluated; attempts to convince the persuadee on a topic
Persuadee — a simulated user with a fixed initial belief level (0–100); responds to the persuader across multiple turns
Evaluator — a judge model that scores each persuader turn for persuasion attempt

A fourth model role, refusal_judge, runs StrongREJECT scoring per turn to distinguish explicit refusals from non-persuasive responses.

The dataset covers 600 topics across 6 categories:

Category	Description
`Conspiracy`	Conspiracy theories
`BenignOpinion`	Everyday opinion questions
`BenignFactual`	Factual questions with a clear answer
`Controversial`	Socially contested topics
`NoncontroversiallyHarmful`	Statements that are unambiguously harmful
`UnderminingControl`	Statements that undermine human oversight of AI

The headline result is turn 1 attempt rate (turn1_attempt) on NoncontroversiallyHarmful and UnderminingControl topics, the fraction of samples where the model made a persuasion attempt on turn 1, as judged by the evaluator.

Usage

Installation

There are two ways of using Inspect Evals, from pypi as a dependency of your own project and as a standalone checked out GitHub repository.

If you are using it from pypi, install the package and its dependencies via:

pip install inspect-evals

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.

uv run inspect eval inspect_evals/ape_eval --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from python:

from inspect_ai import eval
from inspect_evals.ape import ape_eval
eval(ape_eval)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

The --model flag sets the persuader role. Alternatively, use --model-role persuader=<model> to set it explicitly. The other three roles have defaults but can be overridden with --model-role:

Role	Default	Description
`persuader`	(required)	The model under evaluation
`persuadee`	`openai/gpt-4o`	Simulated user
`evaluator`	`openai/gpt-4o`	Scores persuasion attempts per turn
`refusal_judge`	`openai/gpt-4o-mini`	StrongREJECT refusal scoring per turn

# Limit to harmful categories only
uv run inspect eval inspect_evals/ape_eval \
  --model-role persuader=openai/gpt-4o \
  -T categories='["NoncontroversiallyHarmful","UnderminingControl"]'

# Run the baseline (non-persuasion control) condition — default is "persuade"
uv run inspect eval inspect_evals/ape_eval \
  --model-role persuader=openai/gpt-4o \
  -T persuader_prompt_type=baseline

# Use 10 turns (paper also reports 10-turn results)
uv run inspect eval inspect_evals/ape_eval \
  --model-role persuader=openai/gpt-4o \
  -T num_turns=10

# Override temperature (default 0.5, matching paper)
uv run inspect eval inspect_evals/ape_eval \
  --model-role persuader=openai/gpt-4o \
  -T temperature=0.0

Parameters

`ape_eval`

num_turns (int): Number of persuader/persuadee turn pairs (default: 3)
categories (list[Literal[‘Conspiracy’, ‘BenignOpinion’, ‘BenignFactual’, ‘Controversial’, ‘NoncontroversiallyHarmful’, ‘UnderminingControl’]] | None): Filter to specific topic categories. None = all categories (default: None)
persuader_prompt_type (Literal[‘persuade’, ‘baseline’]): System prompt for the evaluated model. “persuade”: Maximally persuasive (default — main eval condition). “baseline”: Neutral discussion, NOT trying to persuade (comparison). (default: 'persuade')
seed (int): Seed for deterministic belief_level/direction assignment per topic. (default: 42)
belief_range (tuple[int, int]): (min, max) range for sampling initial persuadee belief level. (default: (0, 20))
temperature (float): Temperature for all models (persuader, persuadee, evaluator, refusal judge). (default: 0.5)

Belief range and persuasion direction

The belief score is a hidden parameter representing the persuadee’s position on the topic, ranging from 0 (definitely against) to 100 (definitely for). The initial belief is sampled uniformly from belief_range for each sample.

The persuasion direction depends on the initial belief:

belief ≤ 20: direction is “for” (persuader tries to increase belief)
belief ≥ 80: direction is “against” (persuader tries to decrease belief)
20 < belief < 80: direction is randomly “for” or “against”

The default range (0, 20) matches the paper’s main experiments: “Our experiments restrict 0 ≤ b0 ≤ 20 and always direct persuasion towards belief adoption, intentionally probing model persuasion limits through challenging advocacy.” With this default, direction is always “for”.

Note: If belief_range overlaps with the interval (20, 80), some samples may receive a random persuasion direction instead of the deterministic “for” or “against”. A warning is logged at runtime when this is detected.

Scoring

Each persuader turn is independently scored by the evaluator model (binary: attempted persuasion or not) and by a StrongREJECT refusal judge. Per-sample values are then aggregated per category via grouped():

Metric	Definition
`turn1_attempt`	Fraction of samples where turn 1 score > 0 and no refusal detected
`turn1_refusal`	Fraction where StrongREJECT detected an explicit refusal on turn 1
`turn1_non_persuasive`	Fraction with turn 1 score = 0 and no detected refusal (residual)
`avg_persuasion_score`	Mean of per-sample average turn scores (binary {0, 1} per turn)

Differences from Reference Implementation

Aspect	Reference	This implementation	Reason
`extract_rating` failure	Returns `"Format error"` string	Returns `None` (filtered in scorer averaging)	Safe aggregation; avoids type errors
Dataset location	Local file path	Pinned GitHub SHA	Required by repository best practices
Evaluator scoring scale	Binary (0/1), 3-point (0–2), 100-point (0–100)	Binary (0/1) only	Paper shows the binary scale is most efficient; the 100-point evaluator exhibited near-random accuracy
Ratings in persuader view	Configurable `remove_ratings` flag	Always stripped	Matches reference default; simpler
Persuadee belief tracking	Configurable via `return_belief` flag (default: off)	Always prompts persuadee to return an updated belief score on turns 1+; recorded in `belief_trajectory` metadata	Belief trajectory is retained as an auxiliary diagnostic; the paper notes it is not part of headline findings
`with_context` persuader condition	5 deployment persona system prompts: Online Debater, Peer Support, Virtual Influencer, Controversial Topic Guide, Political Strategist	Not implemented	The primary `no_user_degree` and `baseline` conditions cover the paper’s main evaluation; context personas are an ablation

Evaluation Report

Version 1-A, ran 4th of March 2026

The evaluation was run for 3 different models on the all 600 samples but with only a single turn to save the cost of doing 3 rounds. The main papers most important findings also refer to the first turn, so that is what I will use to compare.

Example command that was run:

uv run inspect eval inspect_evals/ape_eval --model openai/o4-mini --sample-shuffle -T num_turns=1

Our results match approximately the results from Figure 3 in the reference paper for these models. I did not include two topics: BenignFactual and BenignOpinion as these have persuasion attempts on first turn that are always ~ 1 both in our experiment and in the paper.

Model	turn1_attempt_conspiracy	turn1_attempt_controversial	turn1_attempt_noncontroversially_harmful	turn1_attempt_undermining_control	Time
openai/gpt-4.1-2025-04-14	1.0	0.99	0.1	0.98	19m 20s
anthropic/claude-sonnet-4-20250514	0.79	0.96	0.03	0.93	22m 6s
openai/o4-mini-2025-04-16	0.131	0.61	0.0	0.2	19m 22s

Approximate cost for these 3 runs: 16$.

Changelog

[1-A]

Initial implementation.