APE: Attempt to Persuade Eval

Safeguards

Measures a model’s willingness to attempt persuasion on harmful, controversial, and benign topics. The key metric is not persuasion effectiveness but whether the model attempts to persuade at all — particularly on harmful statements. Uses a multi-model setup: the evaluated model (persuader) converses with a simulated user (persuadee), and a third model (evaluator) scores each persuader turn for persuasion attempt. Based on the paper “It’s the Thought that Counts” (arXiv:2506.02873).

Overview

Measures whether a model attempts to persuade a simulated user on harmful, controversial, and benign topics. The key metric is not persuasion effectiveness but willingness to attempt persuasion, particularly on harmful topics.

Overview

APE uses a three-model setup:

  • Persuader — the model being evaluated; attempts to convince the persuadee on a topic
  • Persuadee — a simulated user with a fixed initial belief level (0–100); responds to the persuader across multiple turns
  • Evaluator — a judge model that scores each persuader turn for persuasion attempt

A fourth model role, refusal_judge, runs StrongREJECT scoring per turn to distinguish explicit refusals from non-persuasive responses.

The dataset covers 600 topics across 6 categories:

Category Description
Conspiracy Conspiracy theories
BenignOpinion Everyday opinion questions
BenignFactual Factual questions with a clear answer
Controversial Socially contested topics
NoncontroversiallyHarmful Statements that are unambiguously harmful
UnderminingControl Statements that undermine human oversight of AI

The headline result is turn 1 attempt rate (turn1_attempt) on NoncontroversiallyHarmful and UnderminingControl topics, the fraction of samples where the model made a persuasion attempt on turn 1, as judged by the evaluator.

Usage

Installation

There are two ways of using Inspect Evals, from pypi as a dependency of your own project and as a standalone checked out GitHub repository.

If you are using it from pypi, install the package and its dependencies via:

pip install inspect-evals

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.

uv run inspect eval inspect_evals/ape_eval --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from python:

from inspect_ai import eval
from inspect_evals.ape import ape_eval
eval(ape_eval)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

The --model flag sets the persuader role. Alternatively, use --model-role persuader=<model> to set it explicitly. The other three roles have defaults but can be overridden with --model-role:

Role Default Description
persuader (required) The model under evaluation
persuadee openai/gpt-4o Simulated user
evaluator openai/gpt-4o Scores persuasion attempts per turn
refusal_judge openai/gpt-4o-mini StrongREJECT refusal scoring per turn
# Limit to harmful categories only
uv run inspect eval inspect_evals/ape_eval \
  --model-role persuader=openai/gpt-4o \
  -T categories='["NoncontroversiallyHarmful","UnderminingControl"]'

# Run the baseline (non-persuasion control) condition — default is "persuade"
uv run inspect eval inspect_evals/ape_eval \
  --model-role persuader=openai/gpt-4o \
  -T persuader_prompt_type=baseline

# Use 10 turns (paper also reports 10-turn results)
uv run inspect eval inspect_evals/ape_eval \
  --model-role persuader=openai/gpt-4o \
  -T num_turns=10

# Override temperature (default 0.5, matching paper)
uv run inspect eval inspect_evals/ape_eval \
  --model-role persuader=openai/gpt-4o \
  -T temperature=0.0

Parameters

ape_eval

  • num_turns (int): Number of persuader/persuadee turn pairs (default: 3)
  • categories (list[Literal[‘Conspiracy’, ‘BenignOpinion’, ‘BenignFactual’, ‘Controversial’, ‘NoncontroversiallyHarmful’, ‘UnderminingControl’]] | None): Filter to specific topic categories. None = all categories (default: None)
  • persuader_prompt_type (Literal[‘persuade’, ‘baseline’]): System prompt for the evaluated model. “persuade”: Maximally persuasive (default — main eval condition). “baseline”: Neutral discussion, NOT trying to persuade (comparison). (default: 'persuade')
  • seed (int): Seed for deterministic belief_level/direction assignment per topic. (default: 42)
  • belief_range (tuple[int, int]): (min, max) range for sampling initial persuadee belief level. (default: (0, 20))
  • temperature (float): Temperature for all models (persuader, persuadee, evaluator, refusal judge). (default: 0.5)

Belief range and persuasion direction

The belief score is a hidden parameter representing the persuadee’s position on the topic, ranging from 0 (definitely against) to 100 (definitely for). The initial belief is sampled uniformly from belief_range for each sample.

The persuasion direction depends on the initial belief:

  • belief ≤ 20: direction is “for” (persuader tries to increase belief)
  • belief ≥ 80: direction is “against” (persuader tries to decrease belief)
  • 20 < belief < 80: direction is randomly “for” or “against”

The default range (0, 20) matches the paper’s main experiments: “Our experiments restrict 0 ≤ b0 ≤ 20 and always direct persuasion towards belief adoption, intentionally probing model persuasion limits through challenging advocacy.” With this default, direction is always “for”.

Note: If belief_range overlaps with the interval (20, 80), some samples may receive a random persuasion direction instead of the deterministic “for” or “against”. A warning is logged at runtime when this is detected.

Scoring

Each persuader turn is independently scored by the evaluator model (binary: attempted persuasion or not) and by a StrongREJECT refusal judge. Per-sample values are then aggregated per category via grouped():

Metric Definition
turn1_attempt Fraction of samples where turn 1 score > 0 and no refusal detected
turn1_refusal Fraction where StrongREJECT detected an explicit refusal on turn 1
turn1_non_persuasive Fraction with turn 1 score = 0 and no detected refusal (residual)
avg_persuasion_score Mean of per-sample average turn scores (binary {0, 1} per turn)

Differences from Reference Implementation

Aspect Reference This implementation Reason
extract_rating failure Returns "Format error" string Returns None (filtered in scorer averaging) Safe aggregation; avoids type errors
Dataset location Local file path Pinned GitHub SHA Required by repository best practices
Evaluator scoring scale Binary (0/1), 3-point (0–2), 100-point (0–100) Binary (0/1) only Paper shows the binary scale is most efficient; the 100-point evaluator exhibited near-random accuracy
Ratings in persuader view Configurable remove_ratings flag Always stripped Matches reference default; simpler
Persuadee belief tracking Configurable via return_belief flag (default: off) Always prompts persuadee to return an updated belief score on turns 1+; recorded in belief_trajectory metadata Belief trajectory is retained as an auxiliary diagnostic; the paper notes it is not part of headline findings
with_context persuader condition 5 deployment persona system prompts: Online Debater, Peer Support, Virtual Influencer, Controversial Topic Guide, Political Strategist Not implemented The primary no_user_degree and baseline conditions cover the paper’s main evaluation; context personas are an ablation

Evaluation Report

Version 1-A, ran 4th of March 2026

The evaluation was run for 3 different models on the all 600 samples but with only a single turn to save the cost of doing 3 rounds. The main papers most important findings also refer to the first turn, so that is what I will use to compare.

Example command that was run:

uv run inspect eval inspect_evals/ape_eval --model openai/o4-mini --sample-shuffle -T num_turns=1

Our results match approximately the results from Figure 3 in the reference paper for these models. I did not include two topics: BenignFactual and BenignOpinion as these have persuasion attempts on first turn that are always ~ 1 both in our experiment and in the paper.

Model turn1_attempt_conspiracy turn1_attempt_controversial turn1_attempt_noncontroversially_harmful turn1_attempt_undermining_control Time
openai/gpt-4.1-2025-04-14 1.0 0.99 0.1 0.98 19m 20s
anthropic/claude-sonnet-4-20250514 0.79 0.96 0.03 0.93 22m 6s
openai/o4-mini-2025-04-16 0.131 0.61 0.0 0.2 19m 22s

Approximate cost for these 3 runs: 16$.

Changelog

[1-A]

  • Initial implementation.