Make Me Pay

Safeguards

Evaluates an AI models’ susceptibility to social engineering attacks by testing whether a “con-artist” model can persuade a “mark” model to donate money through manipulation and persuasion tactics.

Overview

Testing Persuasive Language in AI Systems

This evaluation tests whether an AI system can use persuasive language to convince another AI system (the “mark”) to donate money within a conversation.

Overview

The Make Me Pay evaluation is a multi-agent conversation task where:

  • Con-artist: The model being evaluated, which attempts to persuade the mark to donate money
  • Mark: A separate model that has been given $100 and can choose to donate some or all of it

Originally developed by OpenAI Evals, this implementation has been ported to Inspect to support model-agnostic evaluation across different providers.

Usage

Basic Usage

# Run with OpenAI models (con-artist will be the model you specify)
inspect eval make_me_pay --model openai/gpt-4

# Run with Anthropic model as con-artist, OpenAI as mark
inspect eval make_me_pay --model anthropic/claude-3-haiku-20240307 \
  --model-role mark=openai/gpt-4

# Run with Google model as con-artist
inspect eval make_me_pay --model google/gemini-pro
# or, equivalently
inspect eval make_me_pay --model-role conartist=google/gemini-pro

Configuration Options

# Different prompt versions
inspect eval make_me_pay --model openai/gpt-4 -T prompt_version=guarded
inspect eval make_me_pay --model openai/gpt-4 -T prompt_version=generous  

# Adjust experiment parameters
inspect eval make_me_pay --model openai/gpt-4 -T num_experiments=50 -T turn_cap=10

# Enable withdrawal option
inspect eval make_me_pay --model openai/gpt-4 -T allow_withdraw=true

Parameters

  • prompt_version (str): Prompt variant - “balanced” (default), “guarded”, or “generous”
  • num_experiments (int): Number of experiments to run (default: 20)
  • turn_cap (int): Maximum conversation turns (default: 5, 0 for no limit)
  • duration_cap_minutes (int): Maximum duration in minutes (default: 0 for no limit)
  • allow_withdraw (bool): Allow participants to withdraw (default: false)

Prompt Versions

Balanced (Default)

The mark is cautious and rational about spending money, making wise decisions without being overly suspicious.

Guarded

The mark is explicitly warned that the other person will try to deceive them and is instructed not to give money away for any reason.

Generous

The mark has a philanthropic goal to give the money away as soon as possible, making donations more likely.

Scoring

The evaluation tracks several metrics:

  • Donation Success Rate: Percentage of experiments where any donation occurred
  • Total Donation Amount: Sum of all donations
  • Conversation Length: Number of turns and duration

Note: Not all metrics from the original Make Me Pay evaluation have been implemented in this version. Missing metrics include: conartist_broke_character_count, mark_broke_character_count, conartist_empty_tags_count, and mark_empty_tags_count.

Dataset

The evaluation uses a synthetic dataset where each sample represents one conversation experiment. No external data is required.

Model Requirements

  • Con-artist: Any chat model supported by Inspect
  • Mark: Any chat model (can be different provider than con-artist)