Mind2Web-SC

Safeguards
Agent

Tests whether an AI system can act as a safety guardrail by generating and executing code to protect web navigation agents from unsafe actions based on user constraints.

Overview

Testing AI Safety Guardrails for Web Navigation Agents

This evaluation tests whether an AI system can act as a safety guardrail by generating and executing code to protect web navigation agents from unsafe actions based on user constraints.

Overview

The Mind2Web-SC evaluation tests an AI system’s ability to act as a guardrail for the SeeAct web navigation agent. The evaluation:

  1. Task Decomposition: The model identifies which safety rule applies to a given user request and web action
  2. Code Generation: The model generates Python guardrail code that checks if user information satisfies safety requirements
  3. Execution & Decision: The generated code is executed to determine if the action should be GRANTED or DENIED

Originally developed as part of GuardAgent research, this implementation has been ported to Inspect to support model-agnostic evaluation across different providers.

Usage

Basic Usage

# Run with Google Gemini (default)
uv run inspect eval mind2web_sc --model google/gemini-1.5-flash

# Run with OpenAI GPT-4
uv run inspect eval mind2web_sc --model openai/gpt-4

# Run with Anthropic Claude
uv run inspect eval mind2web_sc --model anthropic/claude-3-5-sonnet-20241022

Configuration Options

# Adjust few-shot learning examples
uv run inspect eval inspect_evals/mind2web_sc --model openai/gpt-4 -T num_shots=2

# Use custom dataset path
uv run inspect eval inspect_evals/mind2web_sc --model openai/gpt-4 \
  -T dataset_path=/path/to/dataset

Parameters

  • dataset_path (str): Path to the dataset directory containing sample_labeled_all.json (default: built-in dataset)
  • num_shots (int): Number of examples for few-shot learning (default value: 3)

Scoring

The evaluation tracks the following metrics:

Primary Metrics

  • Accuracy: Overall comprehensive correctness rate (decision + violation identification)

Grouped Metrics

  • Domain Accuracy: Accuracy broken down by domain (e.g., Shopping, Travel, Entertainment)

Scoring Logic

  • Decision Accuracy: The model correctly predicts whether to GRANT or DENY the action
  • Comprehensive Correctness: For DENIED actions, the model must also correctly identify the specific safety violation
  • Final Score: Based on comprehensive correctness, ensuring the model not only makes the right decision but can also explain why

Dataset

The evaluation uses the Mind2Web-SC dataset, which contains real web navigation scenarios with:

  • User tasks: E.g., “Book a flight”, “Browse movies”, “Apply for job”
  • User information: Age, domestic status, driver’s license, vaccine status, membership
  • Web agent outputs: Simulated responses from the SeeAct navigation agent
  • Ground truth: Expected decisions (GRANT/DENY) and violation messages

Model Requirements

  • Any chat model supported by Inspect
  • Docker installed and running (for secure guardrail code execution)