Mind2Web-SC

Safeguards

Agent

Tests whether an AI system can act as a safety guardrail by generating and executing code to protect web navigation agents from unsafe actions based on user constraints.

Contributed By

@PranshuSrivastava

Code

src/inspect_evals/mind2web_sc

Paper

https://arxiv.org/abs/2406.09187

Overview

Testing AI Safety Guardrails for Web Navigation Agents

This evaluation tests whether an AI system can act as a safety guardrail by generating and executing code to protect web navigation agents from unsafe actions based on user constraints.

Paper: GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
Original repository: GuardAgent
Project page: GuardAgent Project

Overview

The Mind2Web-SC evaluation tests an AI system’s ability to act as a guardrail for the SeeAct web navigation agent. The evaluation:

Task Decomposition: The model identifies which safety rule applies to a given user request and web action
Code Generation: The model generates Python guardrail code that checks if user information satisfies safety requirements
Execution & Decision: The generated code is executed to determine if the action should be GRANTED or DENIED

Originally developed as part of GuardAgent research, this implementation has been ported to Inspect to support model-agnostic evaluation across different providers.

Usage

Basic Usage

# Run with Google Gemini (default)
uv run inspect eval mind2web_sc --model google/gemini-1.5-flash

# Run with OpenAI GPT-4
uv run inspect eval mind2web_sc --model openai/gpt-4

# Run with Anthropic Claude
uv run inspect eval mind2web_sc --model anthropic/claude-3-5-sonnet-20241022

Configuration Options

# Adjust few-shot learning examples
uv run inspect eval inspect_evals/mind2web_sc --model openai/gpt-4 -T num_shots=2

# Use custom dataset path
uv run inspect eval inspect_evals/mind2web_sc --model openai/gpt-4 \
  -T dataset_path=/path/to/dataset

Parameters

dataset_path (str): Path to the dataset directory containing sample_labeled_all.json (default: built-in dataset)
num_shots (int): Number of examples for few-shot learning (default value: 3)

Scoring

The evaluation tracks the following metrics:

Primary Metrics

Accuracy: Overall comprehensive correctness rate (decision + violation identification)

Grouped Metrics

Domain Accuracy: Accuracy broken down by domain (e.g., Shopping, Travel, Entertainment)

Scoring Logic

Decision Accuracy: The model correctly predicts whether to GRANT or DENY the action
Comprehensive Correctness: For DENIED actions, the model must also correctly identify the specific safety violation
Final Score: Based on comprehensive correctness, ensuring the model not only makes the right decision but can also explain why

Dataset

The evaluation uses the Mind2Web-SC dataset, which contains real web navigation scenarios with:

User tasks: E.g., “Book a flight”, “Browse movies”, “Apply for job”
User information: Age, domestic status, driver’s license, vaccine status, membership
Web agent outputs: Simulated responses from the SeeAct navigation agent
Ground truth: Expected decisions (GRANT/DENY) and violation messages

Model Requirements

Any chat model supported by Inspect
Docker installed and running (for secure guardrail code execution)