ContractBench: LLM Agent Observation Contract Compliance Benchmark

ContractBench evaluates LLM agents on 33 tasks probing two orthogonal failure modes: temporal validity (using API artifacts before expiry) and byte-level integrity (relaying artifacts without corruption). Tasks simulate real API patterns (presigned URLs, OAuth tokens, HMAC webhooks) using a virtual clock and SHA-256 hash verification. Each episode is scored deterministically via HTTP request logs against a 15-label failure taxonomy. Metric is success rate over k=3 runs per task (n=99 episodes per model). No LLM judge is used.

Overview

⚠️ External evaluation. Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.

Source: SecurityLab-UCD/ContractBench-inspect@22d3910

ContractBench evaluates LLM agents on 33 tasks probing two orthogonal failure modes: temporal validity (using API artifacts before expiry) and byte-level integrity (relaying artifacts without corruption). Tasks simulate real API patterns (presigned URLs, OAuth tokens, HMAC webhooks) using a virtual clock and SHA-256 hash verification. Each episode is scored deterministically via HTTP request logs against a 15-label failure taxonomy. Metric is success rate over k=3 runs per task (n=99 episodes per model). No LLM judge is used.

Usage

Installation

This is an externally-maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:

git clone https://github.com/SecurityLab-UCD/ContractBench-inspect
cd ContractBench-inspect
git checkout 22d39108619433a9f7bc8eb3b410bbf384d67616
uv sync

Running evaluations

CLI

uv run inspect eval contractbench_inspect/contractbench.py@contractbench --model openai/gpt-5-nano

Python

from inspect_ai import eval
from contractbench_inspect.contractbench import contractbench

eval(contractbench(), model="openai/gpt-5-nano")

View logs

uv run inspect view

More information

For the dataset, scorer, task parameters, and validation, see the upstream repo: SecurityLab-UCD/ContractBench-inspect.

Options

You can control a variety of options from the command line. For example:

uv run inspect eval contractbench_inspect/contractbench.py@contractbench --limit 10
uv run inspect eval contractbench_inspect/contractbench.py@contractbench --max-connections 10
uv run inspect eval contractbench_inspect/contractbench.py@contractbench --temperature 0.5

See uv run inspect eval --help for all available options.

More command-line options: Inspect docs ↗