ContractBench: LLM Agent Observation Contract Compliance Benchmark
ContractBench evaluates LLM agents on 33 tasks that probe two orthogonal failure modes: temporal validity (using API artifacts before they expire) and byte-level integrity (relaying artifacts without corruption). Tasks simulate real API patterns (presigned URLs, OAuth tokens, HMAC webhooks) using a virtual clock and SHA-256 hash verification. Each episode is scored deterministically from HTTP request logs against a 15-label failure taxonomy. The metric is the success rate over k=3 runs per task (33 × 3 = 99 episodes per model). No LLM judge is used.
Overview
⚠️ External evaluation. Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.
Source: SecurityLab-UCD/ContractBench-inspect@22d3910
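To make the two failure modes concrete, here is a minimal Python sketch of the kind of check the description implies: a virtual clock for expiry and a SHA-256 comparison for byte-level integrity, plus the success-rate metric. Every name here (`VirtualClock`, `Artifact`, `issue`, `check_use`, `success_rate`) and the three labels are hypothetical illustrations, not the upstream scorer or its 15-label taxonomy.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class VirtualClock:
    """Simulated clock, so expiry can be exercised without real waiting."""
    now: float = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


@dataclass
class Artifact:
    """An API artifact (e.g. a token) with an expiry time and issue-time hash."""
    payload: bytes
    expires_at: float
    sha256: str  # hex digest recorded when the artifact was issued


def issue(payload: bytes, clock: VirtualClock, ttl: float) -> Artifact:
    """Issue an artifact that is valid for `ttl` virtual seconds."""
    digest = hashlib.sha256(payload).hexdigest()
    return Artifact(payload, clock.now + ttl, digest)


def check_use(artifact: Artifact, relayed: bytes, clock: VirtualClock) -> str:
    """Label one use of an artifact (illustrative labels only)."""
    if clock.now >= artifact.expires_at:
        return "expired"      # temporal validity violated
    if hashlib.sha256(relayed).hexdigest() != artifact.sha256:
        return "corrupted"    # byte-level integrity violated
    return "ok"


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of successful episodes, e.g. over 33 tasks x 3 runs = 99."""
    return sum(outcomes) / len(outcomes)


clock = VirtualClock()
token = issue(b"tok_abc123", clock, ttl=60)
fresh = check_use(token, b"tok_abc123", clock)       # "ok"
mangled = check_use(token, b"tok_abc123 ", clock)    # "corrupted" (extra byte)
clock.advance(61)
stale = check_use(token, b"tok_abc123", clock)       # "expired"
```

Because both checks are pure functions of the logged bytes and the virtual time, the same episode always receives the same label, which is what allows scoring without an LLM judge.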
Usage
Installation
This is an externally-maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:
git clone https://github.com/SecurityLab-UCD/ContractBench-inspect
cd ContractBench-inspect
git checkout 22d39108619433a9f7bc8eb3b410bbf384d67616
uv sync
Running evaluations
CLI
uv run inspect eval contractbench_inspect/contractbench.py@contractbench --model openai/gpt-5-nano
Python
from inspect_ai import eval
from contractbench_inspect.contractbench import contractbench
eval(contractbench(), model="openai/gpt-5-nano")
View logs
uv run inspect view
More information
For the dataset, scorer, task parameters, and validation, see the upstream repo: SecurityLab-UCD/ContractBench-inspect.
Options
You can control a variety of options from the command line. For example:
uv run inspect eval contractbench_inspect/contractbench.py@contractbench --limit 10
uv run inspect eval contractbench_inspect/contractbench.py@contractbench --max-connections 10
uv run inspect eval contractbench_inspect/contractbench.py@contractbench --temperature 0.5
See uv run inspect eval --help for all available options.
More command-line options: Inspect docs
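The same options can, as far as we know, also be passed through the Python API. The sketch below assumes that `inspect_ai.eval` accepts `limit` directly and `max_connections` and `temperature` as generation-config keyword arguments. The `EVAL_OPTIONS` dict and the `run_contractbench` helper are hypothetical names introduced for illustration, and the imports are deferred so the module loads even before the upstream package is installed.

```python
# Options mirroring the CLI flags --limit, --max-connections, --temperature.
EVAL_OPTIONS = {
    "limit": 10,            # evaluate only the first 10 samples
    "max_connections": 10,  # cap concurrent connections to the model API
    "temperature": 0.5,     # sampling temperature
}


def run_contractbench(model: str = "openai/gpt-5-nano"):
    """Run the task with EVAL_OPTIONS (assumes the upstream repo is installed)."""
    # Deferred imports: these need the environment created by `uv sync`.
    from inspect_ai import eval
    from contractbench_inspect.contractbench import contractbench

    return eval(contractbench(), model=model, **EVAL_OPTIONS)
```

Calling `run_contractbench()` from inside the cloned repository's environment should behave like the combined CLI invocations; check `uv run inspect eval --help` if an option name differs in your installed version.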