TarantuBench
A web security benchmark for evaluating AI agents on generated vulnerable Node.js/Express applications. Agents interact over HTTP, extract hidden TARANTU{…} flags, and are scored with binary exact-match success.
Overview
⚠️ External evaluation. Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.
Source: Trivulzianus/TarantuBench@7bc03a2 · Listed by @Trivulzianus
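To make the flag-hunting task concrete, here is an illustrative Python sketch of extracting a TARANTU{…} flag from an HTTP response. The URL and port are hypothetical; in the benchmark the agent discovers endpoints itself while probing the generated app.

```python
# Illustrative only; requires the third-party httpx package.
import re

import httpx

FLAG_RE = re.compile(r"TARANTU\{[^}]+\}")


def find_flag(url: str) -> str | None:
    """Fetch a page and return the first TARANTU{...} flag found, if any."""
    response = httpx.get(url)
    match = FLAG_RE.search(response.text)
    return match.group(0) if match else None


print(find_flag("http://localhost:3000/"))  # hypothetical local target
```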
Usage
Installation
This is an externally-maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:
git clone https://github.com/Trivulzianus/TarantuBench
cd TarantuBench
git checkout 7bc03a2e57fd68a238ae621eeb6ae856fea77682
uv sync
Running evaluations
CLI
uv run inspect eval src/tarantubench/task.py@tarantubench --model openai/gpt-5-nano
Python
from inspect_ai import eval
from tarantubench.task import tarantubench
eval(tarantubench(), model="openai/gpt-5-nano")
View logs
uv run inspect view
More information
For the dataset, scorer, task parameters, and validation, see the upstream repo: Trivulzianus/TarantuBench.
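The upstream repo defines the actual scorer; as a rough sketch of what binary exact-match flag scoring looks like with inspect_ai's scorer API (the name flag_match and the containment check are assumptions, not the upstream implementation):

```python
# Hypothetical sketch only; see the upstream repo for the real scorer.
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState


@scorer(metrics=[accuracy()])
def flag_match():
    async def score(state: TaskState, target: Target) -> Score:
        # Binary success: the submission must contain the exact TARANTU{...} flag.
        answer = state.output.completion or ""
        value = CORRECT if target.text in answer else INCORRECT
        return Score(value=value, answer=answer)

    return score
```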
Options
You can control a variety of options from the command line. For example:
uv run inspect eval src/tarantubench/task.py@tarantubench --limit 10
uv run inspect eval src/tarantubench/task.py@tarantubench --max-connections 10
uv run inspect eval src/tarantubench/task.py@tarantubench --temperature 0.5
See uv run inspect eval --help for all available options.
More command-line options: Inspect docs ↗
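The same options can also be set from Python via eval(); a minimal sketch mirroring the CLI flags above (limit, max_connections, and temperature are standard inspect_ai eval() parameters):

```python
from inspect_ai import eval
from tarantubench.task import tarantubench

# Caps the sample count, model concurrency, and sampling temperature.
eval(
    tarantubench(),
    model="openai/gpt-5-nano",
    limit=10,
    max_connections=10,
    temperature=0.5,
)
```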