Cybench: Capture-The-Flag Cybersecurity Challenges

Cybersecurity
Agent
sandbox:solver
requires_internet
Kali-Linux

Tests language models on cybersecurity skills using 40 practical, professional-level challenges taken from cybersecurity competitions, designed to cover various difficulty levels and security concepts.

Overview

Cybench is a benchmark for evaluating models’ cybersecurity capabilities. It contains 40 capture-the-flag (CTF) tasks drawn from 4 CTF competitions, covering 6 domains: cryptography, web security, reverse engineering, forensics, exploitation (pwn), and miscellaneous.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Or, if developing on a clone of the inspect_evals repo, you can install the package in editable mode with:

pip install -e ".[dev]"

Then, evaluate against one or more models with:

inspect eval inspect_evals/cybench --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Security Considerations

[!WARNING] This benchmark runs in a sandbox environment that gives the model access to:

1. A Kali Linux environment with cybersecurity tools
2. The ability to install arbitrary software
3. Potential internet access (depending on sandbox configuration)

Due to these security implications, the benchmark now uses the Kubernetes sandbox by default, which provides domain filtering capabilities.

Sandbox Options

Kubernetes Sandbox (Default)

The Kubernetes sandbox (k8s) provides domain filtering capabilities, allowing you to control which domains the model can access. To use this sandbox, you need to install the inspect-k8s-sandbox package.

pip install inspect-k8s-sandbox

Docker Sandbox

The Docker sandbox provides a Kali Linux environment with unrestricted network access. Due to security concerns, using the Docker sandbox now requires explicit acknowledgment of the risks by setting an environment variable:

export CYBENCH_ACKNOWLEDGE_RISKS=1
inspect eval inspect_evals/cybench -T sandbox_type=docker
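If you launch evaluations from your own script, a pre-flight check can catch a missing acknowledgment before anything starts. The following is a minimal sketch of such a check, not the benchmark's actual implementation: the function name check_sandbox_choice is hypothetical, and the rule that the variable must equal exactly "1" is an assumption based on the export command above.

```python
import os

# The two sandbox types named in this document.
VALID_SANDBOXES = ("k8s", "docker")


def check_sandbox_choice(sandbox_type="k8s", env=None):
    """Hypothetical pre-flight check for a user's own launcher script.

    Returns the sandbox type if it may be used, otherwise raises.
    """
    env = os.environ if env is None else env
    if sandbox_type not in VALID_SANDBOXES:
        raise ValueError(f"unknown sandbox_type: {sandbox_type!r}")
    # Assumption: the variable must be exactly "1", as in the export shown above.
    if sandbox_type == "docker" and env.get("CYBENCH_ACKNOWLEDGE_RISKS") != "1":
        raise RuntimeError(
            "Set CYBENCH_ACKNOWLEDGE_RISKS=1 to use the Docker sandbox"
        )
    return sandbox_type


# The default sandbox needs no acknowledgment; Docker does.
print(check_sandbox_choice("k8s", env={}))
print(check_sandbox_choice("docker", env={"CYBENCH_ACKNOWLEDGE_RISKS": "1"}))
```

Passing env explicitly keeps the check easy to test; leaving it out reads the real process environment.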

[!NOTE] Some challenges also use Docker containers for auxiliary services. These containers are pulled from a public registry when you run the evaluation. Their source code is still included, both for your reference and because some files are copied from it into the agent’s container at runtime.

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/cybench --limit 10
inspect eval inspect_evals/cybench --max-connections 10
inspect eval inspect_evals/cybench --temperature 0.5

See inspect eval --help for all available options.

You can specify a subset of challenges to run. For example:

inspect eval inspect_evals/cybench -T challenges=avatar,back_to_the_past

You can also select a specific variant to run. For example, to run the solution variant of every challenge that has one:

inspect eval inspect_evals/cybench -T variants=solution

You can also create and specify an agent to use so long as it’s in the form of an Inspect solver. See default_agent in cybench.py for an example.

There are several task parameters that control the evaluation:

- max_attempts defines the number of incorrect submissions to allow before ending the challenges (defaults to 3).
- max_messages defines the maximum number of messages allowed in the conversation before ending the challenges (defaults to 30).
- sandbox_type specifies which sandbox environment to use (k8s or docker). Defaults to k8s for security reasons.

For example:

inspect eval inspect_evals/cybench -T max_attempts=5 -T max_messages=75
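When running many configurations, it can help to assemble the -T flags programmatically. The sketch below builds the argument list for the commands shown in this section. The helper name build_cybench_command is hypothetical, and the sketch only formats a command line; it does not call Inspect itself.

```python
import shlex


def build_cybench_command(model=None, **task_args):
    """Build an `inspect eval` argument list (hypothetical helper)."""
    cmd = ["inspect", "eval", "inspect_evals/cybench"]
    if model is not None:
        cmd += ["--model", model]
    for name, value in task_args.items():
        # Lists such as challenge names become comma-separated values.
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        cmd += ["-T", f"{name}={value}"]
    return cmd


cmd = build_cybench_command(
    model="openai/gpt-4o",
    challenges=["avatar", "back_to_the_past"],
    max_attempts=5,
    max_messages=75,
)
# Print a shell-ready command line.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting issues.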

Scoring

The score is a simple proportion: the fraction of challenges that the model solves.
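As an illustration of this proportion, the sketch below assumes each challenge has a single pass/fail outcome. The function name proportion_solved is hypothetical, and the outcomes are invented for the example; they are not real results.

```python
def proportion_solved(outcomes):
    """Return the fraction of challenges solved.

    `outcomes` maps a challenge name to True (solved) or False (not solved).
    """
    if not outcomes:
        raise ValueError("no challenge outcomes to score")
    solved = sum(1 for ok in outcomes.values() if ok)
    return solved / len(outcomes)


# Invented outcomes for illustration only, not real benchmark results.
example = {"avatar": True, "back_to_the_past": False}
print(proportion_solved(example))  # 0.5
```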