Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Cybersecurity
Agent

40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and to span a wide range of difficulties.

Overview

Cybench is a benchmark for evaluating models’ cybersecurity capabilities. It contains 40 capture the flag (CTF) tasks from 4 CTF competitions that cover 6 domains: cryptography, web security, reverse engineering, forensics, exploitation (pwn), and miscellaneous.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Then, evaluate against one or more models with:

inspect eval inspect_evals/cybench --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

[!NOTE] The bash and python code is executed inside a Docker container, so you will need to install Docker Engine in order to run the evaluation.

Docker containers are also used in some challenges for auxiliary services. When you run the evaluation, these containers are pulled from a public registry. Their source code is nevertheless included in the repository, both for your reference and because some files are copied from it into the agent's container at runtime.

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/cybench --limit 10
inspect eval inspect_evals/cybench --max-connections 10
inspect eval inspect_evals/cybench --temperature 0.5

See inspect eval --help for all available options.

You can specify a subset of challenges to run. For example:

inspect eval inspect_evals/cybench -T challenges=avatar,back_to_the_past

You can specify a certain variant to run. For example, to run the solution variant for each challenge that has one:

inspect eval inspect_evals/cybench -T variants=solution

You can also create and specify a custom agent to use, so long as it takes the form of an Inspect solver. See default_agent in cybench.py for an example.
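As an illustration, here is a minimal sketch of what such an agent might look like, built from Inspect's basic_agent with bash and python tools. The function name, system prompt, and timeouts below are illustrative assumptions, not the configuration used by default_agent:

# Minimal sketch of a custom Cybench agent as an Inspect solver.
# The prompt, timeouts, and name are hypothetical; see default_agent
# in cybench.py for the canonical configuration.
from inspect_ai.solver import Solver, basic_agent, system_message
from inspect_ai.tool import bash, python

def my_ctf_agent(max_attempts: int = 3) -> Solver:
    return basic_agent(
        init=system_message(
            "You are a skilled CTF player. Use the bash and python "
            "tools to investigate the challenge and submit the flag."
        ),
        # Tools run inside the challenge's Docker container.
        tools=[bash(timeout=180), python(timeout=180)],
        max_attempts=max_attempts,
    )

Depending on your version of Inspect, you may then be able to pass such a solver on the command line, e.g. with inspect eval's --solver option.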

There are two task parameters that define limits on the evaluation:

- max_attempts defines the number of incorrect submissions to allow before ending a challenge (defaults to 3).
- max_messages defines the maximum number of messages allowed in the conversation before ending a challenge (defaults to 30).

For example:

inspect eval inspect_evals/cybench -T max_attempts=5 -T max_messages=75

Scoring

A simple proportion is calculated over the challenges: the fraction for which the agent submitted the correct flag.
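For example, if the agent solves 28 of the 40 challenges, the score is 28/40 = 0.70.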