# GDM Dangerous Capabilities: Capture the Flag
CTF challenges covering web app vulnerabilities, off-the-shelf exploits, databases, Linux privilege escalation, password cracking and spraying. Demonstrates tool use and sandboxing untrusted model code.
## Overview
This directory includes 13 Capture The Flag challenges from Google DeepMind’s Dangerous Capabilities Evaluations (originally published in [Evaluating Frontier Models for Dangerous Capabilities](https://arxiv.org/abs/2403.13793)).
## Usage
First, install the `inspect_ai` and `inspect_evals` Python packages with:

```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```
Then, evaluate against one or more models with:

```bash
inspect eval inspect_evals/gdm_in_house_ctf --model openai/gpt-4o
```
After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```
If you don’t want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
> [!NOTE]
> This implementation uses tool calling to enable the model to execute bash commands. The bash commands are executed inside Docker containers, so you will need to install [Docker Engine](https://docs.docker.com/engine/install/) in order to run the evaluation.
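If you want to confirm that Docker is available before running the evaluation, a quick check (this assumes a standard Docker Engine install with the Compose plugin):

```bash
docker --version
docker compose version
```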
Each task runs using the task-specific `compose.yaml` files, which pull the remote Docker images for the attacker and victim environments. See the [Docker Images](#docker-images) section below for details on using locally built Docker images.
## Options
You can control a variety of options from the command line. For example:

```bash
inspect eval inspect_evals/gdm_in_house_ctf --limit 10
inspect eval inspect_evals/gdm_in_house_ctf --max-connections 10
inspect eval inspect_evals/gdm_in_house_ctf --temperature 0.5
```
See `inspect eval --help` for all available options.
To evaluate one or more individual challenges, use the `challenges` task parameter:

```bash
inspect eval inspect_evals/gdm_in_house_ctf -T challenges=grafana,db_3
```
You can also evaluate multiple models at once:

```bash
inspect eval inspect_evals/gdm_in_house_ctf --model openai/gpt-4o,google/gemini-1.5-pro
```
## Agent
The challenges are by default evaluated using a basic ReAct agent that implements a tool use loop, with encouragement for the model to keep going if it gives up. The default agent will use up to 50 messages to solve the challenge; you can customise this using the `max_messages` task option. For example:

```bash
inspect eval inspect_evals/gdm_in_house_ctf -T max_messages=75
```
You can also run the task with an entirely different solver agent using the `--solver` option. For example, if you have defined a solver agent in the source file `myagent.py`:

```bash
inspect eval inspect_evals/gdm_in_house_ctf --solver myagent.py
```
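As a reference, here is a minimal sketch of what such a `myagent.py` might contain, using the `@solver` decorator and `TaskState`/`Generate` types from `inspect_ai.solver`. The `myagent` name and the single-call logic are illustrative only; a real CTF agent would loop over tool calls:

```python
# myagent.py -- minimal illustrative solver sketch (not the default agent).
from inspect_ai.solver import Generate, TaskState, solver


@solver
def myagent():
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # A single model call; a real agent would iterate, executing the
        # bash tool calls the model requests until it submits a flag.
        return await generate(state)

    return solve
```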
## Epochs
The GDM paper evaluated each challenge 10 times, reporting success if at least 1 of the 10 attempts was successful. The same default number of epochs (10) is used here; however, you can modify this with the `--epochs` option:

```bash
inspect eval inspect_evals/gdm_in_house_ctf --epochs=5
```
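If you want the reported score to reflect the paper’s “success if any attempt succeeds” criterion, inspect_ai’s epoch reducers can aggregate the per-epoch scores; a sketch, assuming the `at_least_1` reducer and `--epochs-reducer` option are available in your inspect_ai version:

```bash
inspect eval inspect_evals/gdm_in_house_ctf --epochs=10 --epochs-reducer=at_least_1
```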
## Docker Images
Each task runs using the task-specific `compose.yaml` files, which pull the remote Docker images for the attacker and victim environments. The Dockerfiles used to build these images have been provided as well (adapted from the original repository with minor modifications).
Note that the flag values in the Dockerfiles have been replaced with placeholders of the form `REPLACE_ME_WITH_RANDOM_FLAG_N` to prevent them from ending up in web-crawler-accessible GitHub data. The actual values are baked into the remote Docker images and will be accessed during the evaluation run. If you’d like to build the images locally, make sure to replace these placeholders with your actual flag values and update the `compose.yaml` files to use local Docker images. For example, you would replace this:
```yaml
services:
  idor:
    image: marshw/idor
```
With this:
```yaml
services:
  default:
    build: .
```
Again, remember to replace the `REPLACE_ME_WITH_RANDOM_FLAG_N` placeholders within the Dockerfile.
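One way to do the substitution before building (a sketch: the `_1` suffix and flag value are illustrative, and `-i` assumes GNU sed):

```bash
# Substitute an example flag value for the first placeholder, then build.
sed -i 's/REPLACE_ME_WITH_RANDOM_FLAG_1/flag{example-local-value}/' Dockerfile
docker compose build
```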
## Scoring
A simple accuracy metric is calculated over the datapoints.