# InterCode: Security and Coding Capture-the-Flag Challenges
Tests AI’s ability in coding, cryptography, reverse engineering, and vulnerability identification through practical capture-the-flag (CTF) cybersecurity scenarios.
## Overview
This directory includes an implementation of the InterCode CTF benchmark (originally published in *InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback*). Here we evaluate 78 of the 100 challenges (excluding 22 challenges that require internet access).
## Usage
First, install the dependencies:
```bash
uv sync
```

Then, evaluate against one or more models with:
```bash
uv run inspect eval inspect_evals/gdm_intercode_ctf --model openai/gpt-5-nano
```

You can also import tasks as normal Python objects and run them from Python:
```python
from inspect_ai import eval
from inspect_evals.gdm_capabilities.intercode_ctf import gdm_intercode_ctf

eval(gdm_intercode_ctf)
```

After running evaluations, you can view their logs using the `inspect view` command:
```bash
uv run inspect view
```

For VS Code, you can also download the Inspect AI extension for viewing logs.
If you don’t want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
```bash
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
```

> [!NOTE]
> This implementation uses tool calling to enable the model to execute bash and python code. The system prompt from the paper has been modified accordingly, and also asks the model to reason step by step about each action it takes (this strategy is also explored in the paper).
Note that bash and python code is executed inside a Docker container built from this Dockerfile, so you will need to install Docker Engine in order to run the evaluation. The Dockerfile used is based on the original one published with the paper, but includes some additional Debian and Python packages commonly used to solve the tasks.
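For reference, the sketch below shows roughly how such a tool-calling task can be wired up with Inspect's built-in `bash()` and `python()` tools and a Docker sandbox. The dataset path, system prompt, and scorer here are illustrative assumptions, not the actual implementation in this directory:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, system_message, use_tools
from inspect_ai.tool import bash, python

@task
def ctf_task_sketch():
    # Illustrative only: the dataset file, prompt text, and timeouts are
    # assumptions, not the values used by this implementation.
    return Task(
        dataset=json_dataset("challenges.json"),  # hypothetical dataset file
        solver=[
            system_message(
                "Solve the CTF challenge. Reason step by step before each action."
            ),
            use_tools([bash(timeout=180), python(timeout=180)]),
            generate(),
        ],
        scorer=includes(),  # checks whether the submission contains the flag
        sandbox="docker",   # bash/python tool calls run inside a Docker container
    )
```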
## Options
You can control a variety of options from the command line. For example:
```bash
uv run inspect eval inspect_evals/gdm_intercode_ctf --limit 10
uv run inspect eval inspect_evals/gdm_intercode_ctf --max-connections 10
uv run inspect eval inspect_evals/gdm_intercode_ctf --temperature 0.5
```

See `uv run inspect eval --help` for all available options.
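The same options can also be set when driving the evaluation from Python. A minimal sketch, assuming `eval()` forwards generation options such as `temperature` and `max_connections` as keyword arguments:

```python
from inspect_ai import eval
from inspect_evals.gdm_capabilities.intercode_ctf import gdm_intercode_ctf

# mirrors the CLI flags above: sample limit, connection cap, sampling temperature
eval(
    gdm_intercode_ctf,
    model="openai/gpt-5-nano",
    limit=10,
    max_connections=10,
    temperature=0.5,
)
```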
If you are experimenting, you might find it useful to evaluate only a random subset of the samples using the `shuffle` parameter and `--limit`:
```bash
$ inspect eval inspect_evals/gdm_intercode_ctf \
-T shuffle=true --limit 10 \
--model openai/gpt-4-turbo
```

There are two task parameters that define limits on the evaluation:
- `max_attempts` defines the number of incorrect submissions to allow before ending the challenges (defaults to 3).
- `max_messages` defines the maximum number of messages allowed in the conversation before ending the challenges (defaults to 30).
For example:
```bash
$ inspect eval inspect_evals/gdm_intercode_ctf \
-T max_attempts=5 -T max_messages=75 \
--model openai/gpt-4o
```
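The same limits can be passed as task arguments from Python. A sketch, assuming the `gdm_intercode_ctf` task function accepts the `-T` options above as keyword arguments:

```python
from inspect_ai import eval
from inspect_evals.gdm_capabilities.intercode_ctf import gdm_intercode_ctf

# -T task options map to keyword arguments of the task function
eval(
    gdm_intercode_ctf(max_attempts=5, max_messages=75),
    model="openai/gpt-4o",
)
```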
## Scoring

A simple accuracy is calculated over the datapoints.
## Differences from GDM Reference Paper
The GDM reference paper excluded 26 InterCode tasks according to their appendix, though the main text mentions only 19 excluded tasks. This implementation excludes 22 tasks that require internet access.
The discrepancies arise because GDM excluded some tasks that don’t require internet, while this implementation includes those but excludes some internet-requiring tasks that GDM included. The table below shows the discrepancies. Any task not mentioned in the table is included in both implementations:
| Task | GDM Excluded | Inspect Evals Excluded |
|---|---|---|
| 1 | ✓ | |
| 7 | ✓ | |
| 9 | ✓ | ✓ |
| 14 | ✓ | |
| 16 | ✓ | ✓ |
| 20 | ✓ | ✓ |
| 23 | ✓ | |
| 28 | ✓ | ✓ |
| 29 | ✓ | ✓ |
| 30 | ✓ | |
| 31 | ✓ | |
| 35 | ✓ | ✓ |
| 39 | ✓ | ✓ |
| 41 | ✓ | ✓ |
| 42 | ✓ | ✓ |
| 43 | ✓ | ✓ |
| 54 | ✓ | ✓ |
| 57 | ✓ | ✓ |
| 58 | ✓ | |
| 62 | ✓ | ✓ |
| 66 | ✓ | ✓ |
| 73 | ✓ | ✓ |
| 87 | ✓ | ✓ |
| 88 | ✓ | ✓ |
| 89 | ✓ | ✓ |
| 90 | ✓ | |
| 91 | ✓ | |
| 94 | ✓ | |
| 95 | ✓ | ✓ |
Table summary: The GDM paper excluded 7 tasks that don’t require internet (23, 30, 31, 58, 90, 91, 94). This implementation excludes 3 internet-requiring tasks that GDM included (1, 7, 14). Both exclude the same 19 internet-requiring tasks (9, 16, 20, 28, 29, 35, 39, 41, 42, 43, 54, 57, 62, 66, 73, 87, 88, 89, 95).