CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale

Cybersecurity

inspect-evals

Agent

A large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. CyberGym includes 1,507 benchmark instances with historical vulnerabilities from 188 large software projects.

Contributed By

Code

src/inspect_evals/cybergym

Paper

https://arxiv.org/abs/2506.02548

Overview

CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale

A large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. CyberGym includes 1,507 benchmark instances with historical vulnerabilities from 188 large software projects.

Set up the tasks directory

If you have the CyberGym dataset, make a link to it:

ln -s /data/cybergym/data src/inspect_evals/cybergym/tasks

For example, inspect_evals/src/inspect_evals/cybergym/tasks/arvo/47101/description.txt (this file) should exist.

Or, if you do not have the dataset, the evaluation will download it from Hugging Face. The full dataset is 236 GB. Use e.g. inspect eval inspect_evals/cybergym -T eval_names='["cybergym_arvo_47101",...]' to download only a subset.

Example invocation

uv run inspect eval inspect_evals/cybergym --model openai/gpt-4.1 --limit 5

About the Docker Compose sandbox

We have the following containers:

default, where the solver works
cybergym-vulnerable, with the vulnerable program
cybergym-fixed, with the fixed program
cybergym, which routes submissions from the solver and scorer to the vulnerable and fixed versions of the program

graph TB
    solver[/solver/] --> default[default]
    scorer[/scorer/] --> cybergym[cybergym]
    default --> cg_public([cg_public])
    cg_public --> cybergym
    cybergym --> cg_private([cg_private])
    cg_private --> cybergym_vulnerable[cybergym-vulnerable]
    cg_private --> cybergym_fixed[cybergym-fixed]
    cybergym --> proof_of_concept_data[(proof_of_concept_data)]
    proof_of_concept_data --> cybergym_vulnerable
    proof_of_concept_data --> cybergym_fixed

Sample IDs

Sample IDs follow the format cybergym_{dataset}_{task_num} ({variant}), for example cybergym_arvo_47101 (level1) (note the space and parentheses).

{dataset} is arvo for ARVO or oss_fuzz for OSS-Fuzz.

{variant} is e.g. level1. See VARIANTS table in dataset.py for what is provided to the agent in each. level0 and level1 are the hardest, with the least information provided to the solver, while level3 is the easiest, with the most information provided to the solver.

Evaluation Report

Model	Provider	Reproduced	New Vuln	Time
gpt-4.1-2025-04-14	OpenAI	0.1161	0.0312	11h 3m 26s

Metrics:

Reproduced: Agent generated a working proof-of-concept
New Vuln: Agent generated a working proof-of-concept that additionally affects the fixed version of the program

Run details:

Eval version: 2-A
Results produced: May 2026
Samples: 1507 / 1507 (full level1 variant)

Command (we ran subsets in separate tasks):

inspect eval inspect_evals/cybergym \
  --model openai/gpt-4.1 \
  --max-samples (varying) \
  --message-limit 202 \
  -T eval_names="(subsets)"

We intend for the 202-message limit to match the paper’s limit of 100 agent iterations.
We ran some samples with higher INSPECT_SANDBOX_MAX_EXEC_OUTPUT_SIZE.

Comparison to paper:

The CyberGym paper reports a gpt-4.1 success rate of 9.4% on level1 using the best-performing agent.

This task runs the solver sandbox without internet access, so the agent cannot install new software. However, the default sandbox image does come with some cybersecurity tools preinstalled.

Changelog

[2-A] - 2026-04-03

Fixed a bug where the wrong proof-of-concept is tested. Results from previous versions are invalid.