CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale

Cybersecurity
Agent

A large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. CyberGym includes 1,507 benchmark instances with historical vulnerabilities from 188 large software projects.


Set up the tasks directory

If you have the CyberGym dataset, make a link to it:

ln -s /data/cybergym/data src/inspect_evals/cybergym/tasks

For example, the file inspect_evals/src/inspect_evals/cybergym/tasks/arvo/47101/description.txt should then exist.

If you do not have the dataset, the evaluation will download it from Hugging Face. The full dataset is 236 GB, so use e.g. inspect eval inspect_evals/cybergym -T eval_names='["cybergym_arvo_47101",...]' to download only the subset you need.
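If you script subset runs, the -T flag value is easiest to build from a Python list and pass through a shell-safe join. A minimal sketch (only cybergym_arvo_47101 is named in this README; any other task IDs you add are up to you):

```python
import json
import shlex

# Task IDs to run; cybergym_arvo_47101 is the example used in this README.
eval_names = ["cybergym_arvo_47101"]

# `inspect eval` accepts -T key=value task parameters; eval_names takes a JSON list.
flag = f"eval_names={json.dumps(eval_names)}"
cmd = ["uv", "run", "inspect", "eval", "inspect_evals/cybergym", "-T", flag]

# shlex.join quotes the JSON so it survives the shell unmodified.
print(shlex.join(cmd))
```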

Example invocation

uv run inspect eval inspect_evals/cybergym --model openai/gpt-4.1 --limit 5

About the Docker Compose sandbox

We have the following containers:

  1. default, where the solver works
  2. cybergym-vulnerable, with the vulnerable program
  3. cybergym-fixed, with the fixed program
  4. cybergym, which routes submissions from the solver and scorer to the vulnerable and fixed versions of the program

The container and network topology, as a Mermaid diagram:

graph TB
    solver[/solver/] --> default[default]
    scorer[/scorer/] --> cybergym[cybergym]
    default --> cg_public([cg_public])
    cg_public --> cybergym
    cybergym --> cg_private([cg_private])
    cg_private --> cybergym_vulnerable[cybergym-vulnerable]
    cg_private --> cybergym_fixed[cybergym-fixed]
    cybergym --> proof_of_concept_data[(proof_of_concept_data)]
    proof_of_concept_data --> cybergym_vulnerable
    proof_of_concept_data --> cybergym_fixed
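The same topology can be sketched in Docker Compose terms. This is illustrative only: the service, network, and volume names come from the diagram above, but everything else (the internal: true flag, the /poc mount point) is an assumption, and the real compose file ships with the eval.

```yaml
# Schematic only -- not the eval's actual compose file.
services:
  default:                 # solver workspace
    networks: [cg_public]
  cybergym:                # routes submissions to both program builds
    networks: [cg_public, cg_private]
    volumes: [proof_of_concept_data:/poc]
  cybergym-vulnerable:
    networks: [cg_private]
    volumes: [proof_of_concept_data:/poc]
  cybergym-fixed:
    networks: [cg_private]
    volumes: [proof_of_concept_data:/poc]
networks:
  cg_public:
  cg_private:
    internal: true         # assumption: solver cannot reach the builds directly
volumes:
  proof_of_concept_data:
```

The point of the split is that the solver only ever talks to the cybergym router over cg_public; the vulnerable and fixed builds sit on the private network.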

Sample IDs

Sample IDs follow the format cybergym_{dataset}_{task_num} ({variant}), for example cybergym_arvo_47101 (level1) (note the space and parentheses).

{dataset} is arvo for ARVO or oss_fuzz for OSS-Fuzz.

{variant} is e.g. level1. See the VARIANTS table in dataset.py for what is provided to the agent in each variant. level0 and level1 are the hardest, providing the least information to the solver; level3 is the easiest, providing the most.
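The sample ID format above is easy to get subtly wrong (note the space before the parenthesized variant), so a small helper pair for building and parsing IDs can be useful. This is a sketch based on the format documented here, not code from the eval itself:

```python
import re

def make_sample_id(dataset: str, task_num: int, variant: str) -> str:
    # Format from the README: cybergym_{dataset}_{task_num} ({variant})
    return f"cybergym_{dataset}_{task_num} ({variant})"

def parse_sample_id(sample_id: str) -> tuple[str, int, str]:
    # dataset is "arvo" or "oss_fuzz" per the README.
    m = re.fullmatch(r"cybergym_(arvo|oss_fuzz)_(\d+) \((\w+)\)", sample_id)
    if m is None:
        raise ValueError(f"unrecognized sample id: {sample_id}")
    dataset, task_num, variant = m.groups()
    return dataset, int(task_num), variant
```

For example, make_sample_id("arvo", 47101, "level1") yields the ID "cybergym_arvo_47101 (level1)" quoted above.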

Evaluation Report

This evaluation is under construction. These preliminary results are on a subset of 10 samples (see subset).

| Model | Provider | Reproduced | Stderr | New Vuln | Stderr | Time |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-4.1 | OpenAI | 0.700 | 0.483 | 0.000 | 0.000 | 44m 58s |
| claude-opus-4.5-20251101 | Anthropic | 1.000 | 0.000 | 0.000 | 0.000 | 13m 11s |

In materials released with the paper, the authors describe 5 of the samples as easy and the other 5 as not easy.

Notes:

  • Preliminary results based on the level1 variant (10 samples for each model)
  • Reproduced: the agent generated a working proof-of-concept
  • New Vuln: the agent generated a working proof-of-concept that additionally affects the fixed version of the program
  • Version: 1-A.
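The Reproduced and New Vuln columns follow the semantics in the notes above. A minimal sketch of that logic (not the eval's actual scorer; the function name and boolean inputs are hypothetical):

```python
def score_poc(crashes_vulnerable: bool, crashes_fixed: bool) -> dict[str, float]:
    # Sketch of the scoring semantics from the notes (not the eval's real scorer):
    # "Reproduced": the PoC crashes the vulnerable build.
    # "New Vuln":  the PoC additionally crashes the fixed build,
    #              i.e. the patch does not stop it.
    reproduced = 1.0 if crashes_vulnerable else 0.0
    new_vuln = 1.0 if crashes_vulnerable and crashes_fixed else 0.0
    return {"reproduced": reproduced, "new_vuln": new_vuln}
```

Under this reading, a PoC that crashes only the fixed build scores zero on both metrics, since it never reproduced the original vulnerability.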