CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale
A large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. CyberGym includes 1,507 benchmark instances with historical vulnerabilities from 188 large software projects.
Overview
CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale
A large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. CyberGym includes 1,507 benchmark instances with historical vulnerabilities from 188 large software projects.
Set up the tasks directory
If you have the CyberGym dataset, make a link to it:
ln -s /data/cybergym/data src/inspect_evals/cybergym/tasksFor example, inspect_evals/src/inspect_evals/cybergym/tasks/arvo/47101/description.txt (this file) should exist.
Or, if you do not have the dataset, the evaluation will download it from Hugging Face. The full dataset is 236 GB. Use e.g. inspect eval inspect_evals/cybergym -T eval_names='["cybergym_arvo_47101",...]' to download only a subset.
Example invocation
uv run inspect eval inspect_evals/cybergym --model openai/gpt-4.1 --limit 5About the Docker Compose sandbox
We have the following containers:
default, where the solver workscybergym-vulnerable, with the vulnerable programcybergym-fixed, with the fixed programcybergym, which routes submissions from the solver and scorer to the vulnerable and fixed versions of the program
graph TB
solver[/solver/] --> default[default]
scorer[/scorer/] --> cybergym[cybergym]
default --> cg_public([cg_public])
cg_public --> cybergym
cybergym --> cg_private([cg_private])
cg_private --> cybergym_vulnerable[cybergym-vulnerable]
cg_private --> cybergym_fixed[cybergym-fixed]
cybergym --> proof_of_concept_data[(proof_of_concept_data)]
proof_of_concept_data --> cybergym_vulnerable
proof_of_concept_data --> cybergym_fixed
Sample IDs
Sample IDs follow the format cybergym_{dataset}_{task_num} ({variant}), for example cybergym_arvo_47101 (level1) (note the space and parentheses).
{dataset} is arvo for ARVO or oss_fuzz for OSS-Fuzz.
{variant} is e.g. level1. See VARIANTS table in dataset.py for what is provided to the agent in each. level0 and level1 are the hardest, with the least information provided to the solver, while level3 is the easiest, with the most information provided to the solver.
Evaluation Report
This evaluation is under construction. These preliminary results are on a subset of 10 samples (see subset).
| Model | Provider | Reproduced | Stderr | New Vuln | Stderr | Time |
|---|---|---|---|---|---|---|
| gpt-4.1 | OpenAI | 0.700 | 0.483 | 0.000 | 0.000 | 44m 58s |
| claude-opus-4.5-20251101 | Anthropic | 1.000 | 0.000 | 0.000 | 0.000 | 13m 11s |
In materials released with the paper, the authors describe 5 of the samples as easy and 5 samples as not easy.
Notes:
- Preliminary results based on 10 samples (level1 variant)
- Reproduced: Agent generated a working proof-of-concept
- New Vuln: Agent generated a working proof-of-concept that additionally affects the fixed version of the program
- Preliminary results based on level1 variant (gpt-4.1: 10 samples, claude-opus-4.5: 10 samples)
- Version: 1-A.