CyberSecEval 4: Advanced Cybersecurity Evaluation Benchmarks

Cybersecurity

inspect-evals

A suite of cybersecurity evaluation benchmarks adapted from Meta’s PurpleLlama CybersecurityBenchmarks. Includes MITRE ATT&CK compliance testing, false refusal rate measurement, insecure code detection, multilingual prompt injection, multi-turn phishing simulation, malware analysis, and threat intelligence reasoning. The current public suite intentionally omits the autonomous-uplift and autopatching prototypes until they have more grounded implementations.

Contributed By

@ckane

Code

src/inspect_evals/cyberseceval_4

Paper

https://arxiv.org/abs/2404.13161

Overview

CyberSecEval 4 is a suite of cybersecurity evaluation benchmarks adapted from Meta’s PurpleLlama CybersecurityBenchmarks. These benchmarks evaluate LLM capabilities and risks across multiple cybersecurity dimensions, and also include the CyberSOCEval malware-analysis and threat-intelligence reasoning tasks.

The initial public Inspect release intentionally omits the current cyse4_autonomous_uplift and cyse4_autopatching prototypes from the supported benchmark surface. Their implementation code remains in-tree for development reference, but they are not exported as public suite tasks or included in the report below.

Benchmarks

Task	Description	Samples
`cyse4_mitre`	MITRE ATT&CK compliance — tests if LLMs provide useful attack information	1000
`cyse4_mitre_frr`	False refusal rate — measures over-refusal on benign security prompts	750
`cyse4_instruct`	Insecure code generation (instruct variant)	1916
`cyse4_autocomplete`	Insecure code generation (autocomplete variant)	1916
`cyse4_multilingual_prompt_injection`	Multilingual prompt injection susceptibility	1004
`cyse4_multiturn_phishing`	Multi-turn social engineering simulation	100
`cyse4_malware_analysis`	Malware report analysis with multi-answer questions	609
`cyse4_threat_intelligence`	Threat intelligence reasoning over reports, text, and images	588

Omitted task families

The current public suite intentionally omits cyse4_autonomous_uplift and cyse4_autopatching.

cyse4_autonomous_uplift is omitted because the current prototype simulates command feedback instead of executing against a grounded target environment.
cyse4_autopatching is omitted because the current prototype does not provide source code plus a build/test/validation loop that can confirm a patch actually works.

Those prototypes are retained in the repository as development artifacts, but they are not part of the supported benchmark surface or the public report below.

Installation

pip install inspect-evals[cyberseceval_4]

Or install from source:

uv sync --extra cyberseceval_4

Environment Requirements

cyse4_instruct and cyse4_autocomplete require the cyberseceval_4 extra so the bundled semgrep-based insecure-code detector is available.
cyse4_malware_analysis and cyse4_threat_intelligence expect the relevant CyberSOCEval data to be available locally, via environment variable overrides, or via the pinned CrowdStrike snapshot download path used by the tasks.
cyse4_threat_intelligence image-based modes require pdf2image, pypdf, and a system poppler installation so PDF reports can be converted into PNGs.

Usage

Run individual benchmarks:

# MITRE ATT&CK benchmark
inspect eval inspect_evals/cyse4_mitre

# MITRE False Refusal Rate
inspect eval inspect_evals/cyse4_mitre_frr

# Insecure code detection
inspect eval inspect_evals/cyse4_instruct
inspect eval inspect_evals/cyse4_autocomplete

# Multilingual prompt injection
inspect eval inspect_evals/cyse4_multilingual_prompt_injection

# Multi-turn phishing
inspect eval inspect_evals/cyse4_multiturn_phishing

# Malware analysis
inspect eval inspect_evals/cyse4_malware_analysis

# Threat intelligence reasoning
inspect eval inspect_evals/cyse4_threat_intelligence

Or from Python:

from inspect_ai import eval
from inspect_evals.cyberseceval_4 import cyse4_mitre, cyse4_mitre_frr

# Run MITRE benchmark
eval(cyse4_mitre(), model="openai/gpt-4o")

# Run with specific MITRE category
eval(cyse4_mitre(challenges="C2"), model="openai/gpt-4o")

Options

Common options across benchmarks:

# Limit samples
inspect eval inspect_evals/cyse4_mitre --limit 50

# Set max concurrent connections
inspect eval inspect_evals/cyse4_mitre --max-connections 5

# Custom temperature
inspect eval inspect_evals/cyse4_mitre -T temperature=0.5

Parameters

All CyberSecEval 4 tasks are deterministic by default. Use Inspect’s --sample-shuffle runtime option (optionally with a seed such as --sample-shuffle 42) when you want a shuffled sample order.

cyse4_mitre

challenges: Filter by MITRE ATT&CK category (e.g., “C2”, “Execution”)
epochs: Number of evaluation epochs (default: 1)
expansion_llm: Model for response expansion (default: openai/gpt-4o-mini)
judge_llm: Model for final judgement (default: openai/gpt-4o-mini)

cyse4_mitre_frr

challenges: Filter by attack type
epochs: Number of evaluation epochs (default: 1)

cyse4_instruct / cyse4_autocomplete

challenges: Filter by programming language (e.g., “python”, “c”)
epochs: Number of evaluation epochs (default: 1)

cyse4_multilingual_prompt_injection

challenges: Filter by prompt injection variant
epochs: Number of evaluation epochs (default effective behavior: 4)
judge_llm: Model for grading (default effective behavior: openai/gpt-4o-mini)

cyse4_multiturn_phishing

epochs: Number of evaluation epochs (default: 1)
victim_llm: Model simulating the victim (default effective behavior: openai/gpt-4o-mini)
judge_llm: Model for grading (default effective behavior: openai/gpt-4o-mini)
max_turns: Maximum dialogue turns (default: 5)

cyse4_malware_analysis

truncate_reports: Truncate detonation reports for smaller-context models (default: false)

cyse4_threat_intelligence

modality: One of text, image, or text_and_image

Scoring

MITRE Benchmark

Uses a two-stage scoring pipeline:

Expansion: An LLM analyzes the model’s response through the MITRE ATT&CK framework
Judgement: A judge LLM classifies the analysis as “benign” or “malicious”

Refusals are automatically classified as benign (safe behavior).

MITRE FRR

Uses regex-based refusal detection (ported from PurpleLlama’s is_response_llm_refusal). Measures the rate at which models incorrectly refuse benign security-related prompts.

Instruct / Autocomplete

The bundled insecure-code detector evaluates generated code for security vulnerabilities matching known CWE patterns. Pass rate = percentage of samples without detected vulnerabilities.

Multilingual Prompt Injection

A grader model evaluates whether the response reveals information the system prompt asked the model not to reveal.

Multiturn Phishing

A judge LLM grades the phishing dialogue on three attributes (0-5 each):

Persuasion: Effectiveness of logical and emotional appeals
Rapport: Ability to establish trust
Argumentation: Structure and clarity of arguments

Malware Analysis

Multiple-answer questions are scored with exact-match accuracy plus Jaccard similarity, so partial credit is preserved when a model identifies only some of the correct options.

Threat Intelligence

Threat-intelligence reasoning uses the same exact-match-plus-Jaccard scoring scheme, with grouped reporting by source where available.

Dataset

PurpleLlama-backed datasets are fetched from a pinned PurpleLlama commit for reproducibility. The CrowdStrike CyberSOCEval-backed tasks use a pinned CyberSOCEval_data snapshot, and cyse4_threat_intelligence may additionally require the original report PDFs to be present locally when they are not already included in that snapshot.

Evaluation Report

Results below cover the 8 retained public CyberSecEval 4 task entry points for eval version 1-A. The current task version is 2-B, reflecting the move from task-level shuffle parameters to Inspect’s runtime sample_shuffle control.

Implementation details

Results produced on 2026-04-11 UTC.
The public report tables below cover 183 samples per model across the 8 retained public tasks (915 included samples total across 5 models).
The underlying development run produced logs for the omitted cyse4_autonomous_uplift and cyse4_autopatching prototypes as well, but those prototype results are intentionally excluded from the public report.
The retained-task portion of the report covers 40 logs in agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report, with 0 sample-level errors out of 915 included samples.
Results are reported using user-facing model identifiers.
To keep the report within budget, the suite was run on a deterministic subset with --sample-shuffle 42 and --epochs 1. This is a deliberate deviation from the default behavior of cyse4_multilingual_prompt_injection, which otherwise runs 4 epochs.
A 5-sample estimate sweep projected approximate full one-epoch suite costs of ~$115/model for GPT-5.1, ~$193/model for GPT-5.4, ~$64/model for GPT-5.4 mini, ~$29/model for GPT-5 mini, and ~$213/model for Claude Sonnet 4.5, so the README report uses a representative subset rather than a full-dataset run.
Tasks with role-specific model parameters used openai/gpt-4.1 for those overrides instead of the older defaults used by some of the reference implementations (judge_llm on cyse4_mitre, cyse4_multilingual_prompt_injection, and cyse4_multiturn_phishing; victim_llm on cyse4_multiturn_phishing; expansion_llm on cyse4_mitre).
This report covers the retained public task entry points only. It does not separately benchmark non-default parameterizations such as alternate cyse4_threat_intelligence modalities or challenge-filtered task subsets.

Results

Metric semantics differ by task, so cross-task averages are not meaningful. For the secondary columns, lower is better for metrics such as refusal_rate, vulnerable_percentage, malicious_count, and refusal_count.

MITRE False Refusal Rate (17/750 samples)

Model	Provider	Accuracy	Refusal Rate	Time
gpt-5.1-2025-11-13	OpenAI	1.000	0.000	1m 41s
gpt-5.4-2026-03-05	OpenAI	1.000	0.000	2m 8s
gpt-5.4-mini-2026-03-17	OpenAI	0.941	0.059	1m 49s
gpt-5-mini-2025-08-07	OpenAI	1.000	0.000	2m 36s
claude-sonnet-4-5-20250929	Anthropic	0.941	0.059	54s

Malware Analysis (14/609 samples)

Model	Provider	Accuracy	Jaccard Similarity	Stderr	Time
gpt-5.1-2025-11-13	OpenAI	0.214	0.584	0.114	31s
gpt-5.4-2026-03-05	OpenAI	0.214	0.615	0.114	32s
gpt-5.4-mini-2026-03-17	OpenAI	0.214	0.555	0.114	34s
gpt-5-mini-2025-08-07	OpenAI	0.286	0.571	0.125	1m 15s
claude-sonnet-4-5-20250929	Anthropic	0.214	0.581	0.114	10s

Threat Intelligence (13/588 samples)

Model	Provider	Accuracy	Jaccard Similarity	Stderr	Time
gpt-5.1-2025-11-13	OpenAI	0.462	0.718	0.144	26s
gpt-5.4-2026-03-05	OpenAI	0.538	0.769	0.144	29s
gpt-5.4-mini-2026-03-17	OpenAI	0.538	0.731	0.144	29s
gpt-5-mini-2025-08-07	OpenAI	0.692	0.813	0.133	56s
claude-sonnet-4-5-20250929	Anthropic	0.538	0.744	0.144	13s

Instruct insecure code (45/1916 samples)

Model	Provider	Pass Rate	Vulnerable Percentage	BLEU Score Average	Time
gpt-5.1-2025-11-13	OpenAI	57.8%	42.2%	12.348	4m 31s
gpt-5.4-2026-03-05	OpenAI	60.0%	40.0%	12.781	5m 0s
gpt-5.4-mini-2026-03-17	OpenAI	57.8%	42.2%	12.284	4m 37s
gpt-5-mini-2025-08-07	OpenAI	66.7%	33.3%	9.126	6m 17s
claude-sonnet-4-5-20250929	Anthropic	51.1%	48.9%	13.352	39s

Autocomplete insecure code (45/1916 samples)

Model	Provider	Pass Rate	Vulnerable Percentage	BLEU Score Average	Time
gpt-5.1-2025-11-13	OpenAI	57.8%	42.2%	24.798	5m 4s
gpt-5.4-2026-03-05	OpenAI	57.8%	42.2%	22.778	5m 17s
gpt-5.4-mini-2026-03-17	OpenAI	66.7%	33.3%	21.029	5m 8s
gpt-5-mini-2025-08-07	OpenAI	66.7%	33.3%	19.206	6m 21s
claude-sonnet-4-5-20250929	Anthropic	55.6%	44.4%	31.711	34s

Multilingual prompt injection (23/1004 samples)

Model	Provider	Accuracy	Stderr	Time
gpt-5.1-2025-11-13	OpenAI	0.130	0.072	1m 14s
gpt-5.4-2026-03-05	OpenAI	0.174	0.081	1m 18s
gpt-5.4-mini-2026-03-17	OpenAI	0.217	0.088	1m 16s
gpt-5-mini-2025-08-07	OpenAI	0.217	0.088	2m 10s
claude-sonnet-4-5-20250929	Anthropic	0.087	0.060	31s

Multiturn phishing (3/100 samples)

Model	Provider	Mean	Success Rate	Overall Score Average	Std	Time
gpt-5.1-2025-11-13	OpenAI	0.667	0.0%	3.333	0.240	2m 3s
gpt-5.4-2026-03-05	OpenAI	0.511	0.0%	2.556	0.168	2m 41s
gpt-5.4-mini-2026-03-17	OpenAI	0.778	0.0%	3.889	0.192	5m 14s
gpt-5-mini-2025-08-07	OpenAI	0.378	33.3%	1.889	0.342	3m 11s
claude-sonnet-4-5-20250929	Anthropic	0.444	33.3%	2.222	0.192	2m 37s

MITRE ATT&CK compliance (23/1000 samples)

Model	Provider	Accuracy	Malicious Count	Refusal Count	Benign Count	Time
gpt-5.1-2025-11-13	OpenAI	0.739	5	5	12	5m 17s
gpt-5.4-2026-03-05	OpenAI	0.652	8	0	15	5m 33s
gpt-5.4-mini-2026-03-17	OpenAI	0.957	1	0	22	5m 59s
gpt-5-mini-2025-08-07	OpenAI	0.913	1	2	19	5m 40s
claude-sonnet-4-5-20250929	Anthropic	0.652	8	10	5	53s

Notable observations

On this subset, the smaller OpenAI variants were competitive with or stronger than the larger variants on several structured tasks, including cyse4_threat_intelligence, cyse4_autocomplete, and cyse4_mitre.
cyse4_multiturn_phishing was especially noisy because only 3 samples were run for cost control.
cyse4_autonomous_uplift and cyse4_autopatching were evaluated during development but are intentionally omitted from the public suite and from the public tables above.

Comparison to the original paper and reference implementations

This report is not a direct reproduction of the original PurpleLlama CyberSecEval 4 or CyberSOCEval results. The reference materials use different model cohorts, full-dataset evaluation settings, and in some cases different infrastructure assumptions, while this README report uses a deterministic cost-bounded subset, newer frontier models, and openai/gpt-4.1 for auxiliary judge/victim/expansion roles on LLM-graded tasks. The current public Inspect suite also intentionally omits the autonomous-uplift and autopatching prototype tasks pending more grounded implementations.

That means the tables above should be read primarily as an implementation-validation report for the Inspect adaptation rather than as a leaderboard-equivalent comparison to the original papers. A direct numeric comparison is therefore omitted intentionally.

Reproducibility

Evaluation version: 1-A
Primary models: openai/azure/gpt-5.1, openai/azure/gpt-5.4, openai/azure/gpt-5.4-mini, openai/azure/gpt-5-mini, anthropic/claude-sonnet-4-5-20250929
Auxiliary model for LLM-graded tasks: openai/azure/gpt-4.1
Equivalent public-model commands are shown below.

uv --no-config run inspect eval inspect_evals/cyse4_mitre_frr \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 17 --epochs 1 --sample-shuffle 42 --temperature 1.0 --max-tasks 4 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_malware_analysis \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 14 --epochs 1 --sample-shuffle 42 --temperature 1.0 --max-tasks 4 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_threat_intelligence \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 13 --epochs 1 --sample-shuffle 42 --temperature 1.0 --max-tasks 4 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_instruct inspect_evals/cyse4_autocomplete \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 45 --epochs 1 --sample-shuffle 42 --temperature 1.0 \
  --max-tasks 8 --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_multilingual_prompt_injection \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 23 --epochs 1 --sample-shuffle 42 --temperature 1.0 -T judge_llm=openai/azure/gpt-4.1 \
  --max-tasks 4 --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_multiturn_phishing \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 3 --epochs 1 --sample-shuffle 42 --temperature 1.0 -T victim_llm=openai/azure/gpt-4.1 \
  -T judge_llm=openai/azure/gpt-4.1 --max-tasks 4 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_mitre \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 23 --epochs 1 --sample-shuffle 42 --temperature 1.0 -T expansion_llm=openai/azure/gpt-4.1 \
  -T judge_llm=openai/azure/gpt-4.1 --max-tasks 4 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report

uv --no-config run inspect eval inspect_evals/cyse4_mitre_frr \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 17 --epochs 1 --sample-shuffle 42 --max-tasks 1 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_malware_analysis \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 14 --epochs 1 --sample-shuffle 42 --max-tasks 1 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_threat_intelligence \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 13 --epochs 1 --sample-shuffle 42 --max-tasks 1 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_instruct inspect_evals/cyse4_autocomplete \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 45 --epochs 1 --sample-shuffle 42 --max-tasks 2 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_multilingual_prompt_injection \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 23 --epochs 1 --sample-shuffle 42 -T judge_llm=openai/azure/gpt-4.1 --max-tasks 1 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_multiturn_phishing \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 3 --epochs 1 --sample-shuffle 42 -T victim_llm=openai/azure/gpt-4.1 -T judge_llm=openai/azure/gpt-4.1 \
  --max-tasks 1 --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_mitre \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 23 --epochs 1 --sample-shuffle 42 -T expansion_llm=openai/azure/gpt-4.1 -T judge_llm=openai/azure/gpt-4.1 \
  --max-tasks 1 --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report

Changelog

[3-B] - 2026-04-21

cyse4_multiturn_phishing, cyse4_autopatching, and cyse4_autonomous_uplift scorers now tolerate judge completions that wrap the JSON payload in Markdown code fences or prefix it with prose; previously these raised json.JSONDecodeError and were silently recorded as zero / False, most visibly depressing autopatching accuracy with judges such as openai/gpt-4o-mini.
Judge completions that still fail all parsing fallbacks are now logged as warnings (with a 200-char preview) instead of being silently zeroed.

[2-B] - 2026-04-16

Removed task-level shuffle parameters from the public CyberSecEval 4 tasks in favor of Inspect’s runtime sample_shuffle option.
cyse4_malware_analysis and cyse4_threat_intelligence are now deterministic by default; pass --sample-shuffle to randomize their sample order.

Version 1-A (Initial Release)

Initial implementation of 8 public CyberSecEval 4 tasks
The current cyse4_autonomous_uplift and cyse4_autopatching prototypes are intentionally omitted from the public suite pending more grounded implementations
Adapted from PurpleLlama CybersecurityBenchmarks to Inspect AI framework