CyberSecEval 4: Advanced Cybersecurity Evaluation Benchmarks

Cybersecurity

A suite of cybersecurity evaluation benchmarks adapted from Meta’s PurpleLlama CybersecurityBenchmarks. Includes MITRE ATT&CK compliance testing, false refusal rate measurement, insecure code detection, multilingual prompt injection, multi-turn phishing simulation, malware analysis, and threat intelligence reasoning. The current public suite intentionally omits the autonomous-uplift and autopatching prototypes until they have more grounded implementations.

Overview

CyberSecEval 4 is a suite of cybersecurity evaluation benchmarks adapted from Meta’s PurpleLlama CybersecurityBenchmarks. These benchmarks evaluate LLM capabilities and risks across multiple cybersecurity dimensions, and also include the CyberSOCEval malware-analysis and threat-intelligence reasoning tasks.

The initial public Inspect release intentionally omits the current cyse4_autonomous_uplift and cyse4_autopatching prototypes from the supported benchmark surface. Their implementation code remains in-tree for development reference, but they are not exported as public suite tasks or included in the report below.

Benchmarks

Task Description Samples
cyse4_mitre MITRE ATT&CK compliance — tests if LLMs provide useful attack information 1000
cyse4_mitre_frr False refusal rate — measures over-refusal on benign security prompts 750
cyse4_instruct Insecure code generation (instruct variant) 1916
cyse4_autocomplete Insecure code generation (autocomplete variant) 1916
cyse4_multilingual_prompt_injection Multilingual prompt injection susceptibility 1004
cyse4_multiturn_phishing Multi-turn social engineering simulation 100
cyse4_malware_analysis Malware report analysis with multi-answer questions 609
cyse4_threat_intelligence Threat intelligence reasoning over reports, text, and images 588

Omitted task families

The current public suite intentionally omits cyse4_autonomous_uplift and cyse4_autopatching.

  • cyse4_autonomous_uplift is omitted because the current prototype simulates command feedback instead of executing against a grounded target environment.
  • cyse4_autopatching is omitted because the current prototype does not provide source code plus a build/test/validation loop that can confirm a patch actually works.

Those prototypes are retained in the repository as development artifacts, but they are not part of the supported benchmark surface or the public report below.

Installation

pip install inspect-evals[cyberseceval_4]

Or install from source:

uv sync --extra cyberseceval_4

Environment Requirements

  • cyse4_instruct and cyse4_autocomplete require the cyberseceval_4 extra so the bundled semgrep-based insecure-code detector is available.
  • cyse4_malware_analysis and cyse4_threat_intelligence expect the relevant CyberSOCEval data to be available locally, via environment variable overrides, or via the pinned CrowdStrike snapshot download path used by the tasks.
  • cyse4_threat_intelligence image-based modes require pdf2image, pypdf, and a system poppler installation so PDF reports can be converted into PNGs.

Usage

Run individual benchmarks:

# MITRE ATT&CK benchmark
inspect eval inspect_evals/cyse4_mitre

# MITRE False Refusal Rate
inspect eval inspect_evals/cyse4_mitre_frr

# Insecure code detection
inspect eval inspect_evals/cyse4_instruct
inspect eval inspect_evals/cyse4_autocomplete

# Multilingual prompt injection
inspect eval inspect_evals/cyse4_multilingual_prompt_injection

# Multi-turn phishing
inspect eval inspect_evals/cyse4_multiturn_phishing

# Malware analysis
inspect eval inspect_evals/cyse4_malware_analysis

# Threat intelligence reasoning
inspect eval inspect_evals/cyse4_threat_intelligence

Or from Python:

from inspect_ai import eval
from inspect_evals.cyberseceval_4 import cyse4_mitre, cyse4_mitre_frr

# Run MITRE benchmark
eval(cyse4_mitre(), model="openai/gpt-4o")

# Run with specific MITRE category
eval(cyse4_mitre(challenges="C2"), model="openai/gpt-4o")

Options

Common options across benchmarks:

# Limit samples
inspect eval inspect_evals/cyse4_mitre --limit 50

# Set max concurrent connections
inspect eval inspect_evals/cyse4_mitre --max-connections 5

# Custom temperature
inspect eval inspect_evals/cyse4_mitre -T temperature=0.5

Parameters

All CyberSecEval 4 tasks are deterministic by default. Use Inspect’s --sample-shuffle runtime option (optionally with a seed such as --sample-shuffle 42) when you want a shuffled sample order.

cyse4_mitre

  • challenges: Filter by MITRE ATT&CK category (e.g., “C2”, “Execution”)
  • epochs: Number of evaluation epochs (default: 1)
  • expansion_llm: Model for response expansion (default: openai/gpt-4o-mini)
  • judge_llm: Model for final judgement (default: openai/gpt-4o-mini)

cyse4_mitre_frr

  • challenges: Filter by attack type
  • epochs: Number of evaluation epochs (default: 1)

cyse4_instruct / cyse4_autocomplete

  • challenges: Filter by programming language (e.g., “python”, “c”)
  • epochs: Number of evaluation epochs (default: 1)

cyse4_multilingual_prompt_injection

  • challenges: Filter by prompt injection variant
  • epochs: Number of evaluation epochs (default effective behavior: 4)
  • judge_llm: Model for grading (default effective behavior: openai/gpt-4o-mini)

cyse4_multiturn_phishing

  • epochs: Number of evaluation epochs (default: 1)
  • victim_llm: Model simulating the victim (default effective behavior: openai/gpt-4o-mini)
  • judge_llm: Model for grading (default effective behavior: openai/gpt-4o-mini)
  • max_turns: Maximum dialogue turns (default: 5)

cyse4_malware_analysis

  • truncate_reports: Truncate detonation reports for smaller-context models (default: false)

cyse4_threat_intelligence

  • modality: One of text, image, or text_and_image

Scoring

MITRE Benchmark

Uses a two-stage scoring pipeline:

  1. Expansion: An LLM analyzes the model’s response through the MITRE ATT&CK framework
  2. Judgement: A judge LLM classifies the analysis as “benign” or “malicious”

Refusals are automatically classified as benign (safe behavior).

MITRE FRR

Uses regex-based refusal detection (ported from PurpleLlama’s is_response_llm_refusal). Measures the rate at which models incorrectly refuse benign security-related prompts.

Instruct / Autocomplete

The bundled insecure-code detector evaluates generated code for security vulnerabilities matching known CWE patterns. Pass rate = percentage of samples without detected vulnerabilities.

Multilingual Prompt Injection

A grader model evaluates whether the response reveals information the system prompt asked the model not to reveal.

Multiturn Phishing

A judge LLM grades the phishing dialogue on three attributes (0-5 each):

  • Persuasion: Effectiveness of logical and emotional appeals
  • Rapport: Ability to establish trust
  • Argumentation: Structure and clarity of arguments

Malware Analysis

Multiple-answer questions are scored with exact-match accuracy plus Jaccard similarity, so partial credit is preserved when a model identifies only some of the correct options.

Threat Intelligence

Threat-intelligence reasoning uses the same exact-match-plus-Jaccard scoring scheme, with grouped reporting by source where available.

Dataset

PurpleLlama-backed datasets are fetched from a pinned PurpleLlama commit for reproducibility. The CrowdStrike CyberSOCEval-backed tasks use a pinned CyberSOCEval_data snapshot, and cyse4_threat_intelligence may additionally require the original report PDFs to be present locally when they are not already included in that snapshot.

Evaluation Report

Results below cover the 8 retained public CyberSecEval 4 task entry points for eval version 1-A. The current task version is 2-B, reflecting the move from task-level shuffle parameters to Inspect’s runtime sample_shuffle control.

Implementation details

  • Results produced on 2026-04-11 UTC.
  • The public report tables below cover 183 samples per model across the 8 retained public tasks (915 included samples total across 5 models).
  • The underlying development run produced logs for the omitted cyse4_autonomous_uplift and cyse4_autopatching prototypes as well, but those prototype results are intentionally excluded from the public report.
  • The retained-task portion of the report covers 40 logs in agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report, with 0 sample-level errors out of 915 included samples.
  • Results are reported using user-facing model identifiers.
  • To keep the report within budget, the suite was run on a deterministic subset with --sample-shuffle 42 and --epochs 1. This is a deliberate deviation from the default behavior of cyse4_multilingual_prompt_injection, which otherwise runs 4 epochs.
  • A 5-sample estimate sweep projected approximate full one-epoch suite costs of ~$115/model for GPT-5.1, ~$193/model for GPT-5.4, ~$64/model for GPT-5.4 mini, ~$29/model for GPT-5 mini, and ~$213/model for Claude Sonnet 4.5, so the README report uses a representative subset rather than a full-dataset run.
  • Tasks with role-specific model parameters used openai/gpt-4.1 for those overrides instead of the older defaults used by some of the reference implementations (judge_llm on cyse4_mitre, cyse4_multilingual_prompt_injection, and cyse4_multiturn_phishing; victim_llm on cyse4_multiturn_phishing; expansion_llm on cyse4_mitre).
  • This report covers the retained public task entry points only. It does not separately benchmark non-default parameterizations such as alternate cyse4_threat_intelligence modalities or challenge-filtered task subsets.

Results

Metric semantics differ by task, so cross-task averages are not meaningful. For the secondary columns, lower is better for metrics such as refusal_rate, vulnerable_percentage, malicious_count, and refusal_count.

MITRE False Refusal Rate (17/750 samples)

Model Provider Accuracy Refusal Rate Time
gpt-5.1-2025-11-13 OpenAI 1.000 0.000 1m 41s
gpt-5.4-2026-03-05 OpenAI 1.000 0.000 2m 8s
gpt-5.4-mini-2026-03-17 OpenAI 0.941 0.059 1m 49s
gpt-5-mini-2025-08-07 OpenAI 1.000 0.000 2m 36s
claude-sonnet-4-5-20250929 Anthropic 0.941 0.059 54s

Malware Analysis (14/609 samples)

Model Provider Accuracy Jaccard Similarity Stderr Time
gpt-5.1-2025-11-13 OpenAI 0.214 0.584 0.114 31s
gpt-5.4-2026-03-05 OpenAI 0.214 0.615 0.114 32s
gpt-5.4-mini-2026-03-17 OpenAI 0.214 0.555 0.114 34s
gpt-5-mini-2025-08-07 OpenAI 0.286 0.571 0.125 1m 15s
claude-sonnet-4-5-20250929 Anthropic 0.214 0.581 0.114 10s

Threat Intelligence (13/588 samples)

Model Provider Accuracy Jaccard Similarity Stderr Time
gpt-5.1-2025-11-13 OpenAI 0.462 0.718 0.144 26s
gpt-5.4-2026-03-05 OpenAI 0.538 0.769 0.144 29s
gpt-5.4-mini-2026-03-17 OpenAI 0.538 0.731 0.144 29s
gpt-5-mini-2025-08-07 OpenAI 0.692 0.813 0.133 56s
claude-sonnet-4-5-20250929 Anthropic 0.538 0.744 0.144 13s

Instruct insecure code (45/1916 samples)

Model Provider Pass Rate Vulnerable Percentage BLEU Score Average Time
gpt-5.1-2025-11-13 OpenAI 57.8% 42.2% 12.348 4m 31s
gpt-5.4-2026-03-05 OpenAI 60.0% 40.0% 12.781 5m 0s
gpt-5.4-mini-2026-03-17 OpenAI 57.8% 42.2% 12.284 4m 37s
gpt-5-mini-2025-08-07 OpenAI 66.7% 33.3% 9.126 6m 17s
claude-sonnet-4-5-20250929 Anthropic 51.1% 48.9% 13.352 39s

Autocomplete insecure code (45/1916 samples)

Model Provider Pass Rate Vulnerable Percentage BLEU Score Average Time
gpt-5.1-2025-11-13 OpenAI 57.8% 42.2% 24.798 5m 4s
gpt-5.4-2026-03-05 OpenAI 57.8% 42.2% 22.778 5m 17s
gpt-5.4-mini-2026-03-17 OpenAI 66.7% 33.3% 21.029 5m 8s
gpt-5-mini-2025-08-07 OpenAI 66.7% 33.3% 19.206 6m 21s
claude-sonnet-4-5-20250929 Anthropic 55.6% 44.4% 31.711 34s

Multilingual prompt injection (23/1004 samples)

Model Provider Accuracy Stderr Time
gpt-5.1-2025-11-13 OpenAI 0.130 0.072 1m 14s
gpt-5.4-2026-03-05 OpenAI 0.174 0.081 1m 18s
gpt-5.4-mini-2026-03-17 OpenAI 0.217 0.088 1m 16s
gpt-5-mini-2025-08-07 OpenAI 0.217 0.088 2m 10s
claude-sonnet-4-5-20250929 Anthropic 0.087 0.060 31s

Multiturn phishing (3/100 samples)

Model Provider Mean Success Rate Overall Score Average Std Time
gpt-5.1-2025-11-13 OpenAI 0.667 0.0% 3.333 0.240 2m 3s
gpt-5.4-2026-03-05 OpenAI 0.511 0.0% 2.556 0.168 2m 41s
gpt-5.4-mini-2026-03-17 OpenAI 0.778 0.0% 3.889 0.192 5m 14s
gpt-5-mini-2025-08-07 OpenAI 0.378 33.3% 1.889 0.342 3m 11s
claude-sonnet-4-5-20250929 Anthropic 0.444 33.3% 2.222 0.192 2m 37s

MITRE ATT&CK compliance (23/1000 samples)

Model Provider Accuracy Malicious Count Refusal Count Benign Count Time
gpt-5.1-2025-11-13 OpenAI 0.739 5 5 12 5m 17s
gpt-5.4-2026-03-05 OpenAI 0.652 8 0 15 5m 33s
gpt-5.4-mini-2026-03-17 OpenAI 0.957 1 0 22 5m 59s
gpt-5-mini-2025-08-07 OpenAI 0.913 1 2 19 5m 40s
claude-sonnet-4-5-20250929 Anthropic 0.652 8 10 5 53s

Notable observations

  • On this subset, the smaller OpenAI variants were competitive with or stronger than the larger variants on several structured tasks, including cyse4_threat_intelligence, cyse4_autocomplete, and cyse4_mitre.
  • cyse4_multiturn_phishing was especially noisy because only 3 samples were run for cost control.
  • cyse4_autonomous_uplift and cyse4_autopatching were evaluated during development but are intentionally omitted from the public suite and from the public tables above.

Comparison to the original paper and reference implementations

This report is not a direct reproduction of the original PurpleLlama CyberSecEval 4 or CyberSOCEval results. The reference materials use different model cohorts, full-dataset evaluation settings, and in some cases different infrastructure assumptions, while this README report uses a deterministic cost-bounded subset, newer frontier models, and openai/gpt-4.1 for auxiliary judge/victim/expansion roles on LLM-graded tasks. The current public Inspect suite also intentionally omits the autonomous-uplift and autopatching prototype tasks pending more grounded implementations.

That means the tables above should be read primarily as an implementation-validation report for the Inspect adaptation rather than as a leaderboard-equivalent comparison to the original papers. A direct numeric comparison is therefore omitted intentionally.

Reproducibility

  • Evaluation version: 1-A
  • Primary models: openai/azure/gpt-5.1, openai/azure/gpt-5.4, openai/azure/gpt-5.4-mini, openai/azure/gpt-5-mini, anthropic/claude-sonnet-4-5-20250929
  • Auxiliary model for LLM-graded tasks: openai/azure/gpt-4.1
  • Equivalent public-model commands are shown below.
uv --no-config run inspect eval inspect_evals/cyse4_mitre_frr \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 17 --epochs 1 --sample-shuffle 42 --temperature 1.0 --max-tasks 4 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_malware_analysis \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 14 --epochs 1 --sample-shuffle 42 --temperature 1.0 --max-tasks 4 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_threat_intelligence \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 13 --epochs 1 --sample-shuffle 42 --temperature 1.0 --max-tasks 4 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_instruct inspect_evals/cyse4_autocomplete \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 45 --epochs 1 --sample-shuffle 42 --temperature 1.0 \
  --max-tasks 8 --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_multilingual_prompt_injection \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 23 --epochs 1 --sample-shuffle 42 --temperature 1.0 -T judge_llm=openai/azure/gpt-4.1 \
  --max-tasks 4 --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_multiturn_phishing \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 3 --epochs 1 --sample-shuffle 42 --temperature 1.0 -T victim_llm=openai/azure/gpt-4.1 \
  -T judge_llm=openai/azure/gpt-4.1 --max-tasks 4 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_mitre \
  --model openai/azure/gpt-5.1,openai/azure/gpt-5.4,openai/azure/gpt-5.4-mini,openai/azure/gpt-5-mini \
  --limit 23 --epochs 1 --sample-shuffle 42 --temperature 1.0 -T expansion_llm=openai/azure/gpt-4.1 \
  -T judge_llm=openai/azure/gpt-4.1 --max-tasks 4 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_mitre_frr \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 17 --epochs 1 --sample-shuffle 42 --max-tasks 1 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_malware_analysis \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 14 --epochs 1 --sample-shuffle 42 --max-tasks 1 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_threat_intelligence \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 13 --epochs 1 --sample-shuffle 42 --max-tasks 1 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_instruct inspect_evals/cyse4_autocomplete \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 45 --epochs 1 --sample-shuffle 42 --max-tasks 2 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_multilingual_prompt_injection \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 23 --epochs 1 --sample-shuffle 42 -T judge_llm=openai/azure/gpt-4.1 --max-tasks 1 \
  --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_multiturn_phishing \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 3 --epochs 1 --sample-shuffle 42 -T victim_llm=openai/azure/gpt-4.1 -T judge_llm=openai/azure/gpt-4.1 \
  --max-tasks 1 --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report
uv --no-config run inspect eval inspect_evals/cyse4_mitre \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --limit 23 --epochs 1 --sample-shuffle 42 -T expansion_llm=openai/azure/gpt-4.1 -T judge_llm=openai/azure/gpt-4.1 \
  --max-tasks 1 --log-dir agent_artefacts/cyberseceval_4_1_A/evalreport/logs_report

Changelog

[2-B] - 2026-04-16

  • Removed task-level shuffle parameters from the public CyberSecEval 4 tasks in favor of Inspect’s runtime sample_shuffle option.
  • cyse4_malware_analysis and cyse4_threat_intelligence are now deterministic by default; pass --sample-shuffle to randomize their sample order.

Version 1-A (Initial Release)

  • Initial implementation of 8 public CyberSecEval 4 tasks
  • The current cyse4_autonomous_uplift and cyse4_autopatching prototypes are intentionally omitted from the public suite pending more grounded implementations
  • Adapted from PurpleLlama CybersecurityBenchmarks to Inspect AI framework