SEvenLLM: A benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for security events.

Cybersecurity

Designed for analyzing cybersecurity incidents, the benchmark comprises two primary task categories, understanding and generation, which break down further into 28 task subcategories.

Overview

SEvenLLM is a benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for security events.

Usage

First, install the dependencies:

uv sync --extra sevenllm

Then, evaluate against one or more models with:

uv run inspect eval inspect_evals/sevenllm_mcq_zh --model openai/gpt-5-nano
uv run inspect eval inspect_evals/sevenllm_mcq_en --model openai/gpt-5-nano
uv run inspect eval inspect_evals/sevenllm_qa_zh --model openai/gpt-5-nano
uv run inspect eval inspect_evals/sevenllm_qa_en --model openai/gpt-5-nano
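For a quick smoke test, you can restrict a run to a handful of samples with the --limit flag (a standard Inspect CLI option; the sample count here is just an illustration):

# Run only the first 10 samples of the English QA task
uv run inspect eval inspect_evals/sevenllm_qa_en --model openai/gpt-5-nano --limit 10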

To run multiple tasks simultaneously, use inspect eval-set:

uv run inspect eval-set inspect_evals/sevenllm_mcq_zh inspect_evals/sevenllm_mcq_en inspect_evals/sevenllm_qa_zh inspect_evals/sevenllm_qa_en

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval, eval_set
from inspect_evals.sevenllm import sevenllm_mcq_zh, sevenllm_mcq_en, sevenllm_qa_zh, sevenllm_qa_en
eval(sevenllm_mcq_zh)
eval_set([sevenllm_mcq_zh, sevenllm_mcq_en, sevenllm_qa_zh, sevenllm_qa_en], log_dir='logs-run-42')
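If you prefer to keep the model choice in code rather than on the command line, eval() also accepts model and limit arguments (shown here as a sketch; the model name and sample limit are illustrative):

# Evaluate one task against a specific model, on a small subset of samples
eval(sevenllm_qa_en, model="openai/gpt-5-nano", limit=10)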

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Dataset

The dataset described in the paper is a bilingual (English and Chinese) instruction dataset specifically created to train LLMs for cybersecurity tasks. It consists of 28 well-defined tasks, categorized into understanding tasks and generation tasks related to cybersecurity incident analysis.

  1. Understanding Tasks (e.g., key entity recognition, malware feature extraction, vulnerability intelligence extraction)
  2. Generation Tasks (e.g., attack strategy analysis, trend prediction, summary generation)
  3. Size: the benchmark contains 1,300 test samples, comprising multiple-choice and query-answer questions.

An example from the dataset:

Input: A phishing campaign targeted employees at a major company using fake login portals. The attacker used the domain phishing[.]com and sent emails requesting password updates. The attack exploited a vulnerability in the email server to bypass spam filters.

Task: Key Entity Recognition

Instruction: Identify key entities such as attacker, victim, attack tools, and indicators of compromise from the text.

Output:

Attacker Information: Unknown (phishing campaign organizer)

Victim: Employees of a major company

Domain Name: phishing[.]com

Vulnerability Exploited: Email server spam filter bypass

Attack Tool: Fake login portals

Scoring

All evaluations use key metrics such as ROUGE-L, which measures the overlap between model-generated and reference responses, and semantic similarity, which uses multilingual sentence transformers to assess how closely the meaning of the output aligns with the reference.
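As a rough illustration of how these two metrics behave (not the benchmark's exact scorer), the sketch below computes ROUGE-L with the rouge_score package and semantic similarity with a multilingual sentence-transformers model; the example strings and the specific embedding model are assumptions chosen for demonstration:

# Sketch: score a model response against a reference answer.
# Assumes the rouge_score and sentence-transformers packages are installed.
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "Attacker used phishing[.]com to target employees with fake login portals."
response = "Employees were targeted via fake login pages hosted at phishing[.]com."

# ROUGE-L: longest-common-subsequence overlap between response and reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, response)["rougeL"].fmeasure

# Semantic similarity: cosine similarity of multilingual sentence embeddings
# (illustrative model choice, not necessarily the one used by the benchmark).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
emb_ref, emb_resp = model.encode([reference, response], convert_to_tensor=True)
similarity = util.cos_sim(emb_ref, emb_resp).item()

print(f"ROUGE-L: {rouge_l:.3f}, semantic similarity: {similarity:.3f}")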