SEvenLLM: A benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.
Designed for analyzing cybersecurity incidents, the benchmark comprises two primary task categories, understanding and generation, which break down further into 28 task subcategories.
Overview
SEvenLLM is a benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.
Usage
First, install the dependencies:
uv sync --extra sevenllm
Then, evaluate against one or more models with:
uv run inspect eval inspect_evals/sevenllm_mcq_zh --model openai/gpt-5-nano
uv run inspect eval inspect_evals/sevenllm_mcq_en --model openai/gpt-5-nano
uv run inspect eval inspect_evals/sevenllm_qa_zh --model openai/gpt-5-nano
uv run inspect eval inspect_evals/sevenllm_qa_en --model openai/gpt-5-nano
To run multiple tasks simultaneously, use inspect eval-set:
uv run inspect eval-set inspect_evals/sevenllm_mcq_zh inspect_evals/sevenllm_mcq_en inspect_evals/sevenllm_qa_zh inspect_evals/sevenllm_qa_en
You can also import tasks as normal Python objects and run them from Python:
from inspect_ai import eval, eval_set
from inspect_evals.sevenllm import sevenllm_mcq_zh, sevenllm_mcq_en, sevenllm_qa_zh, sevenllm_qa_en
eval(sevenllm_mcq_zh)
eval_set([sevenllm_mcq_zh, sevenllm_mcq_en, sevenllm_qa_zh, sevenllm_qa_en], log_dir='logs-run-42')
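When running from Python you can also pass options such as the model and a sample limit directly to eval(). A minimal sketch for a quick smoke test (the model name and limit below are illustrative):

# Quick smoke test on a handful of samples; model and limit are illustrative.
eval(sevenllm_mcq_en, model="openai/gpt-5-nano", limit=10)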
After running evaluations, you can view their logs using the inspect view command:
uv run inspect view
For VS Code, you can also download the Inspect AI extension for viewing logs.
If you don't want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
Dataset
The dataset described in the paper is a bilingual (English and Chinese) instruction dataset specifically created to train LLMs for cybersecurity tasks. It consists of 28 well-defined tasks, categorized into understanding tasks and generation tasks related to cybersecurity incident analysis.
- Understanding Tasks (e.g., key entity recognition, malware feature extraction, vulnerability intelligence extraction)
- Generation Tasks (e.g., attack strategy analysis, trend prediction, summary generation)
- Size: The benchmark contains 1,300 test samples covering multiple-choice and query-answer questions.
An example sample from the dataset:
Input: A phishing campaign targeted employees at a major company using fake login portals. The attacker utilized the domain phishing[.]com and sent emails requesting password updates. This attack exploited a vulnerability in the email server to bypass spam filters.
Task: Key Entity Recognition
Instruction: Identify key entities such as attacker, victim, attack tools, and indicators of compromise from the text.
Output:
Attacker Information: Unknown (phishing campaign organizer)
Victim: Employees of a major company
Domain Name: phishing[.]com
Vulnerability Exploited: Email server spam filter bypass
Attack Tool: Fake login portals
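For orientation, a sample like the one above can be pictured as a simple record. A hypothetical sketch in Python (the field names are illustrative and may not match the dataset's actual schema):

# Hypothetical layout of one QA sample; field names are illustrative,
# not the dataset's exact schema.
sample = {
    "category": "Key Entity Recognition",  # one of the 28 subcategories
    "instruction": (
        "Identify key entities such as attacker, victim, attack tools, "
        "and indicators of compromise from the text."
    ),
    "input": "A phishing campaign targeted employees at a major company ...",
    "output": {
        "Attacker Information": "Unknown (phishing campaign organizer)",
        "Victim": "Employees of a major company",
        "Domain Name": "phishing[.]com",
        "Vulnerability Exploited": "Email server spam filter bypass",
        "Attack Tool": "Fake login portals",
    },
}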
Scoring
All evaluations report two key metrics: ROUGE-L, which measures the overlap between model-generated and reference responses, and semantic similarity, which uses a multilingual sentence transformer to assess how well the meaning of the output aligns with the reference.
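As a rough guide to what these metrics capture, here is a minimal sketch of computing both outside the evaluation harness, assuming the rouge-score and sentence-transformers packages (the embedding model named below is a common multilingual choice and an assumption, not necessarily the one the benchmark uses):

from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "The attacker used phishing[.]com to target company employees."
candidate = "Employees were targeted via the phishing[.]com domain."

# ROUGE-L: F-measure over the longest common subsequence of the two texts.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# Semantic similarity: cosine similarity between sentence embeddings.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode([reference, candidate])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"ROUGE-L: {rouge_l:.3f}  semantic similarity: {similarity:.3f}")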