SEvenLLM: A benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.
Designed for analyzing cybersecurity incidents, the benchmark comprises two primary task categories, understanding and generation, which are further broken down into 28 task subcategories.
Overview
SEvenLLM is a benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.
Usage
Installation
There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.
If you are using it from PyPI, install the package and its dependencies via:
pip install inspect-evals[sevenllm]

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:
uv sync --extra sevenllm

Running evaluations
Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.
uv run inspect eval inspect_evals/sevenllm_mcq_zh --model openai/gpt-5-nano
uv run inspect eval inspect_evals/sevenllm_mcq_en --model openai/gpt-5-nano
uv run inspect eval inspect_evals/sevenllm_qa_zh --model openai/gpt-5-nano
uv run inspect eval inspect_evals/sevenllm_qa_en --model openai/gpt-5-nano

To run multiple tasks simultaneously, use inspect eval-set:
uv run inspect eval-set inspect_evals/sevenllm_mcq_zh inspect_evals/sevenllm_mcq_en inspect_evals/sevenllm_qa_zh inspect_evals/sevenllm_qa_en

You can also import tasks as normal Python objects and run them from Python:
from inspect_ai import eval, eval_set
from inspect_evals.sevenllm import sevenllm_mcq_zh, sevenllm_mcq_en, sevenllm_qa_zh, sevenllm_qa_en
eval(sevenllm_mcq_zh)
eval_set([sevenllm_mcq_zh, sevenllm_mcq_en, sevenllm_qa_zh, sevenllm_qa_en], log_dir='logs-run-42')

After running evaluations, you can view their logs using the inspect view command:
uv run inspect view

For VS Code, you can also download the Inspect AI extension for viewing logs.
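Logs can also be read programmatically. The sketch below is a minimal example, assuming the list_eval_logs and read_eval_log helpers from inspect_ai.log and the hypothetical log directory from the eval_set example above (logs-run-42); adjust the path to wherever your logs were written.

from inspect_ai.log import list_eval_logs, read_eval_log

# Iterate over the logs written to the example directory (name is illustrative).
for log_info in list_eval_logs("logs-run-42"):
    log = read_eval_log(log_info)
    # Print the task name, run status, and any reported metric values.
    print(log.eval.task, log.status)
    if log.results is not None:
        for score in log.results.scores:
            for name, metric in score.metrics.items():
                print(f"  {name}: {metric.value:.3f}")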
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Dataset
The dataset described in the paper is a bilingual (English and Chinese) instruction dataset specifically created to train LLMs for cybersecurity tasks. It consists of 28 well-defined tasks, categorized into understanding tasks and generation tasks related to cybersecurity incident analysis.
- Understanding Tasks (e.g., key entity recognition, malware feature extraction, vulnerability intelligence extraction)
- Generation Tasks (e.g., attack strategy analysis, trend prediction, summary generation)
- Size: the benchmark contains 1,300 test samples with multiple-choice and query-answer questions.
Here is an example from the dataset:

Input: A phishing campaign targeted employees at a major company using fake login portals. The attacker utilized the domain phishing[.]com and sent emails requesting password updates. This attack exploited a vulnerability in the email server to bypass spam filters.
Task: Key Entity Recognition
Instruction: Identify key entities such as attacker, victim, attack tools, and indicators of compromise from the text.
Output:
Attacker Information: Unknown (phishing campaign organizer)
Victim: Employees of a major company
Domain Name: phishing[.]com
Vulnerability Exploited: Email server spam filter bypass
Attack Tool: Fake login portals
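Each sample pairs an instruction with an incident report and a reference answer. As a rough illustration only (the field names below are hypothetical, not the dataset's actual schema), a query-answer sample might be represented like this:

# Hypothetical record layout for a query-answer sample (field names are illustrative).
sample = {
    "category": "Key Entity Recognition",
    "instruction": "Identify key entities such as attacker, victim, attack tools, "
                   "and indicators of compromise from the text.",
    "input": "A phishing campaign targeted employees at a major company ...",
    "output": "Attacker Information: Unknown (phishing campaign organizer) ...",
}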
Scoring
The evaluations are scored with key metrics such as ROUGE-L, which measures the lexical overlap between model-generated and reference responses, and semantic similarity, which uses multilingual sentence transformers to assess how closely the meaning of the model output aligns with the reference.
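The two metrics can be illustrated with off-the-shelf libraries. The sketch below is not the benchmark's exact scoring code; it assumes the rouge_score and sentence_transformers packages and an arbitrary multilingual embedding model.

from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "Attacker used phishing[.]com to target company employees."
candidate = "Employees were targeted via the phishing[.]com domain."

# ROUGE-L: F-measure of the longest common subsequence between candidate and reference.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = rouge.score(reference, candidate)["rougeL"].fmeasure

# Semantic similarity: cosine similarity of multilingual sentence embeddings
# (the model name here is illustrative, not necessarily the one used by the benchmark).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode([candidate, reference], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"ROUGE-L: {rouge_l:.3f}, semantic similarity: {similarity:.3f}")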