SEvenLLM: A benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for security events.
The benchmark is designed for analyzing cybersecurity incidents and comprises two primary task categories, understanding and generation, which are further broken down into 28 task subcategories.
Overview
SEvenLLM is a benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for security events.
Usage
First, install the inspect_ai and inspect_evals Python packages with:
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
pip install jieba sentence_transformers rouge # extra packages required for the SEvenLLM benchmark
Then, evaluate against one or more models with:
inspect eval inspect_evals/sevenllm_mcq_zh --model openai/gpt-4o
inspect eval inspect_evals/sevenllm_mcq_en --model openai/gpt-4o
inspect eval inspect_evals/sevenllm_qa_zh --model openai/gpt-4o
inspect eval inspect_evals/sevenllm_qa_en --model openai/gpt-4o
After running evaluations, you can view their logs using the inspect view command:
inspect view
If you don't want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
Dataset
The dataset described in the paper is a bilingual (English and Chinese) instruction dataset specifically created to train LLMs for cybersecurity tasks. It consists of 28 well-defined tasks, categorized into understanding tasks and generation tasks related to cybersecurity incident analysis.
- Understanding Tasks (e.g., key entity recognition, malware feature extraction, vulnerability intelligence extraction)
- Generation Tasks (e.g., attack strategy analysis, trend prediction, summary generation)
- Size: the benchmark contains 1,300 test samples, split between multiple-choice and query-answer questions.
Example
Input: A phishing campaign targeted employees at a major company using fake login portals. The attacker utilized domain phishing[.]com and sent emails requesting password updates. This attack exploited a vulnerability in the email server to bypass spam filters.
Task: Key Entity Recognition
Instruction: Identify key entities such as attacker, victim, attack tools, and indicators of compromise from the text.
Output:
Attacker Information: Unknown (phishing campaign organizer)
Victim: Employees of a major company
Domain Name: phishing[.]com
Vulnerability Exploited: Email server spam filter bypass
Attack Tool: Fake login portals
Scoring
All evaluations report two key metrics: ROUGE-L, which measures the longest-common-subsequence overlap between model-generated and reference responses, and semantic similarity, which uses multilingual sentence-transformer embeddings to assess how closely the meaning of the output aligns with the reference.
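To make the ROUGE-L metric concrete, here is a minimal pure-Python illustration of how the score is computed. This is a sketch of the underlying idea only; the benchmark itself uses the rouge package listed in the install step, and the helper names below are illustrative.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # Classic dynamic-programming longest common subsequence over tokens.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    # ROUGE-L F1: harmonic mean of LCS-based precision and recall
    # over whitespace-separated tokens.
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

An identical candidate and reference score 1.0, while responses with no shared tokens score 0.0; the semantic-similarity metric complements this by catching paraphrases that ROUGE-L's token overlap misses.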