SEvenLLM: A benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.
Designed for analyzing cybersecurity incidents, the benchmark comprises two primary task categories, understanding and generation, which break down further into 28 task subcategories.
Overview
SEvenLLM is a benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.
Usage
First, install the dependencies:
uv sync --extra sevenllm
Then, evaluate against one or more models with:
uv run inspect eval inspect_evals/sevenllm_mcq_zh --model openai/gpt-5-nano
uv run inspect eval inspect_evals/sevenllm_mcq_en --model openai/gpt-5-nano
uv run inspect eval inspect_evals/sevenllm_qa_zh --model openai/gpt-5-nano
uv run inspect eval inspect_evals/sevenllm_qa_en --model openai/gpt-5-nano
To run multiple tasks simultaneously, use inspect eval-set:
uv run inspect eval-set inspect_evals/sevenllm_mcq_zh inspect_evals/sevenllm_mcq_en inspect_evals/sevenllm_qa_zh inspect_evals/sevenllm_qa_en
You can also import tasks as normal Python objects and run them from Python:
from inspect_ai import eval, eval_set
from inspect_evals.sevenllm import sevenllm_mcq_zh, sevenllm_mcq_en, sevenllm_qa_zh, sevenllm_qa_en
eval(sevenllm_mcq_zh)
eval_set([sevenllm_mcq_zh, sevenllm_mcq_en, sevenllm_qa_zh, sevenllm_qa_en], log_dir='logs-run-42')
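When running from Python you can also pass options such as the model and a sample limit directly to eval(). A minimal sketch for a quick smoke test (the model name and limit below are illustrative):

# Quick smoke test on a handful of samples; model and limit are illustrative.
eval(sevenllm_mcq_en, model="openai/gpt-5-nano", limit=10)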
After running evaluations, you can view their logs using the inspect view command:
uv run inspect view
For VS Code, you can also download the Inspect AI extension for viewing logs.
If you don't want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
Dataset
The dataset described in the paper is a bilingual (English and Chinese) instruction dataset specifically created to train LLMs for cybersecurity tasks. It consists of 28 well-defined tasks, categorized into understanding tasks and generation tasks related to cybersecurity incident analysis.
- Understanding Tasks (e.g., key entity recognition, malware feature extraction, vulnerability intelligence extraction)
- Generation Tasks (e.g., attack strategy analysis, trend prediction, summary generation)
- Size: The benchmark contains 1,300 test samples covering multiple-choice and query-answer questions.
An example sample from the dataset:
Input: A phishing campaign targeted employees at a major company using fake login portals. The attacker utilized the domain phishing[.]com and sent emails requesting password updates. This attack exploited a vulnerability in the email server to bypass spam filters.
Task: Key Entity Recognition
Instruction: Identify key entities such as attacker, victim, attack tools, and indicators of compromise from the text.
Output:
Attacker Information: Unknown (phishing campaign organizer)
Victim: Employees of a major company
Domain Name: phishing[.]com
Vulnerability Exploited: Email server spam filter bypass
Attack Tool: Fake login portals
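For orientation, a sample like the one above can be pictured as a simple record. A hypothetical sketch in Python (the field names are illustrative and may not match the dataset's actual schema):

# Hypothetical layout of one QA sample; field names are illustrative,
# not the dataset's exact schema.
sample = {
    "category": "Key Entity Recognition",  # one of the 28 subcategories
    "instruction": (
        "Identify key entities such as attacker, victim, attack tools, "
        "and indicators of compromise from the text."
    ),
    "input": "A phishing campaign targeted employees at a major company ...",
    "output": {
        "Attacker Information": "Unknown (phishing campaign organizer)",
        "Victim": "Employees of a major company",
        "Domain Name": "phishing[.]com",
        "Vulnerability Exploited": "Email server spam filter bypass",
        "Attack Tool": "Fake login portals",
    },
}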
Scoring
All evaluations report two key metrics: ROUGE-L, which measures the overlap between model-generated and reference responses, and semantic similarity, which uses a multilingual sentence transformer to assess how well the meaning of the output aligns with the reference.
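As a rough guide to what these metrics capture, here is a minimal sketch of computing both outside the evaluation harness, assuming the rouge-score and sentence-transformers packages (the embedding model named below is a common multilingual choice and an assumption, not necessarily the one the benchmark uses):

from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "The attacker used phishing[.]com to target company employees."
candidate = "Employees were targeted via the phishing[.]com domain."

# ROUGE-L: F-measure over the longest common subsequence of the two texts.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# Semantic similarity: cosine similarity between sentence embeddings.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode([reference, candidate])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"ROUGE-L: {rouge_l:.3f}  semantic similarity: {similarity:.3f}")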