SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security
SecQA (“Security Question Answering”) is a dataset for assessing LLMs’ understanding and application of security principles. It comprises “v1” and “v2” sets of multiple-choice questions that provide two levels of cybersecurity evaluation difficulty. The questions were generated by GPT-4 from the “Computer Systems Security: Planning for Success” textbook and vetted by humans.
Overview
SecQA (Security Question Answering) is a benchmark for evaluating the performance of Large Language Models (LLMs) in the domain of computer security. Utilizing multiple-choice questions generated by GPT-4 based on the “Computer Systems Security: Planning for Success” textbook, SecQA aims to assess LLMs’ understanding and application of security principles.
SecQA is organized into two versions: v1 and v2. Both are implemented here in zero-shot and 5-shot settings. Version 1 is designed to assess foundational understanding, presenting questions that cover basic concepts and widely recognized security principles. It serves as a preliminary test to gauge LLMs’ basic comprehension and application of security knowledge. Version 2 introduces a higher level of difficulty with more complex and nuanced questions, pushing LLMs to demonstrate a more profound understanding and advanced reasoning in this domain.
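The 5-shot variants follow the standard few-shot pattern: a handful of worked question–answer pairs are prepended to the target question so the model can infer the expected answer format. As a rough illustration (a generic sketch of the pattern, not the exact prompt template used by inspect_evals):

```python
# Generic few-shot prompt assembly: prepend k solved examples, then pose
# the target question with an empty "Answer:" slot for the model to fill.
# This is an illustrative sketch, not the actual inspect_evals template.
def few_shot_prompt(examples: list[tuple[str, str]], target: str) -> str:
    """Build a few-shot prompt from (question, answer-letter) examples."""
    parts = [f"{question}\nAnswer: {answer}" for question, answer in examples]
    parts.append(f"{target}\nAnswer:")
    return "\n\n".join(parts)
```

In the 5-shot setting, `examples` would hold five vetted SecQA questions with their correct answer letters.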
Usage
First, install the inspect_ai and inspect_evals Python packages with:
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
Then, evaluate against one or more models with:
inspect eval inspect_evals/sec_qa_v1 --model openai/gpt-4o
inspect eval inspect_evals/sec_qa_v1_5_shot --model openai/gpt-4o
inspect eval inspect_evals/sec_qa_v2 --model openai/gpt-4o
inspect eval inspect_evals/sec_qa_v2_5_shot --model openai/gpt-4o
After running evaluations, you can view their logs using the inspect view command:
inspect view
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
inspect eval inspect_evals/sec_qa_v1 --limit 10
inspect eval inspect_evals/sec_qa_v1_5_shot --max-connections 10
inspect eval inspect_evals/sec_qa_v2 --temperature 0.5
See inspect eval --help for all available options.
Dataset
Here is an example from the dataset:
Question: What is the purpose of implementing a Guest Wireless Network in a corporate environment?

A) To provide unrestricted access to company resources
B) To offer a separate, secure network for visitors
C) To replace the primary corporate wireless network
D) To bypass network security protocols

Correct Answer: B
The model is tasked to choose one of the four options.
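Rendering a record like the one above into a prompt amounts to listing the question followed by lettered options. A minimal sketch (field handling here is illustrative; it does not reflect the actual dataset schema or the inspect_evals prompt template):

```python
# Illustrative rendering of a multiple-choice record as a prompt.
# The function name and layout are assumptions for this sketch.
LETTERS = "ABCD"

def format_question(question: str, choices: list[str]) -> str:
    """Render a question and its four options as a lettered prompt."""
    lines = [question]
    for letter, choice in zip(LETTERS, choices):
        lines.append(f"{letter}) {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = format_question(
    "What is the purpose of implementing a Guest Wireless Network "
    "in a corporate environment?",
    [
        "To provide unrestricted access to company resources",
        "To offer a separate, secure network for visitors",
        "To replace the primary corporate wireless network",
        "To bypass network security protocols",
    ],
)
```

The model's completion is then matched against the correct answer letter (B in this example).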
Scoring
Simple accuracy is computed over the datapoints: the fraction of questions for which the model selects the correct option.
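In other words, the metric reduces to the following (a minimal sketch of the accuracy calculation, not the inspect_evals scorer itself):

```python
# Accuracy over multiple-choice predictions: the fraction of items
# where the predicted answer letter matches the correct one.
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Proportion of predictions equal to the corresponding answer."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

print(accuracy(["B", "A", "C"], ["B", "A", "D"]))  # prints 0.6666666666666666
```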