SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security

Cybersecurity

A “Security Question Answering” dataset for assessing LLMs’ understanding and application of security principles. SecQA provides “v1” and “v2” sets of multiple-choice questions at two levels of difficulty. The questions were generated by GPT-4 from the “Computer Systems Security: Planning for Success” textbook and vetted by humans.

Overview

SecQA (Security Question Answering) is a benchmark for evaluating the performance of Large Language Models (LLMs) in the domain of computer security. Utilizing multiple-choice questions generated by GPT-4 based on the “Computer Systems Security: Planning for Success” textbook, SecQA aims to assess LLMs’ understanding and application of security principles.

SecQA is organized into two versions: v1 and v2. Both versions are implemented here in zero-shot and 5-shot settings. Version 1 is designed to assess foundational understanding, presenting questions that cover basic concepts and widely recognized security principles. It serves as a preliminary test to gauge LLMs’ basic comprehension and application of security knowledge. Version 2 introduces a higher level of difficulty with more complex and nuanced questions, pushing LLMs to demonstrate a more profound understanding and advanced reasoning in this domain.
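The 5-shot variants prepend worked examples to each question before asking the model to answer. A minimal sketch of how such a prompt might be assembled (the field names and formatting below are illustrative assumptions, not the eval’s exact template):

```python
# Sketch: assembling a 5-shot multiple-choice prompt.
# The record keys ("question", "choices", "answer") are illustrative
# assumptions, not necessarily the dataset's actual schema.

def format_example(record, include_answer=True):
    """Render one record as a lettered multiple-choice block."""
    lines = [f"Question: {record['question']}"]
    for letter, choice in zip("ABCD", record["choices"]):
        lines.append(f"{letter}) {choice}")
    if include_answer:
        lines.append(f"Answer: {record['answer']}")
    return "\n".join(lines)

def build_prompt(fewshot_records, target_record):
    """Prepend worked examples, then the unanswered target question."""
    parts = [format_example(r) for r in fewshot_records]
    parts.append(format_example(target_record, include_answer=False))
    parts.append("Answer:")
    return "\n\n".join(parts)
```

In the zero-shot setting, `fewshot_records` would simply be empty and the model sees only the target question.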

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Then, evaluate against one or more models with:

inspect eval inspect_evals/sec_qa_v1 --model openai/gpt-4o
inspect eval inspect_evals/sec_qa_v1_5_shot --model openai/gpt-4o
inspect eval inspect_evals/sec_qa_v2 --model openai/gpt-4o
inspect eval inspect_evals/sec_qa_v2_5_shot --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/sec_qa_v1 --limit 10
inspect eval inspect_evals/sec_qa_v1_5_shot --max-connections 10
inspect eval inspect_evals/sec_qa_v2 --temperature 0.5

See inspect eval --help for all available options.

Dataset

Here is an example from the dataset:

Question: What is the purpose of implementing a Guest Wireless Network in a corporate environment?

  A) To provide unrestricted access to company resources

  B) To offer a separate, secure network for visitors

  C) To replace the primary corporate wireless network

  D) To bypass network security protocols

Correct Answer: B

The model is tasked with choosing one of the four options.
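A record like the one above maps naturally onto a (question, lettered choices, target letter) structure for scoring. A minimal conversion sketch, assuming the raw record stores the gold answer as the option text rather than a letter (the input keys here are assumptions, not the dataset’s verified schema):

```python
# Sketch: converting a raw SecQA-style record into a sample with a
# lettered target. The keys "question", "options", and "answer" are
# assumed field names for illustration only.

def record_to_sample(record):
    choices = record["options"]
    # Find which letter (A-D) corresponds to the gold answer text.
    target = "ABCD"[choices.index(record["answer"])]
    return {
        "input": record["question"],
        "choices": choices,
        "target": target,
    }
```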

Scoring

Accuracy is calculated over the datapoints: the fraction of questions for which the model selects the correct option.
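Concretely, the metric is just the share of predicted letters that match the gold letters. A minimal sketch (the helper name is illustrative):

```python
def accuracy(predictions, targets):
    """Fraction of predicted answer letters matching the gold letters."""
    assert len(predictions) == len(targets)
    if not targets:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)
```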