CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge
Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity
Overview
CyberMetric is a collection of four datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity: Disaster Recovery and BCP, Identity and Access Management (IAM), IoT Security, Cryptography, Wireless Security, Network Security, Cloud Security, Penetration Testing, and Compliance/Audit.
Contributed by @neilshaabi
Usage
First, install the dependencies:
uv sync
Then, evaluate against one or more models with:
uv run inspect eval inspect_evals/cybermetric_80 --model openai/gpt-5-nano
uv run inspect eval inspect_evals/cybermetric_500 --model openai/gpt-5-nano
uv run inspect eval inspect_evals/cybermetric_2000 --model openai/gpt-5-nano
uv run inspect eval inspect_evals/cybermetric_10000 --model openai/gpt-5-nano
To run multiple tasks simultaneously, use inspect eval-set:
uv run inspect eval-set inspect_evals/cybermetric_80 inspect_evals/cybermetric_500 inspect_evals/cybermetric_2000 inspect_evals/cybermetric_10000
You can also import tasks as normal Python objects and run them from Python:
from inspect_ai import eval, eval_set
from inspect_evals.cybermetric import cybermetric_80, cybermetric_500, cybermetric_2000, cybermetric_10000
eval(cybermetric_80)
eval_set([cybermetric_80, cybermetric_500, cybermetric_2000, cybermetric_10000], log_dir='logs-run-42')
After running evaluations, you can view their logs using the inspect view command:
uv run inspect view
For VS Code, you can also download the Inspect AI extension for viewing logs.
If you don’t want to specify the --model option each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/cybermetric_80 --limit 10
uv run inspect eval inspect_evals/cybermetric_500 --max-connections 10
uv run inspect eval inspect_evals/cybermetric_2000 --temperature 0.5
See uv run inspect eval --help for all available options.
Dataset
Here is an example prompt from the dataset (after it has been further processed by Inspect):
Answer the following multiple choice question. The entire content of your response should be of the following format: ‘ANSWER: $LETTER’ (without quotes) where LETTER is one of A,B,C,D.
In cryptography, what is the purpose of using a key-derivation function (KDF)?
A) Generate public keys
B) Authenticate digital signatures
C) Encrypt data using a password
D) KDF are algorithms used to transform a secret into crucial parameters like keys and Initialization Vectors (IVs).
The model is then tasked with picking the correct choice.
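To illustrate the response format requested in the prompt above, here is a minimal sketch of how an answer letter could be extracted from a model response. This is not Inspect's own parsing code: the helper name `extract_answer_letter` and the regular expression are assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration only; Inspect's built-in
# multiple-choice handling performs its own parsing.
ANSWER_PATTERN = re.compile(r"ANSWER:\s*([A-D])\b")


def extract_answer_letter(response: str) -> Optional[str]:
    """Return the letter following 'ANSWER:', or None if the format is not followed."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1) if match else None


print(extract_answer_letter("ANSWER: D"))
print(extract_answer_letter("I think the last option is right."))
```

A response that does not follow the requested format yields `None` in this sketch, so it would simply count as incorrect rather than raise an error.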
Scoring
The score is simple accuracy: the fraction of questions for which the model picks the correct choice.
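As a rough sketch of what accuracy over the datapoints means here, the following computes the share of predicted letters that match the target letters. The function name `accuracy` and the list-based inputs are assumptions for illustration; Inspect computes this metric internally.

```python
from typing import Optional, Sequence


def accuracy(predictions: Sequence[Optional[str]], targets: Sequence[str]) -> float:
    """Fraction of predictions equal to their target letter (illustrative helper)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # A prediction of None (unparseable response) never matches a target.
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Two of the four made-up predictions match their targets.
print(accuracy(["A", "B", None, "D"], ["A", "C", "C", "D"]))  # 0.5
```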