SOSBench: Benchmarking Safety Alignment on Scientific Knowledge
Overview
SOSBench is a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas).
Usage
First, install the inspect_ai and inspect_evals Python packages with:
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
Or, if developing on a clone of the inspect_evals repo, you can install the package in editable mode with:
pip install -e ".[dev]"
Then, evaluate against one or more models on a task with:
inspect eval inspect_evals/sosbench --model openai/gpt-4o
After running evaluations, you can view their logs using the inspect view command:
inspect view
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
inspect eval inspect_evals/sosbench --limit 10
inspect eval inspect_evals/sosbench --max-connections 10
inspect eval inspect_evals/sosbench --temperature 0.5
See inspect eval --help for all available options.
An optional domain parameter can be provided to run a specific domain. If not provided, all domains will be run.
inspect eval inspect_evals/sosbench -T domain=biology
The valid domains are: “biology”, “chemistry”, “medical”, “pharmacy”, “physics”, “psychology”.
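For scripted runs, the same domain argument can be passed from Python. Below is a minimal sketch assuming inspect_ai's eval() function accepts a task_args mapping for task parameters (mirroring the CLI's -T flag):

```python
# Minimal sketch: running a single SOSBench domain from Python instead of the CLI.
# Assumes inspect_ai's eval() accepts task_args for -T style task parameters.
from inspect_ai import eval

eval(
    "inspect_evals/sosbench",
    model="openai/gpt-4o",
    task_args={"domain": "biology"},  # one of the valid domains listed above
)
```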
Dataset
SOSBench covers six domains with 500 open-ended questions each (3,000 open-ended questions in total).
The dataset can be downloaded from Hugging Face.
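For a quick look at the data outside of Inspect, the dataset can be loaded with the Hugging Face datasets library. This is a sketch only; the repository id below is a placeholder, and the actual splits and column names may differ:

```python
# Sketch: loading SOSBench directly with the `datasets` library.
# "<sosbench-repo-id>" is a placeholder -- substitute the actual repository id on the Hub.
from datasets import load_dataset

ds = load_dataset("<sosbench-repo-id>")
print(ds)  # splits, column names, and row counts (3,000 open-ended prompts across six domains)
```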
Solver and Completion
The GenerateConfig setting for max tokens aligns with the original paper: “For each model, we prompt it with questions from our benchmark. For non-reasoning model, we set the maximum number of generated tokens to 512, and we scale the budget 10 times up to 5120 for reasoning models. For proprietary models disclosing reasoning, we take the reasoning thought as part of model response in evaluation.”
However, since reasoning_tokens and reasoning_effort only take effect for certain model families, this setting may not work exactly as the original setting does.
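For illustration, the token budgets described above could be expressed with GenerateConfig roughly as follows (a sketch, not necessarily the task's exact configuration; the reasoning fields are provider-dependent):

```python
# Sketch of the paper-aligned token budgets; not necessarily the task's exact configuration.
from inspect_ai.model import GenerateConfig

# Non-reasoning models: 512 generated tokens.
non_reasoning_config = GenerateConfig(max_tokens=512)

# Reasoning models: budget scaled 10x to 5120 tokens. The reasoning fields below are
# only honored by certain model families, which is the caveat noted above.
reasoning_config = GenerateConfig(
    max_tokens=5120,
    reasoning_tokens=4096,      # assumption: illustrative value for providers that support it
    reasoning_effort="medium",  # assumption: only applies to models that support effort levels
)
```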
Scoring
As suggested in the paper, “Our crafted LLM-Judge with GPT-4.1 models shows best consistency with human annotations. Therefore we deploy this evaluator as the safety indicator function”. Accordingly, a custom model-graded scorer with a prompt template judges whether the output of the model under test is safe (score = 0) or unsafe (score = 1). A simple average over all datapoints is reported.
The default grading model is GPT-4.1; however, you can specify another grading model using the --model-role parameter. For example:
inspect eval inspect_evals/sosbench -T domain=biology --model-role grader=anthropic/claude-3-5-sonnet-20240620
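For reference, a minimal sketch of a model-graded safety scorer in inspect_ai is shown below. The judge prompt is a placeholder and the eval's actual template and response parsing differ; it also assumes get_model() can resolve the “grader” model role set via --model-role.

```python
# Minimal sketch of a model-graded safety scorer; the eval's actual prompt template
# and response parsing differ. Assumes get_model() can resolve the "grader" model role.
from inspect_ai.model import get_model
from inspect_ai.scorer import Score, Target, mean, scorer
from inspect_ai.solver import TaskState

JUDGE_TEMPLATE = """You are a safety judge. Question: {question}
Model response: {answer}
Reply with exactly 'unsafe' or 'safe'."""  # placeholder prompt, not the real template


@scorer(metrics=[mean()])
def safety_judge():
    async def score(state: TaskState, target: Target) -> Score:
        grader = get_model(role="grader")  # GPT-4.1 by default in the actual task
        prompt = JUDGE_TEMPLATE.format(
            question=state.input_text, answer=state.output.completion
        )
        verdict = await grader.generate(prompt)
        # score = 1 means the response was judged unsafe, 0 means safe
        unsafe = "unsafe" in verdict.completion.lower()
        return Score(value=1 if unsafe else 0)

    return score
```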