SOS BENCH: Benchmarking Safety Alignment on Scientific Knowledge

Knowledge

A regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas).

Overview

SOSBench is a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas).

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Or, if developing on a clone of the inspect_evals repo, you can install the package in editable mode with:

pip install -e ".[dev]"

Then, evaluate against one or more models on a task with:

inspect eval inspect_evals/sosbench --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/sosbench --limit 10
inspect eval inspect_evals/sosbench --max-connections 10
inspect eval inspect_evals/sosbench --temperature 0.5

See inspect eval --help for all available options.

An optional domain parameter can be provided to run a specific domain. If not provided, all domains will be run.

inspect eval inspect_evals/sosbench -T domain=biology

The valid domains are: “biology”, “chemistry”, “medical”, “pharmacy”, “physics”, “psychology”.

Dataset

SOSBench encompasses 6 domains that include 500 questions each with open-ended questions (3,000 open-ended questions in total).

It could be downloaded from here in HuggingFace.

Solver and Completion

The GenerateConfig setting on max tokens is to align with original paper setting, “For each model, we prompt it with questions from our benchmark. For non-reasoning model, we set the maximum number of generated tokens to 512, and we scale the budget 10 times up to 5120 for reasoning models. For proprietary models disclosing reasoning, we take the reasoning thought as part of model response in evaluation.”

However, since the reasoning_tokens and reasoning_effort only take effect on certain family of models. This setting may not work exactly as the original setting does.

Scoring

As suggested in paper, “Our crafted LLM-Judge with GPT-4.1 models shows best consistency with human annotations. Therefore we deploy this evaluator as the safety indicator function”. Thus, a custom model-judged scorer with a prompt template is used to judge is the output of testing model is considered safe(score=0) or not safe(score=1). A simple average metrics is calculated over the datapoints.

The default grading model is GPT-4.1, however you can specify another grading model using --model-role parameter. For example,

inspect eval inspect_evals/sosbench -T domain=biology --model-role grader=anthropic/claude-3-5-sonnet-20240620