SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

Overview

The Scientific Knowledge Evaluation (SciKnowEval) benchmark for Large Language Models (LLMs) is inspired by the profound principles outlined in the “Doctrine of the Mean” from ancient Chinese philosophy. This benchmark is designed to assess LLMs based on their proficiency in Studying Extensively, Enquiring Earnestly, Thinking Profoundly, Discerning Clearly, and Practicing Assiduously. Each of these dimensions offers a unique perspective on evaluating the capabilities of LLMs in handling scientific knowledge.

Usage

Install the eval-specific packages:

pip install -e ".[sciknoweval]"

domain can be one of four values: biology, chemistry, material, physics.
level can be one of five values: l1, l2, l3, l4, l5.
task names can be found in the Task Mapping section below.

This will run all the tasks under L4 in the Biology domain:

inspect eval inspect_evals/sciknoweval -T domain=biology -T level=l4

This will run the task chemical_protocol_procedure_design in the Chemistry domain:

inspect eval inspect_evals/sciknoweval -T domain=chemistry -T task=chemical_protocol_procedure_design

This will run all the tasks in the Material domain:

inspect eval inspect_evals/sciknoweval -T domain=material

For model-graded tasks, you also need to specify a grader model via the grader model role, e.g. --model-role grader=openai/deepseek-v3-250324. By default, gpt-4.1-2025-04-14 is used (a combined example follows the environment variables below).

Environment Variables Example

OPENAI_API_KEY=xxx
INSPECT_EVAL_MODEL=openai/gpt-4.1-2025-04-14
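
Putting these together, a full model-graded run might look like this (the grader model shown is only an example):

inspect eval inspect_evals/sciknoweval -T domain=biology -T level=l5 --model-role grader=openai/deepseek-v3-250324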

Dataset

The dataset has been updated since this implementation was written, but the paper has not been updated accordingly. For now, the task mapping is therefore kept in sync with the old dataset version (v1).

Note: Due to the dataset's size, loading all samples and starting the eval may take 1-2 minutes when running on the entire dataset. During this time Inspect's terminal UI may also load slowly and be unresponsive.
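
For a quick smoke test that avoids the long startup, you can cap the number of samples with Inspect's standard --limit option (the limit value below is arbitrary):

inspect eval inspect_evals/sciknoweval -T domain=material --limit 10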

Task Mapping

Use this table to look up valid task names for the task parameter in the command; an example command follows the table.

| Domain    | Level | Task                                        |
|-----------|-------|---------------------------------------------|
| Biology   | L1    | biological_literature_qa                    |
|           |       | protein_property_identification             |
|           |       | protein_captioning                          |
|           | L2    | drug_drug_relation_extraction               |
|           |       | biomedical_judgment_and_interpretation      |
|           |       | compound_disease_relation_extraction        |
|           |       | gene_disease_relation_extraction            |
|           |       | detailed_understanding                      |
|           |       | text_summary                                |
|           |       | hypothesis_verification                     |
|           |       | reasoning_and_interpretation                |
|           | L3    | solubility_prediction                       |
|           |       | beta_lactamase_activity_prediction          |
|           |       | fluorescence_prediction                     |
|           |       | gb1_fitness_prediction                      |
|           |       | stability_prediction                        |
|           |       | protein_protein_interaction                 |
|           |       | biological_calculation                      |
|           | L4    | biological_harmful_qa                       |
|           |       | proteotoxicity_prediction                   |
|           |       | biological_laboratory_safety_test           |
|           | L5    | biological_protocol_procedure_design        |
|           |       | biological_protocol_reagent_design          |
|           |       | protein_design                              |
|           |       | single_cell_analysis                        |
| Chemistry | L1    | molecular_name_conversion                   |
|           |       | molecular_property_identification           |
|           |       | chemical_literature_qa                      |
|           |       | molecular_captioning                        |
|           | L2    | reaction_mechanism_inference                |
|           |       | compound_identification_and_properties      |
|           |       | doping_extraction                           |
|           |       | detailed_understanding                      |
|           |       | text_summary                                |
|           |       | hypothesis_verification                     |
|           |       | reasoning_and_interpretation                |
|           | L3    | molar_weight_calculation                    |
|           |       | molecular_property_calculation              |
|           |       | molecular_structure_prediction              |
|           |       | reaction_prediction                         |
|           |       | retrosynthesis                              |
|           |       | balancing_chemical_equation                 |
|           |       | chemical_calculation                        |
|           | L4    | chemical_harmful_qa                         |
|           |       | molecular_toxicity_prediction               |
|           |       | chemical_laboratory_safety_test             |
|           | L5    | molecular_generation                        |
|           |       | chemical_protocol_procedure_design          |
|           |       | chemical_protocol_reagent_design            |
| Material  | L1    | material_literature_qa                      |
|           | L2    | chemical_composition_extraction             |
|           |       | digital_data_extraction                     |
|           |       | detailed_understanding                      |
|           |       | text_summary                                |
|           |       | hypothesis_verification                     |
|           |       | reasoning_and_interpretation                |
|           | L3    | valence_electron_difference_calculation     |
|           |       | material_calculation                        |
|           |       | lattice_volume_calculation                  |
|           |       | perovskite_stability_prediction             |
|           |       | diffusion_rate_analysis                     |
|           | L4    | material_safety_qa                          |
|           |       | material_toxicity_prediction                |
|           | L5    | properties_utilization_analysis             |
|           |       | crystal_structure_and_composition_analysis  |
|           |       | specified_band_gap_material_generation      |
| Physics   | L1    | physics_literature_qa                       |
|           |       | fundamental_physics_exam                    |
|           | L2    | detailed_understanding                      |
|           |       | text_summary                                |
|           |       | hypothesis_verification                     |
|           |       | reasoning_and_interpretation                |
|           | L3    | high_school_physics_calculation             |
|           |       | general_physics_calculation                 |
|           |       | physics_formula_derivation                  |
|           | L4    | physics_safety_qa                           |
|           |       | laboratory_safety_test                      |
|           | L5    | physics_problem_solving                     |
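
For example, after looking up a name in the table above, running the detailed_understanding task in the Physics domain follows the same pattern as the earlier commands:

inspect eval inspect_evals/sciknoweval -T domain=physics -T task=detailed_understanding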

Scoring

Since the dataset contains various types of tasks (e.g. true/false, multiple choice, relation extraction), we use different scoring methods. The choice of scorer corresponds to the metrics field in the task_mapping.py file.
For tasks whose metrics value is GEN_BLEU_ROUGE or GEN_MOLECULE, the Inspect implementation reports the average of the individual result scores, which the original code returns directly as a dict.
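
As a minimal sketch of that averaging step (the function name and dict keys below are illustrative assumptions, not the actual implementation):

```python
def average_subscores(subscores: dict[str, float]) -> float:
    """Collapse a dict of sub-metric scores into the single value
    reported for GEN_BLEU_ROUGE / GEN_MOLECULE tasks."""
    return sum(subscores.values()) / len(subscores)

# Hypothetical sub-scores, in the dict shape the original code returns:
example = {"bleu2": 0.41, "bleu4": 0.27, "rouge1": 0.55, "rouge2": 0.33, "rougeL": 0.48}
print(average_subscores(example))  # -> 0.408
```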