SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

Overview

The Scientific Knowledge Evaluation (SciKnowEval) benchmark for Large Language Models (LLMs) is inspired by the profound principles outlined in the “Doctrine of the Mean” from ancient Chinese philosophy. This benchmark is designed to assess LLMs based on their proficiency in Studying Extensively, Enquiring Earnestly, Thinking Profoundly, Discerning Clearly, and Practicing Assiduously. Each of these dimensions offers a unique perspective on evaluating the capabilities of LLMs in handling scientific knowledge.

Usage

Install the eval-specific packages:

pip install -e ".[sciknoweval]"

domain can be one of four values: biology, chemistry, material, physics.
level can be one of five values: l1, l2, l3, l4, l5.
task names can be found in the Task Mapping section below.

This will run all the tasks under L4 in the Biology domain:

inspect eval inspect_evals/sciknoweval -T domain=biology -T level=l4

This will run the task chemical_protocol_procedure_design in the Chemistry domain:

inspect eval inspect_evals/sciknoweval -T domain=chemistry -T task=chemical_protocol_procedure_design

This will run all the tasks in the Material domain:

inspect eval inspect_evals/sciknoweval -T domain=material

For model-graded tasks, you also need to specify a grader model via the grader model role, e.g. --model-role grader=openai/deepseek-v3-250324. By default, gpt-4.1-2025-04-14 is used (a combined example follows the environment variables below).

Environment Variables Example

OPENAI_API_KEY=xxx
INSPECT_EVAL_MODEL=openai/gpt-4.1-2025-04-14
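
Putting these together, a full model-graded run might look like this (the grader model shown is only an example):

inspect eval inspect_evals/sciknoweval -T domain=biology -T level=l5 --model-role grader=openai/deepseek-v3-250324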

Dataset

The dataset has been updated since this implementation was written, but the paper has not been updated accordingly. For now, the task mapping is therefore kept in sync with the old dataset version (v1).

Note: Due to the dataset's size, loading all samples and starting the eval may take 1-2 minutes when running on the entire dataset. During this time Inspect's terminal UI may also load slowly and be unresponsive.
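
For a quick smoke test that avoids the long startup, you can cap the number of samples with Inspect's standard --limit option (the limit value below is arbitrary):

inspect eval inspect_evals/sciknoweval -T domain=material --limit 10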

Task Mapping

Use this table to look up valid task names for the task parameter in the command; an example command follows the table.

| Domain    | Level | Task                                        |
|-----------|-------|---------------------------------------------|
| Biology   | L1    | biological_literature_qa                    |
|           |       | protein_property_identification             |
|           |       | protein_captioning                          |
|           | L2    | drug_drug_relation_extraction               |
|           |       | biomedical_judgment_and_interpretation      |
|           |       | compound_disease_relation_extraction        |
|           |       | gene_disease_relation_extraction            |
|           |       | detailed_understanding                      |
|           |       | text_summary                                |
|           |       | hypothesis_verification                     |
|           |       | reasoning_and_interpretation                |
|           | L3    | solubility_prediction                       |
|           |       | beta_lactamase_activity_prediction          |
|           |       | fluorescence_prediction                     |
|           |       | gb1_fitness_prediction                      |
|           |       | stability_prediction                        |
|           |       | protein_protein_interaction                 |
|           |       | biological_calculation                      |
|           | L4    | biological_harmful_qa                       |
|           |       | proteotoxicity_prediction                   |
|           |       | biological_laboratory_safety_test           |
|           | L5    | biological_protocol_procedure_design        |
|           |       | biological_protocol_reagent_design          |
|           |       | protein_design                              |
|           |       | single_cell_analysis                        |
| Chemistry | L1    | molecular_name_conversion                   |
|           |       | molecular_property_identification           |
|           |       | chemical_literature_qa                      |
|           |       | molecular_captioning                        |
|           | L2    | reaction_mechanism_inference                |
|           |       | compound_identification_and_properties      |
|           |       | doping_extraction                           |
|           |       | detailed_understanding                      |
|           |       | text_summary                                |
|           |       | hypothesis_verification                     |
|           |       | reasoning_and_interpretation                |
|           | L3    | molar_weight_calculation                    |
|           |       | molecular_property_calculation              |
|           |       | molecular_structure_prediction              |
|           |       | reaction_prediction                         |
|           |       | retrosynthesis                              |
|           |       | balancing_chemical_equation                 |
|           |       | chemical_calculation                        |
|           | L4    | chemical_harmful_qa                         |
|           |       | molecular_toxicity_prediction               |
|           |       | chemical_laboratory_safety_test             |
|           | L5    | molecular_generation                        |
|           |       | chemical_protocol_procedure_design          |
|           |       | chemical_protocol_reagent_design            |
| Material  | L1    | material_literature_qa                      |
|           | L2    | chemical_composition_extraction             |
|           |       | digital_data_extraction                     |
|           |       | detailed_understanding                      |
|           |       | text_summary                                |
|           |       | hypothesis_verification                     |
|           |       | reasoning_and_interpretation                |
|           | L3    | valence_electron_difference_calculation     |
|           |       | material_calculation                        |
|           |       | lattice_volume_calculation                  |
|           |       | perovskite_stability_prediction             |
|           |       | diffusion_rate_analysis                     |
|           | L4    | material_safety_qa                          |
|           |       | material_toxicity_prediction                |
|           | L5    | properties_utilization_analysis             |
|           |       | crystal_structure_and_composition_analysis  |
|           |       | specified_band_gap_material_generation      |
| Physics   | L1    | physics_literature_qa                       |
|           |       | fundamental_physics_exam                    |
|           | L2    | detailed_understanding                      |
|           |       | text_summary                                |
|           |       | hypothesis_verification                     |
|           |       | reasoning_and_interpretation                |
|           | L3    | high_school_physics_calculation             |
|           |       | general_physics_calculation                 |
|           |       | physics_formula_derivation                  |
|           | L4    | physics_safety_qa                           |
|           |       | laboratory_safety_test                      |
|           | L5    | physics_problem_solving                     |
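
For example, after looking up a name in the table above, running the detailed_understanding task in the Physics domain follows the same pattern as the earlier commands:

inspect eval inspect_evals/sciknoweval -T domain=physics -T task=detailed_understanding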

Scoring

Since the dataset contains various types of tasks (e.g. true/false, multiple choice, relation extraction), we use different scoring methods. The choice of scorer corresponds to the metrics field in the task_mapping.py file.
For tasks whose metrics value is GEN_BLEU_ROUGE or GEN_MOLECULE, the Inspect implementation reports the average of the individual result scores, which the original code returns directly as a dict.
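
As a minimal sketch of that averaging step (the function name and dict keys below are illustrative assumptions, not the actual implementation):

```python
def average_subscores(subscores: dict[str, float]) -> float:
    """Collapse a dict of sub-metric scores into the single value
    reported for GEN_BLEU_ROUGE / GEN_MOLECULE tasks."""
    return sum(subscores.values()) / len(subscores)

# Hypothetical sub-scores, in the dict shape the original code returns:
example = {"bleu2": 0.41, "bleu4": 0.27, "rouge1": 0.55, "rouge2": 0.33, "rougeL": 0.48}
print(average_subscores(example))  # -> 0.408
```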