SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models
Overview
The Scientific Knowledge Evaluation (SciKnowEval) benchmark for Large Language Models (LLMs) is inspired by the profound principles outlined in the “Doctrine of the Mean” from ancient Chinese philosophy. This benchmark is designed to assess LLMs based on their proficiency in Studying Extensively, Enquiring Earnestly, Thinking Profoundly, Discerning Clearly, and Practicing Assiduously. Each of these dimensions offers a unique perspective on evaluating the capabilities of LLMs in handling scientific knowledge.
Usage
Install the eval-specific packages:
pip install -e ".[sciknoweval]"
When running the eval, you can control what is evaluated with the `domain`, `level`, and `task` task parameters (passed with `-T`):

- `domain` can be one of four values: `biology`, `chemistry`, `material`, `physics`.
- `level` can be one of five values: `l1`, `l2`, `l3`, `l4`, `l5`.
- `task` names can be found in the Task Mapping section.
This will run all the tasks under L4 in the Biology domain:
inspect eval inspect_evals/sciknoweval -T domain=biology -T level=l4
This will run the task `chemical_protocol_procedure_design` in the Chemistry domain:
inspect eval inspect_evals/sciknoweval -T domain=chemistry -T task=chemical_protocol_procedure_design
This will run all the tasks in the Material domain:
inspect eval inspect_evals/sciknoweval -T domain=material
For model-graded tasks, you also need to specify a grader model via the `--model-role` parameter, e.g. `--model-role grader=openai/deepseek-v3-250324`. By default, `gpt-4.1-2025-04-14` is used.
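For example, the command below combines the Biology L4 run above with an explicitly chosen grader (the grader model shown is just an illustration; any supported model identifier works):
inspect eval inspect_evals/sciknoweval -T domain=biology -T level=l4 --model-role grader=openai/deepseek-v3-250324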
Environment Variables Example
OPENAI_API_KEY=xxx
INSPECT_EVAL_MODEL=openai/gpt-4.1-2025-04-14
Dataset
The dataset has been updated since this implementation was written, but the paper has not been updated accordingly, so for now we keep the task mapping in sync with the old dataset version (v1). Note: due to the dataset's size, loading all samples and starting the eval may take 1-2 minutes when running on the entire dataset. During this time Inspect's terminal UI may also load slowly and be unresponsive.
Task Mapping
This table lists the task names to use with the `task` parameter in the command.
| Domain | Level | Task |
|---|---|---|
| Biology | L1 | biological_literature_qa |
| | | protein_property_identification |
| | | protein_captioning |
| | L2 | drug_drug_relation_extraction |
| | | biomedical_judgment_and_interpretation |
| | | compound_disease_relation_extraction |
| | | gene_disease_relation_extraction |
| | | detailed_understanding |
| | | text_summary |
| | | hypothesis_verification |
| | | reasoning_and_interpretation |
| | L3 | solubility_prediction |
| | | beta_lactamase_activity_prediction |
| | | fluorescence_prediction |
| | | gb1_fitness_prediction |
| | | stability_prediction |
| | | protein_protein_interaction |
| | | biological_calculation |
| | L4 | biological_harmful_qa |
| | | proteotoxicity_prediction |
| | | biological_laboratory_safety_test |
| | L5 | biological_protocol_procedure_design |
| | | biological_protocol_reagent_design |
| | | protein_design |
| | | single_cell_analysis |
| Chemistry | L1 | molecular_name_conversion |
| | | molecular_property_identification |
| | | chemical_literature_qa |
| | | molecular_captioning |
| | L2 | reaction_mechanism_inference |
| | | compound_identification_and_properties |
| | | doping_extraction |
| | | detailed_understanding |
| | | text_summary |
| | | hypothesis_verification |
| | | reasoning_and_interpretation |
| | L3 | molar_weight_calculation |
| | | molecular_property_calculation |
| | | molecular_structure_prediction |
| | | reaction_prediction |
| | | retrosynthesis |
| | | balancing_chemical_equation |
| | | chemical_calculation |
| | L4 | chemical_harmful_qa |
| | | molecular_toxicity_prediction |
| | | chemical_laboratory_safety_test |
| | L5 | molecular_generation |
| | | chemical_protocol_procedure_design |
| | | chemical_protocol_reagent_design |
| Material | L1 | material_literature_qa |
| | L2 | chemical_composition_extraction |
| | | digital_data_extraction |
| | | detailed_understanding |
| | | text_summary |
| | | hypothesis_verification |
| | | reasoning_and_interpretation |
| | L3 | valence_electron_difference_calculation |
| | | material_calculation |
| | | lattice_volume_calculation |
| | | perovskite_stability_prediction |
| | | diffusion_rate_analysis |
| | L4 | material_safety_qa |
| | | material_toxicity_prediction |
| | L5 | properties_utilization_analysis |
| | | crystal_structure_and_composition_analysis |
| | | specified_band_gap_material_generation |
| Physics | L1 | physics_literature_qa |
| | | fundamental_physics_exam |
| | L2 | detailed_understanding |
| | | text_summary |
| | | hypothesis_verification |
| | | reasoning_and_interpretation |
| | L3 | high_school_physics_calculation |
| | | general_physics_calculation |
| | | physics_formula_derivation |
| | L4 | physics_safety_qa |
| | | laboratory_safety_test |
| | L5 | physics_problem_solving |
Scoring
Since the dataset contains various types of tasks (e.g. true-or-false, multiple-choice, relation extraction), we use different scoring methods. The choice of scorer corresponds to the `metrics` field in the `task_mapping.py` file.
For tasks with the `metrics` values `GEN_BLEU_ROUGE` and `GEN_MOLECULE`, the Inspect implementation takes the average of the individual result scores, which are returned directly as a dict in the original code.
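As a rough illustration only (not the actual implementation, and the metric names below are made up), the averaging reduces a per-sample dict of metric scores to a single value:

```python
# Illustrative sketch: the original scorer returns several metric scores in a
# dict (e.g. BLEU/ROUGE variants); the Inspect port reduces them to one value.
per_sample_scores = {
    "bleu2": 0.41,
    "bleu4": 0.28,
    "rouge1": 0.52,
    "rouge2": 0.33,
    "rougeL": 0.47,
}

# The sample's score is the mean of the individual metric scores.
value = sum(per_sample_scores.values()) / len(per_sample_scores)
print(round(value, 3))  # 0.402
```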