FrontierScience: Expert-Level Scientific Reasoning

Knowledge

Evaluates AI capabilities for expert-level scientific reasoning across physics, chemistry, and biology. Contains 160 problems with two evaluation formats: Olympic (100 samples with reference answers) and Research (60 samples with rubrics).

Overview

FrontierScience is a benchmark introduced by OpenAI in December 2025 that evaluates AI capabilities for expert-level scientific reasoning across physics, chemistry, and biology.

The benchmark contains 160 problems with LaTeX-formatted questions and answers, requiring step-by-step scientific reasoning. Problems range from constrained theoretical questions to open-ended multi-step research tasks.

Contributed by @tommyly201, @mnarayan

Usage

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repository. If that’s not the case and you are not using uv to manage dependencies in your own project, you can run the same commands with the uv run prefix dropped.

# Run all samples (both olympic and research formats)
uv run inspect eval inspect_evals/frontierscience --model openai/gpt-4o

# Run only olympic format (100 samples)
uv run inspect eval inspect_evals/frontierscience --model openai/gpt-4o -T format=olympic

# Run only research format (60 samples)
uv run inspect eval inspect_evals/frontierscience --model openai/gpt-4o -T format=research

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval
from inspect_evals.frontierscience import frontierscience

# Run all formats
eval(frontierscience())

# Run only olympic format
eval(frontierscience(format="olympic"))

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-sonnet-4-20250514
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/frontierscience --limit 10
uv run inspect eval inspect_evals/frontierscience --max-connections 10

See uv run inspect eval --help for all available options.

Parameters

frontierscience

  • format (str | None): The evaluation format - “olympic” for reference answer grading or “research” for rubric-based grading. If None (default), includes both formats. (default: None)
  • subjects (list | str): Filter by subject(s) - “physics”, “chemistry”, or “biology”. Can be a single subject or list. If empty, all subjects are included. (default: [])
  • grader_model (str | None): Model to use for grading answer equivalence. If None, uses the default model. (default: None)
  • shuffle (bool): Whether to shuffle the dataset order. (default: True)
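For example, the parameters above can be combined when constructing the task from Python. A minimal sketch, assuming the same setup as the earlier Python example:

from inspect_ai import eval
from inspect_evals.frontierscience import frontierscience

# Olympic-format physics problems only, with dataset shuffling disabled
eval(frontierscience(format="olympic", subjects="physics", shuffle=False))

# Research-format problems for two subjects, passed as a list
eval(frontierscience(format="research", subjects=["chemistry", "biology"]))

Subject filtering from the command line (e.g. -T subjects=physics) should behave the same way, though only the format flag is shown in the CLI examples above.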

Dataset

The dataset contains 160 expert-level scientific problems from the openai/frontierscience HuggingFace repository.

The benchmark includes two evaluation formats:

  • Olympic (100 samples): Problems with short reference answers (LaTeX expressions, formulas, values). Graded as correct/incorrect based on semantic equivalence.
  • Research (60 samples): Open-ended problems with detailed rubrics specifying point allocations.

The format is auto-detected based on whether the answer field contains rubric markers (“Points:” and “, Item:”).
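As a rough sketch of what that auto-detection amounts to (illustrative only; the function name and exact logic here are assumptions, not the benchmark’s actual code):

def detect_format(answer: str) -> str:
    # Hypothetical helper: answers containing rubric markers are treated as
    # research problems; everything else is graded in the olympic format.
    if "Points:" in answer and ", Item:" in answer:
        return "research"
    return "olympic"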

Each sample includes metadata with:

  • subject: The subject area (physics, chemistry, or biology)
  • task_group_id: UUID for grouping related problems
  • format: The detected format (“olympic” or “research”)
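For illustration, a sample’s metadata might look like the following (the values shown are made up):

metadata = {
    "subject": "physics",        # one of physics, chemistry, or biology
    "task_group_id": "<uuid>",   # placeholder; real samples carry a UUID
    "format": "olympic",         # auto-detected format
}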

Example (Olympic format)

Problem: Consider a dielectric block with polarization (P) and internal energy (U) which obeys…

Reference Answer: \(U=CT\)

Example (Research format)

Problem: Describe the procedure for Weak Value Amplification…

Rubric: Points: 1.5, Item: Background Theory - Qualitatively describes the procedure… (0.5pts) States that the meter and ancilla are weakly coupled…

Scoring

This implementation uses format-specific scorers following the official FrontierScience paper:

Olympic Format

  • Uses model_graded_fact() with the official FrontierScience Olympic judge prompt
  • Handles LaTeX expressions, numerical rounding and decimal places, chemical formulas, and unit equivalence
  • Returns CORRECT or INCORRECT (binary scoring)
  • Metrics: accuracy, stderr
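As a rough illustration of the wiring described above (a sketch, not the benchmark’s source; OLYMPIC_JUDGE_PROMPT is a placeholder for the official judge prompt, which is not reproduced here):

from inspect_ai.scorer import model_graded_fact

OLYMPIC_JUDGE_PROMPT = "..."  # placeholder for the official Olympic judge prompt

def olympic_scorer(grader_model: str | None = None):
    # model_graded_fact grades CORRECT/INCORRECT, which yields accuracy/stderr metrics
    return model_graded_fact(template=OLYMPIC_JUDGE_PROMPT, model=grader_model)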

Research Format

  • Uses custom research_scorer() with the official FrontierScience Research judge prompt
  • Evaluates against detailed rubrics with point allocations (0-10 scale)
  • Returns normalized score (0-1 scale, with raw points in metadata)
  • Metric: mean
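The sketch below shows one way such a rubric scorer could be structured. It is an illustrative approximation, not the benchmark’s actual research_scorer; the "TOTAL: <points>" convention and the prompt wording are assumptions:

import re

from inspect_ai.model import get_model
from inspect_ai.scorer import Score, Target, mean, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[mean()])
def rubric_scorer(grader_model: str | None = None):
    async def score(state: TaskState, target: Target) -> Score:
        grader = get_model(grader_model)  # falls back to the active model if None
        prompt = (
            "Grade the answer against the rubric and end with 'TOTAL: <points>' (0-10).\n\n"
            f"Rubric:\n{target.text}\n\nAnswer:\n{state.output.completion}"
        )
        result = await grader.generate(prompt)
        match = re.search(r"TOTAL:\s*([\d.]+)", result.completion)
        points = float(match.group(1)) if match else 0.0
        return Score(
            value=min(points, 10.0) / 10.0,   # normalized to the 0-1 scale
            metadata={"raw_points": points},  # raw 0-10 rubric points
            explanation=result.completion,
        )

    return score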

Important: For fair comparison across different model evaluations, use the same grader_model for all runs. The original FrontierScience paper uses GPT-5 at high reasoning effort as the model judge.
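For example, the judge can be pinned from Python just like any other parameter (a minimal sketch; the model identifiers shown are only examples):

from inspect_ai import eval
from inspect_evals.frontierscience import frontierscience

# Keep the same judge across runs so scores stay comparable between models
eval(
    frontierscience(format="research", grader_model="openai/gpt-5"),
    model="anthropic/claude-sonnet-4-20250514",
)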

Evaluation Report

Validation Results (GPT-5.2)

Model            Format    Samples  Epochs  Score             Stderr  Time
openai/gpt-5.2   Olympic   15       3       60.0% (accuracy)  0.109   1m 32s
openai/gpt-5.2   Research  15       3       27.3% (mean)      -       9m 02s

These results were produced with the following commands:

# Olympic format
uv run inspect eval inspect_evals/frontierscience \
  --model openai/gpt-5.2 \
  -T format=olympic \
  -T grader_model=openai/gpt-5.2 \
  --limit 15 \
  --epochs 3

# Research format  
uv run inspect eval inspect_evals/frontierscience \
  --model openai/gpt-5.2 \
  -T format=research \
  -T grader_model=openai/gpt-5.2 \
  --limit 15 \
  --epochs 3

Validation Status:

  • Results are broadly consistent with OpenAI’s reported benchmarks
  • Olympic: Our 60.0% (3 trials) vs OpenAI’s 77.1% (20 trials)
  • Research: Our 27.3% (3 trials) vs OpenAI’s 25.2% (30 trials)

Model Coverage:

  • Validation performed on GPT-5.2 (the best-performing model in OpenAI’s paper) to gauge the upper end of reasoning-model performance
  • Contributions welcome: please add results for other models (GPT-4o, Claude, etc.) to this table if you run the eval

Technical Notes:

  • The OpenAI paper uses 20 independent trials for Olympic and 30 for Research; this validation uses 3 trials for both formats (sufficient for implementation verification)
  • Grader consistency: keep grader_model constant across all evaluations for fair model comparison; the original paper uses GPT-5 at high reasoning effort as the model judge

Implementation Details:

  • Olympic: Uses model_graded_fact() with the official FrontierScience Olympic judge prompt
  • Research: Uses custom research_scorer() with the official FrontierScience Research judge prompt
  • Temperature: Default (1.0) for compatibility with reasoning models
  • Version: 0.1.0

Known Limitations

Research Format Scoring

  • Heterogeneous rubric discretization: Research format rubrics have variable numbers of sub-items per question (some have 3 items, others have 10+), making cross-question score comparisons less meaningful.
  • Aggregated scores only: Current implementation returns a single aggregated score (0-10) per question rather than item-level scores for each rubric bullet point.
  • Statistical aggregation: Aggregating scores across epochs and judge samples may benefit from hierarchical modeling to properly account for the nested structure (items within questions, questions within subjects).

Future Work

The following improvements are planned for future iterations:

  1. Item-Level Scoring for Research Format
  • Return vector of scores for each rubric sub-item instead of single aggregate
  • Track total number of sub-questions and max points per item
  • Enable more granular analysis of model performance
  2. Multiple Model Scorer (Ensemble Judging)
  • Use multiple models (Gemini, OpenAI, Claude, etc.) as judges and aggregate judgments
  • Improve scoring consistency and reduce single-model bias
  3. Hierarchical Bayesian Modeling
  • Properly aggregate scores across variable rubric granularity
  • Account for heterogeneous discretization in Research format
  • More rigorous statistical treatment of multi-epoch, multi-judge scenarios
