LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

LAB-Bench 2 evaluates language models and research agents on life-science tasks spanning literature reasoning, database access, figures, tables, protocols, source quality, sequence analysis, cloning, patents, and clinical trials. A single parameterized task selects one or more dataset tags and a file-delivery mode, scoring each subset with an LLM judge or a deterministic validator, and can be run with a bare, server-side tools, or sandboxed agentic solver.

Maintained By

@iphan

Code

lab_bench_2

Paper

https://arxiv.org/abs/2604.09554

Overview

⚠️ External evaluation. Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.

Source: Generality-Labs/lab-bench@081864a

Usage

Installation

This is an externally-maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:

git clone https://github.com/Generality-Labs/lab-bench
cd lab-bench
git checkout 081864af494b180ecf6aae3f7333e384c0d227af
uv sync

Running evaluations

CLI

uv run inspect eval src/lab_bench_2/lab_bench_2.py@lab_bench_2 --model openai/gpt-5-nano

Python

from inspect_ai import eval
from lab_bench_2.lab_bench_2 import lab_bench_2

eval(lab_bench_2(), model="openai/gpt-5-nano")

View logs

uv run inspect view

More information

For the dataset, scorer, task parameters, and validation, see the upstream repo: Generality-Labs/lab-bench.

Options

You can control a variety of options from the command line. For example:

uv run inspect eval src/lab_bench_2/lab_bench_2.py@lab_bench_2 --limit 10
uv run inspect eval src/lab_bench_2/lab_bench_2.py@lab_bench_2 --max-connections 10
uv run inspect eval src/lab_bench_2/lab_bench_2.py@lab_bench_2 --temperature 0.5

See uv run inspect eval --help for all available options.

More command-line options: Inspect docs ↗

Evaluation Report

Timestamp: June 2026

Commit: 25113ef Version: 1-A

uv run inspect eval lab_bench_2 -T tags=<tag> -T mode=<mode> -T solver=<solver> --model openai/gpt-5.2

Model	Provider	Paper Acc	Accuracy	Stderr	Time
openai/gpt-5.2 (litqa3 · tools)	OpenAI	0.815	0.799	0.032	9m59s
openai/gpt-5.2 (dbqa2 · bare)	OpenAI	0.070	0.058	0.025	1m51s
openai/gpt-5.2 (cloning · inject · tools)	OpenAI	0.286	0.286	0.125	1h41m39s
openai/gpt-5.2 (figqa2-img · file · bare)	OpenAI	0.564	0.525	0.050	1m10s
openai/gpt-5.2 (seqqa2 · retrieve · bare)	OpenAI	0.095	0.115	0.023	2m31s
openai/gpt-5.2 (protocolqa2 · file · agentic)	OpenAI	0.416	0.447	0.045	6m18s

Notes:

Model under test: openai/gpt-5.2. LLM-judge tags (litqa3, dbqa2, figqa2-img, protocolqa2) are graded by anthropic/claude-sonnet-4-5; deterministic tags (seqqa2, cloning) use a deterministic scorer.
The paper_acc column is the reference accuracy from the original LAB-Bench 2 report, matched per row to the nearest paper run configuration. Each row runs its full set of mode-compatible samples.
Grader refusals are excluded as unscored rather than counted wrong: litqa3 accuracy/stderr are over 159 of 168 scored samples; protocolqa2 over 123 of 125. cloning’s low score is partly a refusal artifact — the model declined ~40% of the 14-question set on biosecurity grounds.
Dataset: EdisonScientific/labbench2, split train, pinned to revision 27d12d72af24e3f70db8a99df63e567366cbdb80. @tools,high rows use –reasoning-effort high.