LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
LAB-Bench 2 evaluates language models and research agents on life-science tasks spanning literature reasoning, database access, figures, tables, protocols, source quality, sequence analysis, cloning, patents, and clinical trials. A single parameterized task selects one or more dataset tags and a file-delivery mode, scoring each subset with an LLM judge or a deterministic validator, and can be run with a bare, server-side tools, or sandboxed agentic solver.
Overview
⚠️ External evaluation. Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.
Source: Generality-Labs/lab-bench@081864a
LAB-Bench 2 evaluates language models and research agents on life-science tasks spanning literature reasoning, database access, figures, tables, protocols, source quality, sequence analysis, cloning, patents, and clinical trials. A single parameterized task selects one or more dataset tags and a file-delivery mode, scoring each subset with an LLM judge or a deterministic validator, and can be run with a bare, server-side tools, or sandboxed agentic solver.
Usage
Installation
This is an externally-maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:
git clone https://github.com/Generality-Labs/lab-bench
cd lab-bench
git checkout 081864af494b180ecf6aae3f7333e384c0d227af
uv syncRunning evaluations
CLI
uv run inspect eval src/lab_bench_2/lab_bench_2.py@lab_bench_2 --model openai/gpt-5-nanoPython
from inspect_ai import eval
from lab_bench_2.lab_bench_2 import lab_bench_2
eval(lab_bench_2(), model="openai/gpt-5-nano")View logs
uv run inspect viewMore information
For the dataset, scorer, task parameters, and validation, see the upstream repo: Generality-Labs/lab-bench.
Options
You can control a variety of options from the command line. For example:
uv run inspect eval src/lab_bench_2/lab_bench_2.py@lab_bench_2 --limit 10
uv run inspect eval src/lab_bench_2/lab_bench_2.py@lab_bench_2 --max-connections 10
uv run inspect eval src/lab_bench_2/lab_bench_2.py@lab_bench_2 --temperature 0.5See uv run inspect eval --help for all available options.
More command-line options: Inspect docs ↗
Evaluation Report
Timestamp: June 2026
Commit: 25113ef Version: 1-A
uv run inspect eval lab_bench_2 -T tags=<tag> -T mode=<mode> -T solver=<solver> --model openai/gpt-5.2| Model | Provider | Paper Acc | Accuracy | Stderr | Time |
|---|---|---|---|---|---|
| openai/gpt-5.2 (litqa3 · tools) | OpenAI | 0.815 | 0.799 | 0.032 | 9m59s |
| openai/gpt-5.2 (dbqa2 · bare) | OpenAI | 0.070 | 0.058 | 0.025 | 1m51s |
| openai/gpt-5.2 (cloning · inject · tools) | OpenAI | 0.286 | 0.286 | 0.125 | 1h41m39s |
| openai/gpt-5.2 (figqa2-img · file · bare) | OpenAI | 0.564 | 0.525 | 0.050 | 1m10s |
| openai/gpt-5.2 (seqqa2 · retrieve · bare) | OpenAI | 0.095 | 0.115 | 0.023 | 2m31s |
| openai/gpt-5.2 (protocolqa2 · file · agentic) | OpenAI | 0.416 | 0.447 | 0.045 | 6m18s |
Notes:
- Model under test: openai/gpt-5.2. LLM-judge tags (litqa3, dbqa2, figqa2-img, protocolqa2) are graded by anthropic/claude-sonnet-4-5; deterministic tags (seqqa2, cloning) use a deterministic scorer.
- The
paper_acccolumn is the reference accuracy from the original LAB-Bench 2 report, matched per row to the nearest paper run configuration. Each row runs its full set of mode-compatible samples. - Grader refusals are excluded as unscored rather than counted wrong: litqa3 accuracy/stderr are over 159 of 168 scored samples; protocolqa2 over 123 of 125. cloning’s low score is partly a refusal artifact — the model declined ~40% of the 14-question set on biosecurity grounds.
- Dataset: EdisonScientific/labbench2, split train, pinned to revision 27d12d72af24e3f70db8a99df63e567366cbdb80. @tools,high rows use –reasoning-effort high.