scBench: A Benchmark for Single-Cell RNA-seq Analysis

Coding
Agent

Evaluates whether models can solve practical single-cell RNA-seq analysis tasks with deterministic grading. Tasks require empirical interaction with .h5ad data files — agents must load and analyze the data to produce correct answers. Covers 30 canonical tasks across 5 sequencing platforms and 7 task categories.

Overview

scBench evaluates whether models can solve practical single-cell RNA-seq analysis tasks with deterministic grading. Tasks require empirical interaction with .h5ad data files — agents must load and analyze the data to produce correct answers.

This implementation ships the public canonical subset (30 tasks across 5 platforms) from the upstream latchbio/scbench repository.

The full 394-task benchmark is not publicly released, so this eval focuses on the reproducible public slice.

Usage

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals[scbench]

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync --extra scbench

Running evaluations

Now you can start evaluating models. For simplicity, this section assumes you are using Inspect Evals from the standalone repository. If you installed it from PyPI instead and are not using uv to manage dependencies in your own project, use the same commands with the uv run prefix dropped.

uv run inspect eval inspect_evals/scbench --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval
from inspect_evals.scbench import scbench
eval(scbench)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also install the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/scbench --limit 10
uv run inspect eval inspect_evals/scbench --max-connections 10
uv run inspect eval inspect_evals/scbench --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

scbench

  • agent (ScBenchAgent | None): Agent harness to use, matching the original scBench CLI agent_registry. "minisweagent" reproduces the paper’s published results. "claudecode" and "openaicodex" use their respective CLI tools. Requires inspect-swe. When None (default), uses minisweagent if inspect-swe is installed, otherwise falls back to a react()-based solver. (default: None)
  • solver (Agent | None): Custom solver override. When set, agent is ignored. (default: None)
  • scorer (Scorer | None): Custom scorer. (default: None)
  • platforms (ScBenchPlatform | list[ScBenchPlatform] | None): Optional platform filter (single value or list). (default: None)
  • task_categories (ScBenchTaskCategory | list[ScBenchTaskCategory] | None): Optional normalized task-category filter. (default: None)
  • shuffle (bool): Whether to shuffle sample order before evaluation. (default: True)
  • sandbox (SandboxEnvironmentType): Sandbox environment for code execution. (default: ('docker', 'src/inspect_evals/scbench/compose.yaml'))
  • message_limit (int | None): Maximum number of messages in the agent loop. None (default) lets the agent manage its own loop termination. (default: None)
  • tool_timeout (int): Timeout in seconds for bash tool calls (only used by the react() fallback solver). (default: 300)
  • timeout (int): Task-level timeout in seconds. (default: 600)
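
These parameters can also be set from Python when constructing the task. The snippet below is only a sketch: the literal platform and task-category strings are assumptions based on the Dataset section, and the exact accepted values may differ.

from inspect_ai import eval
from inspect_evals.scbench import scbench

# Run only Chromium clustering tasks with a longer per-sample time limit.
# "Chromium" and "clustering" are assumed identifiers; see the Dataset section.
eval(
    scbench(
        platforms="Chromium",
        task_categories="clustering",
        shuffle=False,
        timeout=1200,
    ),
    model="openai/gpt-5-nano",
)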

Architecture

This evaluation uses an agentic solver running inside a Docker sandbox, matching the original scBench design where agents must empirically analyze data.

Agent harnesses

The agent parameter selects the agent harness via inspect-swe, which runs the actual upstream agent binary inside the Docker sandbox and proxies model API calls back through Inspect:

  • minisweagent (inspect_swe solver: mini_swe_agent()): The mini-SWE-agent harness used for the paper’s published results. Bash-only, with fenced code-block extraction and completion-marker detection; no submit() tool.
  • claudecode (inspect_swe solver: claude_code()): Claude Code CLI with its built-in tools and system prompt.
  • openaicodex (inspect_swe solver: codex_cli()): Codex CLI with its built-in tools and system prompt.

When inspect-swe is not installed, a fallback solver uses react() with bash() and submit() tools.
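
For reference, a minimal sketch of such a fallback loop built from Inspect's standard pieces is shown below; the actual fallback solver in this eval may use a different prompt and configuration.

from inspect_ai import eval
from inspect_ai.agent import react
from inspect_ai.tool import bash
from inspect_evals.scbench import scbench

# A react() agent with a bash tool; react() provides a submit() tool by default.
# This mirrors the fallback described above, but is only an illustrative sketch.
fallback = react(tools=[bash(timeout=300)])

# Pass it as a custom solver override via the "solver" task parameter.
eval(scbench(solver=fallback), model="openai/gpt-5-nano")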

Other components

  • Sandbox: Docker container with scanpy, anndata, matplotlib, and the scientific Python stack, plus uv and Node.js for agent installation
  • Data: .h5ad files downloaded into the sandbox at runtime via a setup solver (_setup_sample_data) that fetches from HuggingFace
  • Scoring: Deterministic graders read answers exclusively from eval_answer.json written by the agent
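
To illustrate the kind of empirical work this setup supports, the snippet below shows a typical scanpy interaction with an .h5ad file inside such a sandbox. The file name is hypothetical and the required analysis varies per task.

import scanpy as sc

# Hypothetical example of what an agent might run inside the sandbox;
# the actual file name and task depend on the sample.
adata = sc.read_h5ad("sample.h5ad")
sc.pp.filter_cells(adata, min_genes=200)      # basic QC filter
sc.pp.normalize_total(adata, target_sum=1e4)  # library-size normalization
print(adata.n_obs, adata.n_vars)              # cells x genes after filtering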

Dataset download and timeouts

Each sample’s .h5ad data file is downloaded from HuggingFace into the sandbox at the start of the sample, before the agent begins working. However, the task-level time_limit (default 600s) includes this download time. On a cold cache, downloads average ~4 minutes and can take up to ~10 minutes for larger files, which may exhaust the timeout before the agent starts.

To avoid this, increase the timeout when running the evaluation:

uv run inspect eval inspect_evals/scbench --time-limit 1200

Once files are cached locally by HuggingFace, subsequent runs will not be affected.

Dataset

The bundled dataset is sourced from scBench canonical JSON files under:

  • src/inspect_evals/scbench/data/evals_canonical

It contains 30 deterministic eval definitions covering:

  • Platforms: Chromium, CSGenetics, Illumina, MissionBio, ParseBio
  • Task categories: QC, normalization, dimensionality reduction, clustering, cell typing, differential expression, trajectory analysis

Each sample includes:

  • A natural-language task prompt requiring data analysis
  • A data_node reference (Latch URIs mapped to public downloads via data_manifest.py)
  • A deterministic grader config

Scoring

This implementation reproduces the deterministic grader families used by scBench:

  • numeric_tolerance
  • multiple_choice
  • marker_gene_precision_recall
  • label_set_jaccard
  • distribution_comparison

Answers are extracted exclusively from eval_answer.json written by the agent in the sandbox working directory, matching the original harness behavior. No text-parsing fallback is used.
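
As an illustration of this answer-file contract, the sketch below shows a numeric_tolerance-style check. The eval_answer.json name matches the text above, but the JSON schema (a single "answer" key) and the example values are assumptions for illustration, not the actual grader configuration format.

import json

def check_numeric_tolerance(answer_path: str, expected: float, tolerance: float) -> bool:
    """Illustrative sketch: read the agent's answer file and compare numerically."""
    with open(answer_path) as f:
        payload = json.load(f)
    value = float(payload["answer"])  # assumed schema; the real graders may differ
    return abs(value - expected) <= tolerance

# e.g. check_numeric_tolerance("eval_answer.json", expected=2000.0, tolerance=10.0)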

Reported metrics are:

  • accuracy
  • stderr

Evaluation Report

Evaluation version: 1-A

Model                                   Accuracy   Stderr
openai/gpt-5.1-2025-11-13               43.3%      9.2%
google/gemini-3-pro-preview             36.7%      8.9%
anthropic/claude-sonnet-4-5-20250929    30.0%      8.5%

These results were produced with:

uv run inspect eval inspect_evals/scbench \
  --model openai/gpt-5.1-2025-11-13,anthropic/claude-sonnet-4-5-20250929,google/gemini-3-pro-preview \
  -T shuffle=True --max-tasks 3

Reproducing Paper Results

The scBench paper runs each model × evaluation pair 3 times and averages outcomes with t-distribution confidence intervals. To reproduce this statistical design:

uv run inspect eval inspect_evals/scbench --model <model> --epochs 3

The paper’s published results use the full 394-task benchmark under the mini-SWE-agent harness (-T agent=minisweagent). Scores on the 30-task canonical subset are expected to differ.
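
Equivalently from Python, epochs can be passed to eval() (a sketch; the model string is a placeholder):

from inspect_ai import eval
from inspect_evals.scbench import scbench

# Run each sample 3 times with the paper's harness (requires inspect-swe);
# Inspect reduces epoch scores (mean by default).
eval(scbench(agent="minisweagent"), model="openai/gpt-5-nano", epochs=3)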

Known Differences from Original Harness

Without inspect-swe (fallback solver)

The react()-based fallback introduces additional differences:

  • Agent loop: react() with bash() and submit() tools differs from mini-SWE-agent, which extracts fenced code blocks via regex.
  • Continue/retry messages: The fallback injects user-turn messages that the original harness never produces.
  • Tool interface: The model sees an Inspect bash() tool definition instead of generating raw fenced code blocks.