scBench: A Benchmark for Single-Cell RNA-seq Analysis

Coding
Agent

Evaluates whether models can solve practical single-cell RNA-seq analysis tasks with deterministic grading. Tasks require empirical interaction with .h5ad data files — agents must load and analyze the data to produce correct answers. Covers 30 canonical tasks across 5 sequencing platforms and 7 task categories.

Overview

scBench evaluates whether models can solve practical single-cell RNA-seq analysis tasks with deterministic grading. Tasks require empirical interaction with .h5ad data files — agents must load and analyze the data to produce correct answers.

This implementation ships the public canonical subset (30 tasks across 5 platforms) from the upstream latchbio/scbench repository.

The full 394-task benchmark is not publicly released, so this eval focuses on the reproducible public slice.

Usage

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals[scbench]

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync --extra scbench

Running evaluations

Now you can start evaluating models. For simplicity, this section assumes you are using Inspect Evals from the standalone repository. If you installed it from PyPI instead and are not using uv to manage dependencies in your own project, use the same commands with the uv run prefix dropped.

uv run inspect eval inspect_evals/scbench --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval
from inspect_evals.scbench import scbench
eval(scbench)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also install the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/scbench --limit 10
uv run inspect eval inspect_evals/scbench --max-connections 10
uv run inspect eval inspect_evals/scbench --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

scbench

  • agent (ScBenchAgent | None): Agent harness to use, matching the original scBench CLI agent_registry. "minisweagent" reproduces the paper’s published results. "claudecode" and "openaicodex" use their respective CLI tools. Requires inspect-swe. When None (default), uses minisweagent if inspect-swe is installed, otherwise falls back to a react()-based solver. (default: None)
  • solver (Agent | None): Custom solver override. When set, agent is ignored. (default: None)
  • scorer (Scorer | None): Custom scorer. (default: None)
  • platforms (ScBenchPlatform | list[ScBenchPlatform] | None): Optional platform filter (single value or list). (default: None)
  • task_categories (ScBenchTaskCategory | list[ScBenchTaskCategory] | None): Optional normalized task-category filter. (default: None)
  • shuffle (bool): Whether to shuffle sample order before evaluation. (default: True)
  • sandbox (SandboxEnvironmentType): Sandbox environment for code execution. (default: ('docker', 'src/inspect_evals/scbench/compose.yaml'))
  • message_limit (int | None): Maximum number of messages in the agent loop. None (default) lets the agent manage its own loop termination. (default: None)
  • tool_timeout (int): Timeout in seconds for bash tool calls (only used by the react() fallback solver). (default: 300)
  • timeout (int): Task-level timeout in seconds. (default: 600)
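
These parameters can also be set from Python when constructing the task. The snippet below is only a sketch: the literal platform and task-category strings are assumptions based on the Dataset section, and the exact accepted values may differ.

from inspect_ai import eval
from inspect_evals.scbench import scbench

# Run only Chromium clustering tasks with a longer per-sample time limit.
# "Chromium" and "clustering" are assumed identifiers; see the Dataset section.
eval(
    scbench(
        platforms="Chromium",
        task_categories="clustering",
        shuffle=False,
        timeout=1200,
    ),
    model="openai/gpt-5-nano",
)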

Architecture

This evaluation uses an agentic solver running inside a Docker sandbox, matching the original scBench design where agents must empirically analyze data.

Agent harnesses

The agent parameter selects the agent harness via inspect-swe, which runs the actual upstream agent binary inside the Docker sandbox and proxies model API calls back through Inspect:

  • minisweagent (inspect_swe solver: mini_swe_agent()): The mini-SWE-agent harness used for the paper’s published results. Bash-only, with fenced code-block extraction and completion-marker detection; no submit() tool.
  • claudecode (inspect_swe solver: claude_code()): Claude Code CLI with its built-in tools and system prompt.
  • openaicodex (inspect_swe solver: codex_cli()): Codex CLI with its built-in tools and system prompt.

When inspect-swe is not installed, a fallback solver uses react() with bash() and submit() tools.
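
For reference, a minimal sketch of such a fallback loop built from Inspect's standard pieces is shown below; the actual fallback solver in this eval may use a different prompt and configuration.

from inspect_ai import eval
from inspect_ai.agent import react
from inspect_ai.tool import bash
from inspect_evals.scbench import scbench

# A react() agent with a bash tool; react() provides a submit() tool by default.
# This mirrors the fallback described above, but is only an illustrative sketch.
fallback = react(tools=[bash(timeout=300)])

# Pass it as a custom solver override via the "solver" task parameter.
eval(scbench(solver=fallback), model="openai/gpt-5-nano")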

Other components

  • Sandbox: Docker container with scanpy, anndata, matplotlib, and the scientific Python stack, plus uv and Node.js for agent installation
  • Data: .h5ad files downloaded into the sandbox at runtime via a setup solver (_setup_sample_data) that fetches from HuggingFace
  • Scoring: Deterministic graders read answers exclusively from eval_answer.json written by the agent
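
To illustrate the kind of empirical work this setup supports, the snippet below shows a typical scanpy interaction with an .h5ad file inside such a sandbox. The file name is hypothetical and the required analysis varies per task.

import scanpy as sc

# Hypothetical example of what an agent might run inside the sandbox;
# the actual file name and task depend on the sample.
adata = sc.read_h5ad("sample.h5ad")
sc.pp.filter_cells(adata, min_genes=200)      # basic QC filter
sc.pp.normalize_total(adata, target_sum=1e4)  # library-size normalization
print(adata.n_obs, adata.n_vars)              # cells x genes after filtering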

Dataset download and timeouts

Each sample’s .h5ad data file is downloaded from HuggingFace into the sandbox at the start of the sample, before the agent begins working. However, the task-level time_limit (default 600s) includes this download time. On a cold cache, downloads average ~4 minutes and can take up to ~10 minutes for larger files, which may exhaust the timeout before the agent starts.

To avoid this, increase the timeout when running the evaluation:

uv run inspect eval inspect_evals/scbench --time-limit 1200

Once files are cached locally by HuggingFace, subsequent runs will not be affected.

Dataset

The bundled dataset is sourced from scBench canonical JSON files under:

  • src/inspect_evals/scbench/data/evals_canonical

It contains 30 deterministic eval definitions covering:

  • Platforms: Chromium, CSGenetics, Illumina, MissionBio, ParseBio
  • Task categories: QC, normalization, dimensionality reduction, clustering, cell typing, differential expression, trajectory analysis

Each sample includes:

  • A natural-language task prompt requiring data analysis
  • A data_node reference (Latch URIs mapped to public downloads via data_manifest.py)
  • A deterministic grader config

Scoring

This implementation reproduces the deterministic grader families used by scBench:

  • numeric_tolerance
  • multiple_choice
  • marker_gene_precision_recall
  • label_set_jaccard
  • distribution_comparison

Answers are extracted exclusively from eval_answer.json written by the agent in the sandbox working directory, matching the original harness behavior. No text-parsing fallback is used.
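
As an illustration of this answer-file contract, the sketch below shows a numeric_tolerance-style check. The eval_answer.json name matches the text above, but the JSON schema (a single "answer" key) and the example values are assumptions for illustration, not the actual grader configuration format.

import json

def check_numeric_tolerance(answer_path: str, expected: float, tolerance: float) -> bool:
    """Illustrative sketch: read the agent's answer file and compare numerically."""
    with open(answer_path) as f:
        payload = json.load(f)
    value = float(payload["answer"])  # assumed schema; the real graders may differ
    return abs(value - expected) <= tolerance

# e.g. check_numeric_tolerance("eval_answer.json", expected=2000.0, tolerance=10.0)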

Reported metrics are:

  • accuracy
  • stderr

Evaluation Report

Evaluation version: 1-A

Model                                   Accuracy   Stderr
openai/gpt-5.1-2025-11-13               43.3%      9.2%
google/gemini-3-pro-preview             36.7%      8.9%
anthropic/claude-sonnet-4-5-20250929    30.0%      8.5%

These results were produced with:

uv run inspect eval inspect_evals/scbench \
  --model openai/gpt-5.1-2025-11-13,anthropic/claude-sonnet-4-5-20250929,google/gemini-3-pro-preview \
  -T shuffle=True --max-tasks 3

Reproducing Paper Results

The scBench paper runs each model × evaluation pair 3 times and averages outcomes with t-distribution confidence intervals. To reproduce this statistical design:

uv run inspect eval inspect_evals/scbench --model <model> --epochs 3

The paper’s published results use the full 394-task benchmark under the mini-SWE-agent harness (-T agent=minisweagent). Scores on the 30-task canonical subset are expected to differ.
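
Equivalently from Python, epochs can be passed to eval() (a sketch; the model string is a placeholder):

from inspect_ai import eval
from inspect_evals.scbench import scbench

# Run each sample 3 times with the paper's harness (requires inspect-swe);
# Inspect reduces epoch scores (mean by default).
eval(scbench(agent="minisweagent"), model="openai/gpt-5-nano", epochs=3)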

Known Differences from Original Harness

Without inspect-swe (fallback solver)

The react()-based fallback introduces additional differences:

  • Agent loop: react() with bash() and submit() tools differs from mini-SWE-agent, which extracts fenced code blocks via regex.
  • Continue/retry messages: The fallback injects user-turn messages that the original harness never produces.
  • Tool interface: The model sees an Inspect bash() tool definition instead of generating raw fenced code blocks.