# scBench: A Benchmark for Single-Cell RNA-seq Analysis
Evaluates whether models can solve practical single-cell RNA-seq analysis tasks with deterministic grading. Tasks require empirical interaction with .h5ad data files — agents must load and analyze the data to produce correct answers. Covers 30 canonical tasks across 5 sequencing platforms and 7 task categories.
## Overview
This implementation ships the public canonical subset (30 tasks across 5 platforms) from the upstream latchbio/scbench repository.
The full 394-task benchmark is not publicly released, so this eval focuses on the reproducible public slice.
## Usage
### Installation
There are two ways to use Inspect Evals: install it from PyPI as a dependency of your own project, or work from a standalone checkout of the GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

```bash
pip install inspect-evals[scbench]
```

If you are using Inspect Evals from its repository, start by installing the necessary dependencies with:
```bash
uv sync --extra scbench
```

### Running evaluations
Now you can start evaluating models. For simplicity's sake, this section assumes you are using Inspect Evals from the standalone repo. If that's not the case and you are not using uv to manage dependencies in your own project, you can run the same commands without the `uv run` prefix.
```bash
uv run inspect eval inspect_evals/scbench --model openai/gpt-5-nano
```

You can also import tasks as normal Python objects and run them from Python:
```python
from inspect_ai import eval
from inspect_evals.scbench import scbench

eval(scbench)
```

After running evaluations, you can view their logs using the `inspect view` command:
```bash
uv run inspect view
```

For VS Code, you can also install the Inspect AI extension for viewing logs.
If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
```bash
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
```

### Options
You can control a variety of options from the command line. For example:
```bash
uv run inspect eval inspect_evals/scbench --limit 10
uv run inspect eval inspect_evals/scbench --max-connections 10
uv run inspect eval inspect_evals/scbench --temperature 0.5
```

See `uv run inspect eval --help` for all available options.
## Parameters
### scbench
- `agent` (`ScBenchAgent | None`): Agent harness to use, matching the original scBench CLI agent registry. `"minisweagent"` reproduces the paper's published results; `"claudecode"` and `"openaicodex"` use their respective CLI tools. Requires `inspect-swe`. When `None` (default), uses `minisweagent` if `inspect-swe` is installed, otherwise falls back to a `react()`-based solver. (default: `None`)
- `solver` (`Agent | None`): Custom solver override. When set, `agent` is ignored. (default: `None`)
- `scorer` (`Scorer | None`): Custom scorer. (default: `None`)
- `platforms` (`ScBenchPlatform | list[ScBenchPlatform] | None`): Optional platform filter (single value or list). (default: `None`)
- `task_categories` (`ScBenchTaskCategory | list[ScBenchTaskCategory] | None`): Optional normalized task-category filter. (default: `None`)
- `shuffle` (`bool`): Whether to shuffle sample order before evaluation. (default: `True`)
- `sandbox` (`SandboxEnvironmentType`): Sandbox environment for code execution. (default: `('docker', 'src/inspect_evals/scbench/compose.yaml')`)
- `message_limit` (`int | None`): Maximum number of messages in the agent loop. `None` (default) lets the agent manage its own loop termination. (default: `None`)
- `tool_timeout` (`int`): Timeout in seconds for bash tool calls (only used by the `react()` fallback solver). (default: `300`)
- `timeout` (`int`): Task-level timeout in seconds. (default: `600`)
## Architecture
This evaluation uses an agentic solver running inside a Docker sandbox, matching the original scBench design where agents must empirically analyze data.
### Agent harnesses
The `agent` parameter selects the agent harness via `inspect-swe`, which runs the actual upstream agent binary inside the Docker sandbox and proxies model API calls back through Inspect:
| Agent | inspect_swe solver | Description |
|---|---|---|
| `minisweagent` | `mini_swe_agent()` | The mini-SWE-agent harness used for the paper's published results. Bash-only, code-block extraction, completion marker detection; no `submit()` tool. |
| `claudecode` | `claude_code()` | Claude Code CLI with built-in tools and system prompt. |
| `openaicodex` | `codex_cli()` | Codex CLI with built-in tools and system prompt. |
When `inspect-swe` is not installed, a fallback solver uses `react()` with `bash()` and `submit()` tools.
### Other components
- Sandbox: Docker container with scanpy, anndata, matplotlib, and the scientific Python stack, plus `uv` and Node.js for agent installation
- Data: `.h5ad` files downloaded into the sandbox at runtime via a setup solver (`_setup_sample_data`) that fetches from HuggingFace
- Scoring: Deterministic graders read answers exclusively from `eval_answer.json` written by the agent
## Dataset download and timeouts
Each sample's `.h5ad` data file is downloaded from HuggingFace into the sandbox at the start of the sample, before the agent begins working. However, the task-level `time_limit` (default 600s) includes this download time. On a cold cache, downloads average ~4 minutes and can take up to ~10 minutes for larger files, which may exhaust the timeout before the agent starts.
To avoid this, increase the timeout when running the evaluation:
```bash
uv run inspect eval inspect_evals/scbench --time-limit 1200
```

Once files are cached locally by HuggingFace, subsequent runs will not be affected.
## Dataset
The bundled dataset is sourced from scBench canonical JSON files under:
```
src/inspect_evals/scbench/data/evals_canonical
```
It contains 30 deterministic eval definitions covering:
- Platforms: Chromium, CSGenetics, Illumina, MissionBio, ParseBio
- Task categories: QC, normalization, dimensionality reduction, clustering, cell typing, differential expression, trajectory analysis
Each sample includes:
- A natural-language task prompt requiring data analysis
- A `data_node` reference (Latch URIs mapped to public downloads via `data_manifest.py`)
- A deterministic grader config
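For a flavor of what the task categories involve, a QC task typically asks for per-cell summary statistics such as the fraction of counts from mitochondrial genes. A toy, pure-Python sketch of that metric (real tasks operate on the `.h5ad` files with scanpy/anndata; the gene names and counts below are made up for illustration):

```python
# Toy count matrix: rows = cells, columns = genes (made-up values).
genes = ["MT-CO1", "MT-ND1", "CD3D", "MS4A1"]
counts = [
    [5, 3, 40, 2],    # cell 0
    [20, 10, 5, 15],  # cell 1
]

# QC metric: fraction of each cell's counts coming from mitochondrial genes,
# identified here by the conventional "MT-" name prefix.
mt_cols = [i for i, g in enumerate(genes) if g.startswith("MT-")]
pct_mt = [sum(row[i] for i in mt_cols) / sum(row) for row in counts]
print(pct_mt)  # [0.16, 0.6]
```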
## Scoring
This implementation reproduces the deterministic grader families used by scBench:
- `numeric_tolerance`
- `multiple_choice`
- `marker_gene_precision_recall`
- `label_set_jaccard`
- `distribution_comparison`
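As an illustration of what two of these families reduce to (a sketch only; the actual grader implementations, tolerances, and thresholds come from each task's grader config):

```python
def numeric_tolerance(answer: float, expected: float, tol: float) -> bool:
    # Pass when the agent's numeric answer is within an absolute tolerance.
    return abs(answer - expected) <= tol

def label_set_jaccard(answer: set[str], expected: set[str]) -> float:
    # Jaccard similarity between predicted and expected label sets.
    union = answer | expected
    return len(answer & expected) / len(union) if union else 1.0

print(numeric_tolerance(0.42, 0.40, tol=0.05))  # True
print(label_set_jaccard({"T cell", "B cell"}, {"T cell", "NK cell"}))
```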
Answers are extracted exclusively from `eval_answer.json` written by the agent in the sandbox working directory, matching the original harness behavior. No text-parsing fallback is used.
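A minimal sketch of this handoff, assuming a simple single-key payload (the exact JSON schema is task-specific and defined by each grader config, so the `"answer"` key here is an illustrative assumption, not the real schema):

```python
import json
from pathlib import Path

# Hypothetical final step of an agent run: write the answer file that the
# deterministic grader reads. The {"answer": ...} shape is an assumption
# for illustration; real tasks define their own payload.
answer_file = Path("eval_answer.json")
answer_file.write_text(json.dumps({"answer": 0.42}))

# Grader side: read the file back directly -- no parsing of model text output.
result = json.loads(answer_file.read_text())
print(result["answer"])  # 0.42
```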
Reported metrics are:
- `accuracy`
- `stderr`
## Evaluation Report
Evaluation version: 1-A
| Model | Accuracy | Stderr |
|---|---|---|
| openai/gpt-5.1-2025-11-13 | 43.3% | 9.2% |
| google/gemini-3-pro-preview | 36.7% | 8.9% |
| anthropic/claude-sonnet-4-5-20250929 | 30.0% | 8.5% |
```bash
uv run inspect eval inspect_evals/scbench \
  --model openai/gpt-5.1-2025-11-13,anthropic/claude-sonnet-4-5-20250929,google/gemini-3-pro-preview \
  -T shuffle=True --max-tasks 3
```

## Reproducing Paper Results
The scBench paper runs each model × evaluation pair 3 times and averages outcomes with t-distribution confidence intervals. To reproduce this statistical design:
```bash
uv run inspect eval inspect_evals/scbench --model <model> --epochs 3
```

The paper's published results use the full 394-task benchmark under the mini-SWE-agent harness (`-T agent=minisweagent`). Scores on the 30-task canonical subset are expected to differ.
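The paper's averaging over runs can also be reproduced outside Inspect. A minimal sketch of a 95% t-distribution confidence interval over three per-run accuracies (the accuracy values are illustrative, not real results; 4.303 is the standard t critical value at 97.5% with 2 degrees of freedom):

```python
import math
from statistics import mean, stdev

# Accuracies from three independent runs (illustrative numbers only).
runs = [0.30, 0.36, 0.33]

n = len(runs)
m = mean(runs)
se = stdev(runs) / math.sqrt(n)  # standard error of the mean
t_crit = 4.303                   # t_{0.975, df=2}, i.e. 95% CI for n=3
ci = (m - t_crit * se, m + t_crit * se)
print(f"{m:.3f} +/- {t_crit * se:.3f}")
```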
## Known Differences from Original Harness
### With `inspect-swe` installed (recommended)
When using `agent=minisweagent` (the default with `inspect-swe`), the actual upstream mini-SWE-agent binary runs inside the sandbox, providing close parity with the original harness. Remaining differences:
| Area | Difference |
|---|---|
| Execution environment | Docker sandbox (reproducible, isolated) vs. original’s bare-metal LocalEnvironment. Docker provides resource limits (4 CPUs, 16 GB RAM) that the original does not impose. |
| Output truncation | Inspect truncates tool output at 16 KiB by default. Original uses 1 MB total cap with 1000 char/line truncation. |
| Timeout semantics | Inspect’s task timeout may allow in-flight tool calls to complete before enforcement, whereas the original uses a hard SIGALRM wall-clock cutoff. |
| Network access | The custom compose.yaml allows network access (matching the paper). Switching to auto-generated compose configs may restrict network and break package installation. |
### Without `inspect-swe` (fallback solver)
The react()-based fallback introduces additional differences:
| Area | Difference |
|---|---|
| Agent loop | react() with bash() and submit() tools differs from mini-SWE-agent which extracts fenced code blocks via regex. |
| Continue/retry messages | The fallback injects user-turn messages that the original harness never produces. |
| Tool interface | The model sees an Inspect bash() tool definition instead of generating raw fenced code blocks. |