PaperBench: Evaluating AI’s Ability to Replicate AI Research (Work In Progress)

Tags: Coding, Agent, AI R&D

Agents are evaluated on their ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch. Given a research paper PDF, an addendum with clarifications, and a rubric defining evaluation criteria, the agent must reproduce the paper’s key results by writing and executing code.

Note: This eval is a work in progress. See https://github.com/UKGovernmentBEIS/inspect_evals/issues/334 for status.

Overview

PaperBench evaluates AI agents on their ability to reproduce the results of machine learning research papers.

Paper: https://arxiv.org/abs/2504.01848

Original implementation: https://github.com/openai/frontier-evals/tree/main/project/paperbench

The instructions.txt file is copied from the openai/frontier-evals repository as of December 2025.

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package via:

pip install inspect_evals

If you are using Inspect Evals in its repository, start by installing the dependencies with:

uv sync

Usage

PaperBench uses two tasks: paperbench runs the agent to reproduce papers, and paperbench_score scores the submissions in a fresh sandbox.

# Step 1: Run agent on papers (saves submissions to ./submissions/)
inspect eval inspect_evals/paperbench --model <model>

# Step 2: Score submissions in fresh sandbox
inspect eval inspect_evals/paperbench_score --model openai/o3-mini-2025-01-31 -T judge_type=simple

Both tasks use ./submissions/ by default, so they work together automatically. This two-task design ensures reproduce.sh runs in a fresh sandbox, matching the original PaperBench implementation’s approach of testing reproduction in a clean environment.

Models

PaperBench involves multiple models at different stages:

Model        | Stage                 | Purpose
Solver model | paperbench task       | The agent that reads the paper and attempts to reproduce it by writing code
Judge model  | paperbench_score task | Evaluates submissions against the rubric; used for (1) ranking relevant files and (2) scoring each criterion

The --model flag sets the model for each task: the solver model for paperbench, and the judge model for paperbench_score.

Important: To compare results across different solver models, use the same judge model for scoring. The original PaperBench paper uses openai/o3-mini-2025-01-31 as the judge.
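
For example, to compare two solver models while holding the judge fixed, the runs might look like this (the solver model names and submission directories below are placeholders):

# Solver model A: run, then score with the fixed judge
inspect eval inspect_evals/paperbench --model <solver-model-a> -T submissions_dir=./submissions-a
inspect eval inspect_evals/paperbench_score --model openai/o3-mini-2025-01-31 -T judge_type=simple -T submissions_dir=./submissions-a

# Solver model B: same judge, separate submissions directory
inspect eval inspect_evals/paperbench --model <solver-model-b> -T submissions_dir=./submissions-b
inspect eval inspect_evals/paperbench_score --model openai/o3-mini-2025-01-31 -T judge_type=simple -T submissions_dir=./submissions-b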

Task Options

# Run on dev split (3 papers for development/testing)
inspect eval inspect_evals/paperbench --model <model> -T split=dev

# Run on specific papers
inspect eval inspect_evals/paperbench --model <model> -T paper_ids=adaptive-pruning
inspect eval inspect_evals/paperbench --model <model> -T paper_ids='["adaptive-pruning", "robust-clip"]'

# Task-level limits
inspect eval inspect_evals/paperbench --model <model> \
  -T max_time_in_hours=12 \
  -T message_limit=50 \
  -T token_limit=1000000

# Custom submissions directory
inspect eval inspect_evals/paperbench --model <model> -T submissions_dir=./my-submissions
inspect eval inspect_evals/paperbench_score --model <model> -T submissions_dir=./my-submissions

# Solver options (configure the default solver)
inspect eval inspect_evals/paperbench --model <model> \
  --solver inspect_evals/paperbench_solver \
  -S max_attempts=3 \
  -S tool_timeout=600

# Scoring task options
inspect eval inspect_evals/paperbench_score --model <model> \
  -T reproduce_timeout_hours=12 \
  -T retry_threshold_minutes=5

# Sandbox type (both tasks support this)
inspect eval inspect_evals/paperbench --model <model> -T sandbox_type=docker
inspect eval inspect_evals/paperbench_score --model <model> -T sandbox_type=docker

Splits

Split | Papers | Description
None  | 23     | All available papers (default)
prod  | 20     | Papers with results reported in the PaperBench paper
dev   | 3      | Development papers for testing
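
For example, to run only the 20 papers with reported results, select the prod split the same way the dev split is selected above:

inspect eval inspect_evals/paperbench --model <model> -T split=prod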

Pre-downloading Dataset

To download all paper files to cache before running the eval:

from inspect_evals.paperbench.dataset import paperbench_dataset
paperbench_dataset()  # Downloads all 23 papers to cache

Files are cached in ~/.cache/inspect_evals/paperbench/ (Linux/macOS) by default.
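
The same pre-download can be run from the shell as a one-liner, if that is more convenient:

python -c "from inspect_evals.paperbench.dataset import paperbench_dataset; paperbench_dataset()"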

Structure

paperbench/
├── dataset.py                 # HuggingFace dataset loader
├── images/
│   ├── Dockerfile             # Sandbox environment
│   ├── compose.yaml
│   └── agent.env.example      # Template for agent.env (API keys)
├── instructions.txt           # Agent instructions template
├── score/                     # Scoring logic
├── paperbench.py              # Inspect Task definition
└── solvers.py                 # Default agent solver

Paper files (PDF, addendum, rubric, etc.) are automatically downloaded from HuggingFace and cached locally.

Sandbox Environment

The agent runs in a Docker container with the following (see the sketch after this list):

  • Python 3.12 (Debian-based)
  • Basic tools: git, curl, wget, build-essential, ffmpeg
  • Working directories: /home/paper (read-only inputs), /home/submission (agent output)
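
A few illustrative commands the agent might run first to orient itself, using the paths and tools listed above:

python3 --version        # expect Python 3.12
ls /home/paper           # read-only inputs: paper PDF, addendum, rubric
ls /home/submission      # agent output goes here, including reproduce.sh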

API Keys

To provide API keys to the agent sandbox:

  1. Copy the sample env file:

    cp src/inspect_evals/paperbench/images/agent.env.example src/inspect_evals/paperbench/images/agent.env
  2. Fill in your API keys in agent.env:

    OPENAI_BASE_URL=https://api.openai.com/v1
    OPENAI_API_KEY=...
    HF_TOKEN=...

The agent.env file is gitignored and will not be committed. These environment variables will be available inside the sandbox container.
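
For example, the agent can confirm inside the sandbox that the keys were injected (an illustrative check, not part of the eval):

env | grep -E 'OPENAI|HF_TOKEN'    # list the injected variables
python3 -c "import os; print('OPENAI_API_KEY' in os.environ)"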

Differences from the Original PaperBench Implementation

Feature          | Our Environment                                 | Original
Docker-in-Docker | Not supported                                   | Agents can build/run Docker containers in the sandbox
API Keys         | Copied to file and set as environment variables | Only copied to file

Docker-in-Docker: The original implementation maps the host machine's docker.sock into the sandbox. While this allows the agent to test its reproduction script, it also allows the agent to escape the sandbox by issuing commands directly to the host's Docker daemon. The authors may have had another layer of defense against this, but typical users do not, so we have left this feature out.

Solver

The default solver (paperbench_solver) uses Inspect's react() agent and tools for maintainability. It uses prompts similar to the original paper implementation's, but differs from it in a few ways; the table below outlines the differences.

Limits are enforced at the Task level: time_limit (default 6 hours, set from max_time_in_hours), message_limit (default 100), and token_limit (default unlimited).
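
For orientation, a minimal sketch of such a solver is shown below. This is not the actual solvers.py: the prompt text and timeouts are placeholders, and it assumes a recent inspect_ai where react() and these tools are available.

from inspect_ai.agent import react
from inspect_ai.tool import bash, python, text_editor, web_browser

# Illustrative only; the real prompt comes from instructions.txt and the
# real configuration lives in solvers.py.
paperbench_like_solver = react(
    prompt="Reproduce the paper's key results and write reproduce.sh to /home/submission.",
    tools=[bash(timeout=600), python(timeout=600), text_editor(), *web_browser()],
    truncation="auto",  # automatic context management (see the table below)
)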

Differences from Original PaperBench Implementation

Our implementation differs from the original basic_agent_plus in the following ways:

Feature                               | Our Solver                             | Original
Implementation                        | Inspect's react() agent                | Custom agent loop
Tools                                 | bash, python, text_editor, web_browser | bash, python, read_file_chunk, search_file, web_browser
Limits                                | Task-level (time, message, token)      | Custom real-time tracking with retry time subtraction
Context management                    | truncation="auto"                      | Message pruning (preserves 70%)
Periodic time limit and git reminders | None                                   | Periodic updates every 5 steps
Model-specific handling               | None                                   | Special handling for o3/o4 and anthropic models

Our solver prioritizes simplicity and maintainability while the original optimizes for long-running evaluations with progress feedback.

Also note that the original basic_iterative_agent is essentially basic_agent_plus without a submit tool and with a different continuation prompt, so most of the differences above still apply.

Scoring

Scoring happens in a separate task (paperbench_score) that:

  1. Loads submissions from ./submissions/
  2. Runs reproduce.sh in a fresh sandbox
  3. Uses a judge to evaluate results against the rubric

Judge Types

Judge Type | Description
dummy      | Returns 0 for all scores (default, for testing)
simple     | LLM-based judge that evaluates each rubric criterion

To use the LLM-based judge:

inspect eval inspect_evals/paperbench_score --model openai/o3-mini-2025-01-31 -T judge_type=simple

Note: The simple judge only supports OpenAI models due to its use of tiktoken for token counting.

See SCORING_DESIGN.md for the scoring architecture.

Agent Requirements

The agent must create a reproduce.sh script in /home/submission (a minimal sketch follows the list) that:

  • Installs dependencies
  • Runs the reproduction code
  • Is executable via bash reproduce.sh
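
A minimal sketch of such a script is shown below; every file name here is a placeholder, since the real contents depend entirely on the paper being reproduced:

#!/bin/bash
set -e                              # stop on the first failing step

# Install dependencies (requirements.txt is a placeholder)
pip install -r requirements.txt

# Re-run the experiments and regenerate the paper's key results (placeholders)
python train.py
python make_figures.py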

Changelog

[1.0.1] - 2025-12-18

  • Adds a backoff policy for functions that connect to HuggingFace servers.