PaperBench: Evaluating AI’s Ability to Replicate AI Research (Work In Progress)

Tags: Coding, Agent, AI R&D

Agents are evaluated on their ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch. Given a research paper PDF, an addendum with clarifications, and a rubric defining evaluation criteria, the agent must reproduce the paper’s key results by writing and executing code.

Note: This eval is a work in progress. See https://github.com/UKGovernmentBEIS/inspect_evals/issues/334 for status.

Overview

PaperBench evaluates AI agents on their ability to reproduce the results of machine learning research papers.

Paper: https://arxiv.org/abs/2504.01848

Original implementation: https://github.com/openai/frontier-evals/tree/main/project/paperbench

The instructions.txt file is copied from the openai/frontier-evals repository as of December 2025.

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package via:

pip install inspect_evals

If you are using Inspect Evals in its repository, start by installing the dependencies with:

uv sync

Usage

PaperBench uses two tasks: paperbench runs the agent to reproduce papers, and paperbench_score scores the submissions in a fresh sandbox.

# Step 1: Run agent on papers (saves submissions to ./submissions/)
inspect eval inspect_evals/paperbench --model <model>

# Step 2: Score submissions in fresh sandbox
inspect eval inspect_evals/paperbench_score --model openai/o3-mini-2025-01-31 -T judge_type=simple

Both tasks use ./submissions/ by default, so they work together automatically. This two-task design ensures reproduce.sh runs in a fresh sandbox, matching the original PaperBench implementation’s approach of testing reproduction in a clean environment.

Models

PaperBench involves multiple models at different stages:

Model        | Stage                 | Purpose
Solver model | paperbench task       | The agent that reads the paper and attempts to reproduce it by writing code
Judge model  | paperbench_score task | Evaluates submissions against the rubric; used for (1) ranking relevant files and (2) scoring each criterion

The --model flag sets the model for each task: the solver model for paperbench, and the judge model for paperbench_score.

Important: To compare results across different solver models, use the same judge model for scoring. The original PaperBench paper uses openai/o3-mini-2025-01-31 as the judge.
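
For example, to compare two solver models while holding the judge fixed, the runs might look like this (the solver model names and submission directories below are placeholders):

# Solver model A: run, then score with the fixed judge
inspect eval inspect_evals/paperbench --model <solver-model-a> -T submissions_dir=./submissions-a
inspect eval inspect_evals/paperbench_score --model openai/o3-mini-2025-01-31 -T judge_type=simple -T submissions_dir=./submissions-a

# Solver model B: same judge, separate submissions directory
inspect eval inspect_evals/paperbench --model <solver-model-b> -T submissions_dir=./submissions-b
inspect eval inspect_evals/paperbench_score --model openai/o3-mini-2025-01-31 -T judge_type=simple -T submissions_dir=./submissions-b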

Task Options

# Run on dev split (3 papers for development/testing)
inspect eval inspect_evals/paperbench --model <model> -T split=dev

# Run on specific papers
inspect eval inspect_evals/paperbench --model <model> -T paper_ids=adaptive-pruning
inspect eval inspect_evals/paperbench --model <model> -T paper_ids='["adaptive-pruning", "robust-clip"]'

# Task-level limits
inspect eval inspect_evals/paperbench --model <model> \
  -T max_time_in_hours=12 \
  -T message_limit=50 \
  -T token_limit=1000000

# Custom submissions directory
inspect eval inspect_evals/paperbench --model <model> -T submissions_dir=./my-submissions
inspect eval inspect_evals/paperbench_score --model <model> -T submissions_dir=./my-submissions

# Solver options (configure the default solver)
inspect eval inspect_evals/paperbench --model <model> \
  --solver inspect_evals/paperbench_solver \
  -S max_attempts=3 \
  -S tool_timeout=600

# Scoring task options
inspect eval inspect_evals/paperbench_score --model <model> \
  -T reproduce_timeout_hours=12 \
  -T retry_threshold_minutes=5

# Sandbox type (both tasks support this)
inspect eval inspect_evals/paperbench --model <model> -T sandbox_type=docker
inspect eval inspect_evals/paperbench_score --model <model> -T sandbox_type=docker

Splits

Split | Papers | Description
None  | 23     | All available papers (default)
prod  | 20     | Papers with results reported in the PaperBench paper
dev   | 3      | Development papers for testing
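
For example, to run only the 20 papers with reported results, select the prod split the same way the dev split is selected above:

inspect eval inspect_evals/paperbench --model <model> -T split=prod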

Pre-downloading Dataset

To download all paper files to cache before running the eval:

from inspect_evals.paperbench.dataset import paperbench_dataset
paperbench_dataset()  # Downloads all 23 papers to cache

Files are cached in ~/.cache/inspect_evals/paperbench/ (Linux/macOS) by default.
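
The same pre-download can be run from the shell as a one-liner, if that is more convenient:

python -c "from inspect_evals.paperbench.dataset import paperbench_dataset; paperbench_dataset()"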

Structure

paperbench/
├── dataset.py                 # HuggingFace dataset loader
├── images/
│   ├── Dockerfile             # Sandbox environment
│   ├── compose.yaml
│   └── agent.env.example      # Template for agent.env (API keys)
├── instructions.txt           # Agent instructions template
├── score/                     # Scoring logic
├── paperbench.py              # Inspect Task definition
└── solvers.py                 # Default agent solver

Paper files (PDF, addendum, rubric, etc.) are automatically downloaded from HuggingFace and cached locally.

Sandbox Environment

The agent runs in a Docker container with the following (see the sketch after this list):

  • Python 3.12 (Debian-based)
  • Basic tools: git, curl, wget, build-essential, ffmpeg
  • Working directories: /home/paper (read-only inputs), /home/submission (agent output)
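
A few illustrative commands the agent might run first to orient itself, using the paths and tools listed above:

python3 --version        # expect Python 3.12
ls /home/paper           # read-only inputs: paper PDF, addendum, rubric
ls /home/submission      # agent output goes here, including reproduce.sh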

API Keys

To provide API keys to the agent sandbox:

  1. Copy the sample env file:

    cp src/inspect_evals/paperbench/images/agent.env.example src/inspect_evals/paperbench/images/agent.env
  2. Fill in your API keys in agent.env:

    OPENAI_BASE_URL=https://api.openai.com/v1
    OPENAI_API_KEY=...
    HF_TOKEN=...

The agent.env file is gitignored and will not be committed. These environment variables will be available inside the sandbox container.
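
For example, the agent can confirm inside the sandbox that the keys were injected (an illustrative check, not part of the eval):

env | grep -E 'OPENAI|HF_TOKEN'    # list the injected variables
python3 -c "import os; print('OPENAI_API_KEY' in os.environ)"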

Differences from the Original PaperBench Implementation

Feature          | Our Environment                                 | Original
Docker-in-Docker | Not supported                                   | Agents can build/run Docker containers in the sandbox
API Keys         | Copied to file and set as environment variables | Only copied to file

Docker-in-Docker: The original implementation maps the host machine's docker.sock into the sandbox. While this allows the agent to test its reproduction script, it also allows the agent to escape the sandbox by issuing commands directly to the host's Docker daemon. The authors may have had another layer of defense against this, but typical users do not, so we have left this feature out.

Solver

The default solver (paperbench_solver) uses Inspect's react() agent and tools for maintainability. It uses prompts similar to the original paper implementation's, but differs from it in a few ways; the table below outlines the differences.

Limits are enforced at the Task level: time_limit (default 6 hours, set from max_time_in_hours), message_limit (default 100), and token_limit (default unlimited).
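
For orientation, a minimal sketch of such a solver is shown below. This is not the actual solvers.py: the prompt text and timeouts are placeholders, and it assumes a recent inspect_ai where react() and these tools are available.

from inspect_ai.agent import react
from inspect_ai.tool import bash, python, text_editor, web_browser

# Illustrative only; the real prompt comes from instructions.txt and the
# real configuration lives in solvers.py.
paperbench_like_solver = react(
    prompt="Reproduce the paper's key results and write reproduce.sh to /home/submission.",
    tools=[bash(timeout=600), python(timeout=600), text_editor(), *web_browser()],
    truncation="auto",  # automatic context management (see the table below)
)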

Differences from Original PaperBench Implementation

Our implementation differs from the original basic_agent_plus in the following ways:

Feature                               | Our Solver                             | Original
Implementation                        | Inspect's react() agent                | Custom agent loop
Tools                                 | bash, python, text_editor, web_browser | bash, python, read_file_chunk, search_file, web_browser
Limits                                | Task-level (time, message, token)      | Custom real-time tracking with retry time subtraction
Context management                    | truncation="auto"                      | Message pruning (preserves 70%)
Periodic time limit and git reminders | None                                   | Periodic updates every 5 steps
Model-specific handling               | None                                   | Special handling for o3/o4 and anthropic models

Our solver prioritizes simplicity and maintainability while the original optimizes for long-running evaluations with progress feedback.

Also note that the original basic_iterative_agent is essentially basic_agent_plus without a submit tool and with a different continuation prompt, so most of the differences above still apply.

Scoring

Scoring happens in a separate task (paperbench_score) that:

  1. Loads submissions from ./submissions/
  2. Runs reproduce.sh in a fresh sandbox
  3. Uses a judge to evaluate results against the rubric

Judge Types

Judge Type | Description
dummy      | Returns 0 for all scores (default, for testing)
simple     | LLM-based judge that evaluates each rubric criterion

To use the LLM-based judge:

inspect eval inspect_evals/paperbench_score --model openai/o3-mini-2025-01-31 -T judge_type=simple

Note: The simple judge only supports OpenAI models due to its use of tiktoken for token counting.

See SCORING_DESIGN.md for the scoring architecture.

Agent Requirements

The agent must create a reproduce.sh script in /home/submission (a minimal sketch follows the list) that:

  • Installs dependencies
  • Runs the reproduction code
  • Is executable via bash reproduce.sh
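
A minimal sketch of such a script is shown below; every file name here is a placeholder, since the real contents depend entirely on the paper being reproduced:

#!/bin/bash
set -e                              # stop on the first failing step

# Install dependencies (requirements.txt is a placeholder)
pip install -r requirements.txt

# Re-run the experiments and regenerate the paper's key results (placeholders)
python train.py
python make_figures.py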

Changelog

[1.0.1] - 2025-12-18

  • Adds a backoff policy for functions that connect to HuggingFace servers.