PaperBench: Evaluating AI’s Ability to Replicate AI Research (Work In Progress)

Coding, Agent, AI R&D

Agents are evaluated on their ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch. Given a research paper PDF, an addendum with clarifications, and a rubric defining evaluation criteria, the agent must reproduce the paper’s key results by writing and executing code.

Note: This eval is a work in progress. See https://github.com/UKGovernmentBEIS/inspect_evals/issues/334 for status.

Overview

PaperBench evaluates AI agents on their ability to reproduce the results of machine learning research papers.

Paper: https://arxiv.org/abs/2504.01848

Original implementation: https://github.com/openai/frontier-evals/tree/main/project/paperbench

The instructions.txt file was copied from the openai/frontier-evals repository as of December 2025.

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package via:

pip install inspect_evals

If you are using Inspect Evals in its repository, start by installing the dependencies with:

uv sync

Usage

# Run on all papers (23 papers, default)
inspect eval inspect_evals/paperbench --model <model>

# Run on dev split (3 papers for development/testing)
inspect eval inspect_evals/paperbench --model <model> -T split=dev

# Run on specific papers
inspect eval inspect_evals/paperbench --model <model> -T paper_ids=adaptive-pruning
inspect eval inspect_evals/paperbench --model <model> -T paper_ids='["adaptive-pruning", "robust-clip"]'


# Task-level limits
inspect eval inspect_evals/paperbench --model <model> \
  -T max_time_in_hours=12 \
  -T message_limit=50 \
  -T token_limit=1000000

# Solver options (configure the default solver)
inspect eval inspect_evals/paperbench --model <model> \
  --solver inspect_evals/paperbench_solver \
  -S max_attempts=3 \
  -S tool_timeout=600

Splits

Split  Papers  Description
None   23      All available papers (default)
prod   20      Papers with results reported in the PaperBench paper
dev    3       Development papers for testing

Pre-downloading Dataset

To download all paper files to cache before running the eval:

from inspect_evals.paperbench.dataset import paperbench_dataset
paperbench_dataset()  # Downloads all 23 papers to cache

Files are cached in ~/.cache/inspect_evals/paperbench/ (Linux/macOS) by default.
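
To verify what was downloaded, you can list the cache directory. This is a minimal check; the assumption that the cache holds one entry per paper is illustrative, not guaranteed by the API:

from pathlib import Path

cache_dir = Path.home() / ".cache" / "inspect_evals" / "paperbench"
print(sorted(p.name for p in cache_dir.iterdir()))  # expected: one entry per paper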

Structure

paperbench/
├── dataset.py                 # HuggingFace dataset loader
├── images/
│   ├── Dockerfile             # Sandbox environment
│   ├── compose.yaml
│   └── agent.env.example      # Template for agent.env (API keys)
├── instructions.txt           # Agent instructions template
├── score/                     # Scoring logic
├── paperbench.py              # Inspect Task definition
└── solvers.py                 # Default agent solver

Paper files (PDF, addendum, rubric, etc.) are automatically downloaded from HuggingFace and cached locally.

Solver

The default solver (paperbench_solver) uses Inspect’s react() agent and tools for maintainability. Its prompts are similar to those of the original paper implementation, but the solver differs in a few ways, outlined in the table below.

Limits are enforced at the Task level: time_limit (set from max_time_in_hours), message_limit (default 100), and token_limit (default unlimited).
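
For orientation, the sketch below shows how a solver along these lines can be assembled with Inspect’s agent API. The prompt text, defaults, and function name are illustrative assumptions, not the exact paperbench_solver implementation:

from inspect_ai.agent import react
from inspect_ai.tool import bash, python, text_editor, web_browser

def paperbench_solver_sketch(max_attempts: int = 1, tool_timeout: int = 600):
    # Illustrative prompt; it stands in for the instructions.txt template.
    return react(
        prompt=(
            "Replicate the paper in /home/paper by writing and running code. "
            "Write your submission, including reproduce.sh, to /home/submission."
        ),
        tools=[
            bash(timeout=tool_timeout),
            python(timeout=tool_timeout),
            text_editor(timeout=tool_timeout),
            *web_browser(),  # web_browser() expands to a set of browsing tools
        ],
        attempts=max_attempts,
        truncation="auto",  # prune older messages when the context overflows
    )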

Differences from Original PaperBench Implementation

Our implementation differs from the original basic_agent_plus in the following ways:

Feature               Our Solver                               Original
Implementation        Inspect’s react() agent                  Custom agent loop
Tools                 bash, python, text_editor, web_browser   bash, python, read_file_chunk, search_file, web_browser
Limits                Task-level (time, message, token)        Custom real-time tracking with retry time subtraction
Context management    truncation="auto"                        Message pruning (preserves 70%)
Periodic reminders    None                                     Time-limit and git reminders every 5 steps
Model-specific logic  None                                     Special handling for o3/o4 and Anthropic models

Our solver prioritizes simplicity and maintainability while the original optimizes for long-running evaluations with progress feedback.

Also note that the original basic_iterative_agent is essentially basic_agent_plus without a submit tool and with a different continuation prompt, so most of the differences above apply to it as well.

Sandbox Environment

The agent runs in a Docker container with:

  • Python 3.12 (Debian-based)
  • Basic tools: git, curl, wget, build-essential, ffmpeg
  • Working directories: /home/paper (read-only inputs), /home/submission (agent output)

API Keys

To provide API keys to the agent sandbox:

  1. Copy the sample env file:

    cp src/inspect_evals/paperbench/images/agent.env.example src/inspect_evals/paperbench/images/agent.env
  2. Fill in your API keys in agent.env:

    OPENAI_BASE_URL=https://api.openai.com/v1
    OPENAI_API_KEY=...
    HF_TOKEN=...

The agent.env file is gitignored and will not be committed. These environment variables will be available inside the sandbox container.
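
Inside the sandbox, the agent can then read the keys from its environment as usual, for example:

import os

# Present because agent.env is injected into the sandbox container.
api_key = os.environ.get("OPENAI_API_KEY")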

Differences from Original PaperBench Implementation

Feature           Our Environment                                  Original
Docker-in-Docker  Not supported                                    Agents can build/run Docker containers in the sandbox
API Keys          Copied to file and set as environment variables  Only copied to file
Reproduction      Runs in the same sandbox as the agent            Runs in a fresh container

Docker-in-Docker: The original implementation maps the host machine’s docker.sock into the sandbox. While this allows the agent to test its reproduction script, it also allows the agent to escape the sandbox by issuing commands directly to the host’s Docker daemon. The authors may have had another layer of defense against this, but typical users do not, so we have left this feature out.

Reproduction Sandbox: The original implementation spins up a fresh container for running reproduce.sh, ensuring a clean environment. Our implementation runs reproduction in the same sandbox where the agent worked. This means leftover state (installed packages, environment variables, modified files) may affect reproduction results.

Scorer

The eval currently uses a placeholder scorer; a full rubric-based LLM-judge scorer is planned. See SCORING_DESIGN.md for the scoring architecture.

Agent Requirements

The agent must create a reproduce.sh script in /home/submission that:

  • Installs dependencies
  • Runs the reproduction code
  • Is executable via bash reproduce.sh
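
For illustration only, a submission step might write a skeleton like the one below. The file names (requirements.txt, run_all.py) are hypothetical placeholders; the real entry points depend on the paper being replicated:

from pathlib import Path

# Write a minimal reproduce.sh skeleton into the submission directory.
# run_all.py is a hypothetical end-to-end replication entry point.
script = """#!/bin/bash
set -e
pip install -r /home/submission/requirements.txt
python /home/submission/run_all.py
"""
path = Path("/home/submission/reproduce.sh")
path.write_text(script)
path.chmod(0o755)  # make executable; `bash reproduce.sh` works regardless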