PaperBench: Evaluating AI’s Ability to Replicate AI Research (Work In Progress)
Agents are evaluated on their ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch. Given a research paper PDF, an addendum with clarifications, and a rubric defining evaluation criteria, the agent must reproduce the paper’s key results by writing and executing code.
Note: This eval is a work in progress. See https://github.com/UKGovernmentBEIS/inspect_evals/issues/334 for status.
Overview
PaperBench evaluates AI agents on their ability to reproduce the results of machine learning research papers.
Paper: https://arxiv.org/abs/2504.01848
Original implementation: https://github.com/openai/frontier-evals/tree/main/project/paperbench
The instructions.txt file is copied from the December 2025 version of the openai/frontier-evals repo.
Installation
There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package via:

```bash
pip install inspect_evals
```

If you are using Inspect Evals in its repository, start by installing the dependencies with:
```bash
uv sync
```

Usage
```bash
# Run on all papers (23 papers, default)
inspect eval inspect_evals/paperbench --model <model>

# Run on dev split (3 papers for development/testing)
inspect eval inspect_evals/paperbench --model <model> -T split=dev

# Run on specific papers
inspect eval inspect_evals/paperbench --model <model> -T paper_ids=adaptive-pruning
inspect eval inspect_evals/paperbench --model <model> -T paper_ids='["adaptive-pruning", "robust-clip"]'

# Task-level limits
inspect eval inspect_evals/paperbench --model <model> \
  -T max_time_in_hours=12 \
  -T message_limit=50 \
  -T token_limit=1000000

# Solver options (configure the default solver)
inspect eval inspect_evals/paperbench --model <model> \
  --solver inspect_evals/paperbench_solver \
  -S max_attempts=3 \
  -S tool_timeout=600
```

Splits
| Split | Papers | Description |
|---|---|---|
| `None` | 23 | All available papers (default) |
| `prod` | 20 | Papers with results reported in the PaperBench paper |
| `dev` | 3 | Development papers for testing |
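The same task options can also be passed programmatically. Below is a minimal sketch, assuming the `paperbench` task function is exported from `inspect_evals.paperbench` (the exact import may differ; see paperbench.py) and using an arbitrary model identifier:

```python
# Sketch only: programmatic equivalent of the CLI examples above.
from inspect_ai import eval

from inspect_evals.paperbench import paperbench  # assumed export; see paperbench.py

# Run the dev split (3 papers) with task-level limits.
eval(
    paperbench(split="dev", max_time_in_hours=12, message_limit=50),
    model="openai/gpt-4o",
)
```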
Pre-downloading Dataset
To download all paper files to cache before running the eval:
```python
from inspect_evals.paperbench.dataset import paperbench_dataset

paperbench_dataset()  # Downloads all 23 papers to cache
```

Files are cached in `~/.cache/inspect_evals/paperbench/` (Linux/macOS) by default.
Structure
paperbench/
├── dataset.py # HuggingFace dataset loader
├── images/
│ ├── Dockerfile # Sandbox environment
│ ├── compose.yaml
│ └── agent.env.example # Template for agent.env (API keys)
├── instructions.txt # Agent instructions template
├── score/ # Scoring logic
├── paperbench.py # Inspect Task definition
└── solvers.py # Default agent solver
Paper files (PDF, addendum, rubric, etc.) are automatically downloaded from HuggingFace and cached locally.
Solver
The default solver (`paperbench_solver`) uses Inspect’s `react()` agent and tools for maintainability. Its prompts are similar to those of the original paper implementation, but it differs in a few ways, outlined in the table below.
Limits are enforced at the Task level: time_limit (set from max_time_in_hours), message_limit (default 100), and token_limit (default unlimited).
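For illustration, here is a minimal sketch of what a `react()`-based solver along these lines could look like. This is not the actual solvers.py; the real prompts, timeouts, and attempt handling differ:

```python
# Illustrative sketch only -- see solvers.py for the real paperbench_solver.
from inspect_ai.agent import react
from inspect_ai.tool import bash, python, text_editor, web_browser

agent = react(
    prompt=(
        "Replicate the paper's key results and write a reproduce.sh "
        "script to /home/submission."  # hypothetical prompt text
    ),
    tools=[bash(timeout=600), python(timeout=600), text_editor(), web_browser()],
    attempts=3,           # roughly what -S max_attempts configures
    truncation="auto",    # context management (see the table below)
)
```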
Differences from Original PaperBench Implementation
Our implementation differs from the original basic_agent_plus in the following ways:
| Feature | Our Solver | Original |
|---|---|---|
| Implementation | Inspect’s `react()` agent | Custom agent loop |
| Tools | `bash`, `python`, `text_editor`, `web_browser` | `bash`, `python`, `read_file_chunk`, `search_file`, `web_browser` |
| Limits | Task-level (time, message, token) | Custom real-time tracking with retry time subtraction |
| Context management | `truncation="auto"` | Message pruning (preserves 70%) |
| Periodic time limit and git reminders | None | Periodic updates every 5 steps |
| Model-specific | None | Special handling for o3/o4 and anthropic models |
Our solver prioritizes simplicity and maintainability while the original optimizes for long-running evaluations with progress feedback.
Also note that the original basic_iterative_agent is essentially basic_agent_plus without a submit tool and with a different continuation prompt, so most of the above applies to it as well.
Sandbox Environment
The agent runs in a Docker container with:
- Python 3.12 (Debian-based)
- Basic tools: git, curl, wget, build-essential, ffmpeg
- Working directories:
`/home/paper` (read-only inputs), `/home/submission` (agent output)
API Keys
To provide API keys to the agent sandbox:
Copy the sample env file:

```bash
cp src/inspect_evals/paperbench/images/agent.env.example src/inspect_evals/paperbench/images/agent.env
```

Fill in your API keys in agent.env:

```
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=...
HF_TOKEN=...
```
The agent.env file is gitignored and will not be committed. These environment variables will be available inside the sandbox container.
Differences from Original PaperBench Implementation
| Feature | Our Environment | Original |
|---|---|---|
| Docker-in-Docker | Not supported | Agents can build/run Docker containers in the sandbox |
| API Keys | Copied to file and set as environment variables | Only copied to file |
| Reproduction | Runs in same sandbox as agent | Runs in a fresh container |
Docker-in-Docker: The original implementation maps the docker.sock file from the host machine into the sandbox. While this allows the agent to test its reproduction script, it also allows the agent to escape the sandbox by issuing commands directly to the host’s Docker daemon. The authors might have had another layer of defense for this, but typical users don’t. Therefore, we’ve decided to leave this feature out.
Reproduction Sandbox: The original implementation spins up a fresh container for running reproduce.sh, ensuring a clean environment. Our implementation runs reproduction in the same sandbox where the agent worked. This means leftover state (installed packages, environment variables, modified files) may affect reproduction results.
Scorer
PaperBench currently uses a placeholder scorer; a full rubric-based LLM judge scorer is planned. See SCORING_DESIGN.md for the scoring architecture.
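For readers unfamiliar with Inspect scorers, here is a minimal sketch of what a placeholder scorer can look like. This is not the code in score/; the function name and fixed score are illustrative only:

```python
# Sketch only: a trivial placeholder scorer, not the project's score/ implementation.
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState


@scorer(metrics=[accuracy()])
def placeholder_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        # Rubric-based LLM judging is planned; for now return a constant score.
        return Score(value=0, explanation="Placeholder: rubric grading not yet implemented.")

    return score
```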
Agent Requirements
The agent must create a reproduce.sh script in /home/submission that:
- Installs dependencies
- Runs the reproduction code
- Is executable via `bash reproduce.sh`