PaperBench: Evaluating AI’s Ability to Replicate AI Research (Work In Progress)
Agents are evaluated on their ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch. Given a research paper PDF, an addendum with clarifications, and a rubric defining evaluation criteria, the agent must reproduce the paper’s key results by writing and executing code.
Note: This eval is a work in progress. See https://github.com/UKGovernmentBEIS/inspect_evals/issues/334 for status.
Overview
PaperBench evaluates AI agents on their ability to reproduce the results of machine learning research papers.
Paper: https://arxiv.org/abs/2504.01848
Original implementation: https://github.com/openai/frontier-evals/tree/main/project/paperbench
The instructions.txt file is copied from the December 2025 version of the openai/frontier-evals repository.
Installation
There are two ways of using Inspect Evals: from pypi as a dependency of your own project, or as a standalone checked-out GitHub repository.
If you are using it from pypi, install the package via:

```bash
pip install inspect_evals
```

If you are using Inspect Evals in its repository, start by installing the dependencies with:

```bash
uv sync
```

Usage
PaperBench uses two tasks: paperbench runs the agent to reproduce papers, and paperbench_score scores the submissions in a fresh sandbox.
```bash
# Step 1: Run agent on papers (saves submissions to ./submissions/)
inspect eval inspect_evals/paperbench --model <model>

# Step 2: Score submissions in fresh sandbox
inspect eval inspect_evals/paperbench_score --model openai/o3-mini-2025-01-31 -T judge_type=simple
```

Both tasks use ./submissions/ by default, so they work together automatically. This two-task design ensures reproduce.sh runs in a fresh sandbox, matching the original PaperBench implementation’s approach of testing reproduction in a clean environment.
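The same two-step flow can also be driven from Python. The snippet below is a minimal sketch, assuming inspect_ai's eval() API with tasks referenced by registry name and task options passed via task_args; the solver model name is a placeholder.

```python
# Minimal sketch: run the agent task, then score the resulting submissions.
# Assumes inspect_ai's eval() API; "<solver-model>" is a placeholder.
from inspect_ai import eval

# Step 1: run the agent; submissions are written to ./submissions/ by default.
eval("inspect_evals/paperbench", model="<solver-model>")

# Step 2: score those submissions with the LLM judge in a fresh sandbox.
eval(
    "inspect_evals/paperbench_score",
    model="openai/o3-mini-2025-01-31",
    task_args={"judge_type": "simple"},
)
```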
Models
PaperBench involves multiple models at different stages:
| Model | Stage | Purpose |
|---|---|---|
| Solver model | paperbench task | The agent that reads the paper and attempts to reproduce it by writing code |
| Judge model | paperbench_score task | Evaluates submissions against the rubric. Used for (1) ranking relevant files and (2) scoring each criterion |
The --model flag sets the model for each task: the solver model for paperbench, and the judge model for paperbench_score.
Important: To compare results across different solver models, use the same judge model for scoring. The original PaperBench paper uses openai/o3-mini-2025-01-31 as the judge.
Task Options
```bash
# Run on dev split (3 papers for development/testing)
inspect eval inspect_evals/paperbench --model <model> -T split=dev

# Run on specific papers
inspect eval inspect_evals/paperbench --model <model> -T paper_ids=adaptive-pruning
inspect eval inspect_evals/paperbench --model <model> -T paper_ids='["adaptive-pruning", "robust-clip"]'

# Task-level limits
inspect eval inspect_evals/paperbench --model <model> \
  -T max_time_in_hours=12 \
  -T message_limit=50 \
  -T token_limit=1000000

# Custom submissions directory
inspect eval inspect_evals/paperbench --model <model> -T submissions_dir=./my-submissions
inspect eval inspect_evals/paperbench_score --model <model> -T submissions_dir=./my-submissions

# Solver options (configure the default solver)
inspect eval inspect_evals/paperbench --model <model> \
  --solver inspect_evals/paperbench_solver \
  -S max_attempts=3 \
  -S tool_timeout=600

# Scoring task options
inspect eval inspect_evals/paperbench_score --model <model> \
  -T reproduce_timeout_hours=12 \
  -T retry_threshold_minutes=5

# Sandbox type (both tasks support this)
inspect eval inspect_evals/paperbench --model <model> -T sandbox_type=docker
inspect eval inspect_evals/paperbench_score --model <model> -T sandbox_type=docker
```

Splits
| Split | Papers | Description |
|---|---|---|
| None | 23 | All available papers (default) |
| prod | 20 | Papers with results reported in the PaperBench paper |
| dev | 3 | Development papers for testing |
Pre-downloading Dataset
To download all paper files to cache before running the eval:
```python
from inspect_evals.paperbench.dataset import paperbench_dataset

paperbench_dataset()  # Downloads all 23 papers to cache
```

Files are cached in ~/.cache/inspect_evals/paperbench/ (Linux/macOS) by default.
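As a quick sanity check you can list what has already been downloaded. This is an illustrative snippet that simply inspects the default Linux/macOS cache path mentioned above.

```python
# List locally cached PaperBench paper folders (adjust the path if your
# cache lives elsewhere).
from pathlib import Path

cache_dir = Path.home() / ".cache" / "inspect_evals" / "paperbench"
if cache_dir.exists():
    for entry in sorted(cache_dir.iterdir()):
        print(entry.name)
else:
    print("Nothing cached yet")
```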
Structure
paperbench/
├── dataset.py # HuggingFace dataset loader
├── images/
│ ├── Dockerfile # Sandbox environment
│ ├── compose.yaml
│ └── agent.env.example # Template for agent.env (API keys)
├── instructions.txt # Agent instructions template
├── score/ # Scoring logic
├── paperbench.py # Inspect Task definition
└── solvers.py # Default agent solver
Paper files (PDF, addendum, rubric, etc.) are automatically downloaded from HuggingFace and cached locally.
Sandbox Environment
The agent runs in a Docker container with:
- Python 3.12 (Debian-based)
- Basic tools: git, curl, wget, build-essential, ffmpeg
- Working directories: /home/paper (read-only inputs), /home/submission (agent output)
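For a concrete picture, code the agent runs inside the sandbox sees these directories directly; the paths below are the ones listed above.

```python
# Illustrative: what agent code running inside the sandbox sees.
from pathlib import Path

paper_dir = Path("/home/paper")            # read-only inputs: PDF, addendum, rubric
submission_dir = Path("/home/submission")  # everything the agent wants graded goes here

print(sorted(p.name for p in paper_dir.iterdir()))
print(submission_dir.exists())
```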
API Keys
To provide API keys to the agent sandbox:
1. Copy the sample env file:

   ```bash
   cp src/inspect_evals/paperbench/images/agent.env.example src/inspect_evals/paperbench/images/agent.env
   ```

2. Fill in your API keys in agent.env:

   ```
   OPENAI_BASE_URL=https://api.openai.com/v1
   OPENAI_API_KEY=...
   HF_TOKEN=...
   ```
The agent.env file is gitignored and will not be committed. These environment variables will be available inside the sandbox container.
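Inside the sandbox, agent code can then rely on these variables being set. A minimal sketch, assuming the agent uses the standard openai Python client, which reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment:

```python
# Runs inside the sandbox: the variables from agent.env are already exported.
import os
from openai import OpenAI

assert "OPENAI_API_KEY" in os.environ   # provided via agent.env
client = OpenAI()                       # picks up OPENAI_API_KEY / OPENAI_BASE_URL
```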
Differences with Original PaperBench Implementation
| Feature | Our Environment | Original |
|---|---|---|
| Docker-in-Docker | Not supported | Agents can build/run Docker containers in the sandbox |
| API Keys | Copied to file and set as environment variables | Only copied to file |
Docker-in-Docker: The original implementation maps the docker.sock file from the host machine into the sandbox. While this allows the agent to test its reproduction script, it also allows the agent to escape the sandbox by issuing commands directly to the host’s Docker daemon. The authors might have had another layer of defense for this, but typical users don’t. Therefore, we’ve decided to leave this feature out.
Solver
The default solver (paperbench_solver) uses Inspect’s react() agent and tools for maintainability. Its prompts are similar to the original paper implementation, but it differs in a few ways, outlined in the table below.
Limits are enforced at the Task level: time_limit (default 6 hours, set from max_time_in_hours), message_limit (default 100), and token_limit (default unlimited).
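For illustration, the snippet below shows roughly how such task-level limits map onto Inspect Task parameters. It is a sketch assuming Task accepts time_limit, message_limit, and token_limit, and uses a placeholder dataset; it is not the actual paperbench task definition.

```python
# Illustrative only: task-level limits expressed as Inspect Task parameters.
from inspect_ai import Task
from inspect_ai.dataset import MemoryDataset, Sample

def limited_task(max_time_in_hours: float = 6.0,
                 message_limit: int = 100,
                 token_limit: int | None = None) -> Task:
    return Task(
        dataset=MemoryDataset([Sample(input="placeholder")]),  # stand-in dataset
        time_limit=int(max_time_in_hours * 60 * 60),            # hours -> seconds
        message_limit=message_limit,
        token_limit=token_limit,
    )
```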
Differences from Original PaperBench Implementation
Our implementation differs from the original basic_agent_plus in the following ways:
| Feature | Our Solver | Original |
|---|---|---|
| Implementation | Inspect’s react() agent | Custom agent loop |
| Tools | bash, python, text_editor, web_browser | bash, python, read_file_chunk, search_file, web_browser |
| Limits | Task-level (time, message, token) | Custom real-time tracking with retry time subtraction |
| Context management | truncation="auto" | Message pruning (preserves 70%) |
| Periodic time limit and git reminders | None | Periodic updates every 5 steps |
| Model-specific | None | Special handling for o3/o4 and Anthropic models |
Our solver prioritizes simplicity and maintainability while the original optimizes for long-running evaluations with progress feedback.
Also note that the basic_iterative_agent is essentially basic_agent_plus without a submit tool and with a different continuation prompt, so most of the above still applies.
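As a rough picture of what a react()-based solver looks like, here is an illustrative sketch that mirrors the tools and options described above (the parameter names follow the -S options shown earlier). It is not the actual paperbench_solver source.

```python
# Illustrative sketch of a react()-based solver using the tools listed above.
from inspect_ai.agent import react
from inspect_ai.tool import bash, python, text_editor, web_browser

def example_paperbench_solver(max_attempts: int = 1, tool_timeout: int = 300):
    return react(
        prompt="Read the paper in /home/paper and reproduce its key results. "
               "Write all code and a reproduce.sh script to /home/submission.",
        tools=[
            bash(timeout=tool_timeout),
            python(timeout=tool_timeout),
            text_editor(),
            *web_browser(),  # web_browser() returns a list of browser tools
        ],
        attempts=max_attempts,
        truncation="auto",   # automatic context truncation, as noted in the table
    )
```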
Scoring
Scoring happens in a separate task (paperbench_score) that:
- Loads submissions from ./submissions/
- Runs reproduce.sh in a fresh sandbox
- Uses a judge to evaluate results against the rubric
Judge Types
| Judge Type | Description |
|---|---|
| dummy | Returns 0 for all scores (default, for testing) |
| simple | LLM-based judge that evaluates each rubric criterion |
To use the LLM-based judge:
```bash
inspect eval inspect_evals/paperbench_score --model openai/o3-mini-2025-01-31 -T judge_type=simple
```

Note: The simple judge only supports OpenAI models due to its use of tiktoken for token counting.
See SCORING_DESIGN.md for the scoring architecture.
Agent Requirements
The agent must create a reproduce.sh script in /home/submission that:
- Installs dependencies
- Runs the reproduction code
- Is executable via bash reproduce.sh
Changelog
[1.0.1] - 2025-12-18
- Adds a backoff policy for functions that connect to Hugging Face servers.