PaperBench Scoring Design

This document describes the architecture for scoring PaperBench submissions in a clean sandbox environment.

Overview

Scoring consists of three components:

  1. Blacklist URL monitor - Checks that the model doesn’t access any blacklisted URLs during solving (see the sketch after this list)
  2. Reproduction - Runs the reproduce.sh script that the model generated, in a fresh environment
  3. LLM judge - Takes the reproduction results and scores them against the rubric
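
How the monitor obtains the list of URLs the agent accessed is not specified here. Assuming that list can be recovered (for example, from the agent's transcript or network logs), the check itself reduces to a membership test. A minimal sketch, where both argument lists are hypothetical inputs rather than part of the actual component:

def find_blacklist_violations(accessed_urls: list[str], blacklisted_urls: list[str]) -> list[str]:
    """Return every accessed URL that matches a blacklisted URL prefix."""
    return [
        url
        for url in accessed_urls
        if any(url.startswith(blocked) for blocked in blacklisted_urls)
    ]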

Problem

The current implementation runs reproduce.sh in the same sandbox where the agent worked. This means the reproduction might accidentally depend on:

  • System packages the agent installed globally
  • Files created outside the submission directory
  • Environment variables or state left behind

We want to verify that reproduce.sh works on a fresh machine with the same specs.

Solution

Split scoring into two tasks:

  1. Task 1: Agent solves the paper, exports submission
  2. Task 2: Fresh sandbox loads submission, runs reproduction, checks blacklist, judges results

Components

Component                 Type     Task    Purpose
paperbench_solver         Solver   Task 1  Agent reproduces the paper
save_submission           Cleanup  Task 1  Export tarball + JSON metadata
load_submission           Setup    Task 2  Unpack tarball to sandbox
reproduce                 Solver   Task 2  Run reproduce.sh
blacklist_url_monitor     Scorer   Task 2  Check agent didn’t access blacklisted URLs
paperbench_judge_scorer   Scorer   Task 2  LLM evaluates against rubric

Task Composition

# Task 1: Agent + Export
Task(
    solver=paperbench_solver(),
    cleanup=save_submission(submissions_output_dir="./submissions"),
)

# Task 2: Load + Reproduce + Judge
Task(
    dataset=load_manifest_dataset("./submissions"),
    setup=load_submission(submissions_output_dir="./submissions"),
    solver=reproduce(),
    scorer=[
        blacklist_url_monitor(),
        paperbench_judge_scorer(),
    ],
)
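
At its core, the reproduce step just runs the exported reproduce.sh from the root of the unpacked submission and captures its output for the judge. A minimal standalone sketch of that logic (the real solver presumably wraps this in the sandbox API; the timeout value is illustrative):

import subprocess
from pathlib import Path

def run_reproduction(submission_dir: Path, timeout_seconds: int = 3600) -> subprocess.CompletedProcess:
    """Execute reproduce.sh inside the submission directory and capture its output."""
    return subprocess.run(
        ["bash", "reproduce.sh"],
        cwd=submission_dir,
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )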

Data Flow

Task 1                                Task 2
┌─────────────────────────┐           ┌─────────────────────────┐
│ paperbench_solver       │           │ load_submission         │
│   Agent works on paper  │           │   Unpack tarball        │
│                         │           │                         │
│ save_submission         │  tarball  │ reproduce               │
│   Export submission ────┼───────────┼─► Run reproduce.sh      │
│   Write JSON metadata   │  + JSON   │                         │
└─────────────────────────┘           │ paperbench_judge_scorer │
                                      │   LLM evaluates rubric  │
                                      └─────────────────────────┘
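
The export step's core logic can be sketched independently of the eval framework: tar up the submission directory, skip files over 10MB, and record what was skipped in the per-sample metadata. The file naming follows the layout described in the next section; everything else below is illustrative, not the actual save_submission implementation:

import json
import tarfile
from pathlib import Path

MAX_FILE_BYTES = 10 * 1024 * 1024  # 10MB exclusion threshold from the metadata format below

def export_submission(submission_dir: Path, output_dir: Path, sample_id: str, paper_id: str) -> None:
    """Write <sample_id>.tar.gz plus <sample_id>.json into output_dir."""
    excluded: list[str] = []
    with tarfile.open(output_dir / f"{sample_id}.tar.gz", "w:gz") as tar:
        for path in sorted(submission_dir.rglob("*")):
            if not path.is_file():
                continue
            rel = str(path.relative_to(submission_dir))
            if path.stat().st_size > MAX_FILE_BYTES:
                excluded.append(rel)  # too large: leave out of the tarball, but record it
                continue
            tar.add(path, arcname=rel)
    metadata = {"sample_id": sample_id, "paper_id": paper_id, "excluded_files": excluded}
    (output_dir / f"{sample_id}.json").write_text(json.dumps(metadata, indent=2))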

Per-Sample Metadata Format

Task 1 outputs a JSON file for each sample, which Task 2 reads via load_manifest_dataset:

submissions/
  abc123.json
  abc123.tar.gz
  def456.json
  def456.tar.gz

Each JSON file contains a SubmissionMetadata object:

{
  "sample_id": "abc123",
  "paper_id": "arxiv_2301_001",
  "excluded_files": ["large_model.bin", "data/huge.csv"]
}

  • excluded_files: Files larger than 10MB that were excluded from the tarball

This layout is concurrency-safe (each sample writes its own files, so there is no shared manifest to contend over) and retry-safe (a retried sample overwrites its own files instead of appending duplicate entries).
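
A sketch of how Task 2 can consume this layout, assuming only the directory convention above (one <sample_id>.json next to one <sample_id>.tar.gz). The function names mirror the components listed earlier, but the bodies are illustrative rather than the actual framework integration:

import json
import tarfile
from pathlib import Path

def load_manifest_dataset(submissions_dir: str) -> list[dict]:
    """Build one sample per JSON metadata file found in the submissions directory."""
    samples = []
    for manifest in sorted(Path(submissions_dir).glob("*.json")):
        metadata = json.loads(manifest.read_text())
        metadata["tarball"] = str(manifest.with_suffix(".tar.gz"))  # abc123.json -> abc123.tar.gz
        samples.append(metadata)
    return samples

def load_submission(tarball: str, dest_dir: str) -> None:
    """Unpack one exported submission into the fresh sandbox's working directory."""
    with tarfile.open(tarball, "r:gz") as tar:
        tar.extractall(dest_dir)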

Implementation Status