PaperBench Scoring Design

This document describes the architecture for scoring PaperBench submissions in a clean sandbox environment.

Overview

Scoring consists of three components:

  1. Blacklist URL monitor - Checks that the model doesn’t access any blacklisted URLs during solving (see the sketch after this list)
  2. Reproduction - Runs the reproduce.sh script that the model generated, in a fresh environment
  3. LLM judge - Takes the reproduction results and scores them against the rubric
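
How the monitor obtains the list of URLs the agent accessed is not specified here. Assuming that list can be recovered (for example, from the agent's transcript or network logs), the check itself reduces to a membership test. A minimal sketch, where both argument lists are hypothetical inputs rather than part of the actual component:

def find_blacklist_violations(accessed_urls: list[str], blacklisted_urls: list[str]) -> list[str]:
    """Return every accessed URL that matches a blacklisted URL prefix."""
    return [
        url
        for url in accessed_urls
        if any(url.startswith(blocked) for blocked in blacklisted_urls)
    ]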

Problem

The current implementation runs reproduce.sh in the same sandbox where the agent worked. This means the reproduction might accidentally depend on:

  • System packages the agent installed globally
  • Files created outside the submission directory
  • Environment variables or state left behind

We want to verify that reproduce.sh works on a fresh machine with the same specs.

Solution

Split scoring into two tasks:

  1. Task 1: Agent solves the paper, exports submission
  2. Task 2: Fresh sandbox loads submission, runs reproduction, checks blacklist, judges results

Components

Component                 Type     Task    Purpose
paperbench_solver         Solver   Task 1  Agent reproduces the paper
save_submission           Cleanup  Task 1  Export tarball + JSON metadata
load_submission           Setup    Task 2  Unpack tarball to sandbox
reproduce                 Solver   Task 2  Run reproduce.sh
blacklist_url_monitor     Scorer   Task 2  Check agent didn’t access blacklisted URLs
paperbench_judge_scorer   Scorer   Task 2  LLM evaluates against rubric

Task Composition

# Task 1: Agent + Export
Task(
    solver=paperbench_solver(),
    cleanup=save_submission(submissions_output_dir="./submissions"),
)

# Task 2: Load + Reproduce + Judge
Task(
    dataset=load_manifest_dataset("./submissions"),
    setup=load_submission(submissions_output_dir="./submissions"),
    solver=reproduce(),
    scorer=[
        blacklist_url_monitor(),
        paperbench_judge_scorer(),
    ],
)
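
At its core, the reproduce step just runs the exported reproduce.sh from the root of the unpacked submission and captures its output for the judge. A minimal standalone sketch of that logic (the real solver presumably wraps this in the sandbox API; the timeout value is illustrative):

import subprocess
from pathlib import Path

def run_reproduction(submission_dir: Path, timeout_seconds: int = 3600) -> subprocess.CompletedProcess:
    """Execute reproduce.sh inside the submission directory and capture its output."""
    return subprocess.run(
        ["bash", "reproduce.sh"],
        cwd=submission_dir,
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )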

Data Flow

Task 1                                Task 2
┌─────────────────────────┐           ┌─────────────────────────┐
│ paperbench_solver       │           │ load_submission         │
│   Agent works on paper  │           │   Unpack tarball        │
│                         │           │                         │
│ save_submission         │  tarball  │ reproduce               │
│   Export submission ────┼───────────┼─► Run reproduce.sh      │
│   Write JSON metadata   │  + JSON   │                         │
└─────────────────────────┘           │ paperbench_judge_scorer │
                                      │   LLM evaluates rubric  │
                                      └─────────────────────────┘
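
The export step's core logic can be sketched independently of the eval framework: tar up the submission directory, skip files over 10MB, and record what was skipped in the per-sample metadata. The file naming follows the layout described in the next section; everything else below is illustrative, not the actual save_submission implementation:

import json
import tarfile
from pathlib import Path

MAX_FILE_BYTES = 10 * 1024 * 1024  # 10MB exclusion threshold from the metadata format below

def export_submission(submission_dir: Path, output_dir: Path, sample_id: str, paper_id: str) -> None:
    """Write <sample_id>.tar.gz plus <sample_id>.json into output_dir."""
    excluded: list[str] = []
    with tarfile.open(output_dir / f"{sample_id}.tar.gz", "w:gz") as tar:
        for path in sorted(submission_dir.rglob("*")):
            if not path.is_file():
                continue
            rel = str(path.relative_to(submission_dir))
            if path.stat().st_size > MAX_FILE_BYTES:
                excluded.append(rel)  # too large: leave out of the tarball, but record it
                continue
            tar.add(path, arcname=rel)
    metadata = {"sample_id": sample_id, "paper_id": paper_id, "excluded_files": excluded}
    (output_dir / f"{sample_id}.json").write_text(json.dumps(metadata, indent=2))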

Per-Sample Metadata Format

Task 1 outputs a JSON file for each sample, which Task 2 reads via load_manifest_dataset:

submissions/
  abc123.json
  abc123.tar.gz
  def456.json
  def456.tar.gz

Each JSON file contains a SubmissionMetadata object:

{
  "sample_id": "abc123",
  "paper_id": "arxiv_2301_001",
  "excluded_files": ["large_model.bin", "data/huge.csv"]
}

  • excluded_files: Files larger than 10MB that were excluded from the tarball

This layout is concurrency-safe (each sample writes its own files, so there is no shared manifest to contend over) and retry-safe (a retried sample overwrites its own files instead of appending duplicate entries).
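
A sketch of how Task 2 can consume this layout, assuming only the directory convention above (one <sample_id>.json next to one <sample_id>.tar.gz). The function names mirror the components listed earlier, but the bodies are illustrative rather than the actual framework integration:

import json
import tarfile
from pathlib import Path

def load_manifest_dataset(submissions_dir: str) -> list[dict]:
    """Build one sample per JSON metadata file found in the submissions directory."""
    samples = []
    for manifest in sorted(Path(submissions_dir).glob("*.json")):
        metadata = json.loads(manifest.read_text())
        metadata["tarball"] = str(manifest.with_suffix(".tar.gz"))  # abc123.json -> abc123.tar.gz
        samples.append(metadata)
    return samples

def load_submission(tarball: str, dest_dir: str) -> None:
    """Unpack one exported submission into the fresh sandbox's working directory."""
    with tarfile.open(tarball, "r:gz") as tar:
        tar.extractall(dest_dir)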

Implementation Status